Platform

Health, resources, logs and alerts for every platform service — one view.

iom-cluster ⚠ Outage

Spark cluster service · namespace iomete-system

Overview

Metrics

Logs

Resources

Alerts

Status: Outage, restarting

Restarts (24h): 7

Image: iom-cluster:4.2.1

Uptime: 92.4%

Replicas: 1 / 1 ready

Memory: 3.6 GiB / 4.0 GiB

CPU: 1.7 / 2.0 cores

Active alerts: 2 firing

Recent events

14:32:08 · RESTART · Back-off restarting failed container (OOMKilled)

14:31:50 · SATURATION · Memory usage exceeded 90% of limit for 60s

14:28:11 · LATENCY · spark/jobs/{id}/runs p99 latency above 1500ms

14:19:02 · SCALE · Resources updated: memory limit 2.0 → 4.0 GiB (by abhishek)

CPU usage ⓘ

85.2% / 100% · 1.7 / 2.0 cores

Memory ⓘ

90.1% / 100% · 3.6 / 4.0 GiB

Requests / sec ⓘ

129.8 req/s peak

Restarts (24h) ⓘ

7 restarts

Throughput

spark app submissions · last 14d (Prometheus)

API latency

spark/jobs/{id}/runs · P50 / P90 / P99

Streaming from Kubernetes · iom-cluster-0

ℹ Live tail reads directly from the Kubernetes API. Historical search across restarts (Loki) — coming soon.

⚠ Saving changes will restart the pod. The new limits are stored in the platform config DB and re-applied on every deploy.

Last changed 14:19 today by abhishek · memory limit 2000 → 4000 MiB. Full history in Audit Logs.

Alert rules · iom-cluster

Rule	Signal	Threshold	Channel	State
High restart count	restarts (24h)	> 3	platform-oncall@	Firing
Memory saturation	mem / limit	> 90% · 60s	platform-oncall@	Firing
API latency p99	p99 latency	> 1500ms	platform-oncall@	OK
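
A threshold such as "> 90% · 60s" reads as "above 90% continuously for at least 60 seconds". The sketch below shows one way such a sustained-threshold check could be evaluated over timestamped samples. The names `SustainedRule` and `is_firing` are hypothetical and are not taken from the platform service; the 15-second sample spacing in the usage example is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SustainedRule:
    """Hypothetical rule shape: fire when a signal stays above a threshold."""
    name: str
    threshold: float      # breach when value is strictly above this
    hold_seconds: float   # breach must last at least this long


def is_firing(rule: SustainedRule, samples: Iterable[Tuple[float, float]]) -> bool:
    """Return the rule state as of the last sample.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    firing = False
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts          # first sample of a new breach
            firing = (ts - breach_start) >= rule.hold_seconds
        else:
            breach_start = None            # any dip resets the hold timer
            firing = False
    return firing


# Usage: memory at 91-93% of the limit for a full 60 seconds.
rule = SustainedRule("Memory saturation", threshold=0.90, hold_seconds=60)
steady = [(0, 0.91), (15, 0.92), (30, 0.93), (45, 0.91), (60, 0.92)]
dipped = [(0, 0.91), (15, 0.92), (30, 0.89), (45, 0.91), (60, 0.92)]
print(is_firing(rule, steady))  # True
print(is_firing(rule, dipped))  # False
```

Resetting the timer on any dip keeps a briefly recovering service from paging; whether the real evaluator does this is not stated in the source.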

Alerts

Rules we own and ship — evaluated by the platform service, not customer Grafana.

Rules

Rule	Signal	Target	Threshold	Channel	State	Last fired
High restart count	restarts (24h)	iom-cluster	> 3	platform-oncall@	Firing	just now
Memory saturation	mem / limit	iom-cluster	> 90% · 60s	platform-oncall@	Firing	2m ago
Submission spike	spark submits/min	all services	> 3× 7d avg	platform-oncall@	Pending	18m ago
API latency p99	p99 latency	iom-core	> 1500ms	#platform-alerts	OK	3h ago
Typesense disk	disk used	typesense	> 85%	#platform-alerts	OK	1d ago
Pod not ready	ready replicas	all services	< desired · 2m	platform-oncall@	OK	2d ago
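
The "Submission spike" rule compares the current rate against a multiple of a trailing average. A minimal sketch of that comparison follows, assuming the baseline is a plain arithmetic mean of per-minute samples from the trailing window; the function name `is_spike` and the way history is supplied are illustrative assumptions, not the platform's implementation.

```python
from statistics import fmean
from typing import Sequence


def is_spike(current_per_min: float,
             history_per_min: Sequence[float],
             multiplier: float = 3.0) -> bool:
    """True when the current rate is strictly above multiplier x the baseline.

    history_per_min: per-minute submit counts from the trailing window
    (7 days in the rule above). With no history there is no baseline,
    so the rule cannot breach.
    """
    if not history_per_min:
        return False
    baseline = fmean(history_per_min)
    return current_per_min > multiplier * baseline


# Usage with a flat baseline of 10 submits/min.
print(is_spike(31, [10, 10, 10]))  # True
print(is_spike(30, [10, 10, 10]))  # False, exactly 3x is not above 3x
```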

Attribution

Who is driving API load — queried live from the platform_event_logs Iceberg table.

Time range

Group by

Endpoint

Method

Status

Run on

runs as a Spark query on the selected cluster

Generated SQL · composed from the filters above

By user

Principal	Requests	% of total	Success	Failed	Last seen

⚠ Today platform_event_logs captures user_id · service · action · success · occurred_at. Latency percentiles (P90/P99), internal-vs-external caller, and team/domain need schema enrichment before they can be attributed (see plan).
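
As an illustration of how the generated SQL might be composed from the filters, the sketch below builds a grouped query over the columns listed above (user_id, service, action, success, occurred_at). The function `build_attribution_sql`, the output column aliases, and the assumption that `success` is a boolean column are all hypothetical. The "% of total" column is left out and could be derived from the request counts after the query returns.

```python
from datetime import datetime
from typing import Optional

# Only columns the table is stated to capture today may be grouped on.
GROUPABLE = {"user_id", "service", "action"}


def build_attribution_sql(group_by: str,
                          start: datetime,
                          end: datetime,
                          service: Optional[str] = None) -> str:
    """Compose a Spark-style SQL string for the attribution table."""
    if group_by not in GROUPABLE:
        raise ValueError(f"cannot group by {group_by!r}")

    where = [
        f"occurred_at >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}'",
        f"occurred_at < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'",
    ]
    if service is not None:
        escaped = service.replace("'", "''")   # escape single quotes
        where.append(f"service = '{escaped}'")

    return (
        f"SELECT {group_by} AS principal,\n"
        "       COUNT(*) AS requests,\n"
        # Assumes success is a boolean column.
        "       SUM(CASE WHEN success THEN 1 ELSE 0 END) AS succeeded,\n"
        "       SUM(CASE WHEN success THEN 0 ELSE 1 END) AS failed,\n"
        "       MAX(occurred_at) AS last_seen\n"
        "FROM platform_event_logs\n"
        f"WHERE {' AND '.join(where)}\n"
        f"GROUP BY {group_by}\n"
        "ORDER BY requests DESC"
    )


# Usage: requests per user for one day, limited to one service.
print(build_attribution_sql("user_id",
                            datetime(2024, 1, 1),
                            datetime(2024, 1, 2),
                            service="iom-cluster"))
```

Restricting `group_by` to a fixed allow-list keeps the composed query from accepting arbitrary column names, and it mirrors the note above that team and domain grouping are not yet available.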

Guardrails

Caps per user, team, domain, or endpoint — a safety ceiling so one principal can't degrade the platform. Enforced at the gateway, backed by the ratelimiter service.

Rate limits

Service

Limit	Service	Scope	Applies to	Cap	Action	Mode	State	Peak (1h)

⚠ Caps are only defensible once load testing establishes "with X resources we support Y req/min" — that's the hard prerequisite. user_id and endpoint scoping work against platform_event_logs today; team / domain / caller-type scoping needs the same schema enrichment as Attribution. Treat Monitor mode as the rollout path: watch would-be breaches before enforcing.
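
To make the Monitor-versus-enforce distinction concrete, here is a small sliding-window sketch. It is not the ratelimiter service's actual algorithm, which the source does not describe; the class `Guardrail`, its fields, and the 60-second window are assumptions for illustration. In monitor mode a request over the cap is counted as a would-be breach but still allowed; in enforce mode it is rejected.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Guardrail:
    """Hypothetical per-principal cap over a sliding time window."""
    cap: int                       # max requests per window per principal
    window_seconds: float = 60.0
    mode: str = "monitor"          # "monitor" or "enforce"
    would_be_breaches: int = 0     # requests that exceeded the cap
    _hits: dict = field(default_factory=lambda: defaultdict(deque))

    def allow(self, principal: str, now: float) -> bool:
        hits = self._hits[principal]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.cap:
            self.would_be_breaches += 1
            if self.mode == "enforce":
                return False       # rejected requests are not recorded
        hits.append(now)
        return True


# Usage: cap of 2 requests per minute for one user.
monitor = Guardrail(cap=2)
print([monitor.allow("user-1", t) for t in (0, 1, 2)])   # [True, True, True]
print(monitor.would_be_breaches)                         # 1

enforce = Guardrail(cap=2, mode="enforce")
print([enforce.allow("user-1", t) for t in (0, 1, 2)])   # [True, True, False]
```

Running in monitor mode first yields a breach count per principal without rejecting traffic, which matches the rollout path described above.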
