Platform

Health, resources, logs and alerts for every platform service — one view.

iom-cluster ⚠ Outage

Spark cluster service · namespace iomete-system
Overview · Metrics · Logs · Resources · Alerts
Status: Outage, restarting
Restarts (24h): 7
Image: iom-cluster:4.2.1
Uptime: 92.4%
Replicas: 1 / 1 ready
Memory: 3.6 GiB / 4.0 GiB
CPU: 1.7 / 2.0 cores
Active alerts: 2 firing

Recent events

14:32:08  RESTART     Back-off restarting failed container (OOMKilled)
14:31:50  SATURATION  Memory usage exceeded 90% of limit for 60s
14:28:11  LATENCY     spark/jobs/{id}/runs p99 latency above 1500ms
14:19:02  SCALE       Resources updated: memory limit 2.0 → 4.0 GiB (by abhishek)

System metrics

CPU usage: 85.2% / 100%  ·  1.7 / 2.0 cores
Memory: 90.1% / 100%  ·  3.6 / 4.0 GiB
Requests / sec: 129.8 req/s peak
Restarts (24h): 7 restarts

Throughput

spark app submissions · last 14d (Prometheus)

API latency

spark/jobs/{id}/runs · P50 / P90 / P99
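
For readers unfamiliar with the P50 / P90 / P99 series, here is a minimal sketch of how such percentiles can be computed from raw request latencies. It uses the nearest-rank definition; the dashboard's actual method (for example, histogram-based estimation in Prometheus) is not stated in the source, and the `percentile` helper and sample values are hypothetical.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds (not taken from the dashboard).
latencies_ms = [120, 135, 150, 180, 210, 260, 340, 480, 900, 1700]
summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(summary)  # {'P50': 210, 'P90': 900, 'P99': 1700}
```

With these samples the median looks healthy while P99 sits above 1500ms, which is why the latency alert is defined on P99 rather than on the average.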
streaming from Kubernetes · iom-cluster-0
ℹ Live tail reads directly from the Kubernetes API. Historical search across restarts (Loki) — coming soon.
⚠ Saving changes will restart the pod. The new limits are stored in the platform config DB and re-applied on every deploy.
CPU values are set in mCPU; memory values are set in MiB.
Number of pods for this service.
Disabled
Last changed 14:19 today by abhishek · memory limit 2000 → 4000 MiB. Full history in Audit Logs.
Alert rules · iom-cluster
Rule | Signal | Threshold | Channel | State
High restart count | restarts (24h) | > 3 | platform-oncall@ | Firing
Memory saturation | mem / limit | > 90% · 60s | platform-oncall@ | Firing
API latency p99 | p99 latency | > 1500ms | platform-oncall@ | OK

Alerts

Rules we own and ship — evaluated by the platform service, not customer Grafana.
Rules
Rule | Signal | Target | Threshold | Channel | State | Last fired
High restart count | restarts (24h) | iom-cluster | > 3 | platform-oncall@ | Firing | just now
Memory saturation | mem / limit | iom-cluster | > 90% · 60s | platform-oncall@ | Firing | 2m ago
Submission spike | spark submits/min | all services | > 3× 7d avg | platform-oncall@ | Pending | 18m ago
API latency p99 | p99 latency | iom-core | > 1500ms | #platform-alerts | OK | 3h ago
Typesense disk | disk used | typesense | > 85% | #platform-alerts | OK | 1d ago
Pod not ready | ready replicas | all services | < desired · 2m | platform-oncall@ | OK | 2d ago
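
The thresholds above combine a comparison with an optional hold time (for example "> 90% · 60s"). The following is a minimal sketch of how such a rule could be evaluated, not the platform service's actual evaluator. It assumes that Pending means "threshold breached, hold time not yet elapsed" (the convention Prometheus uses); the source does not define the states. `Rule`, `evaluate`, and `is_submission_spike` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rule:
    name: str
    threshold: float
    for_seconds: int = 0  # how long the breach must persist before the rule fires


def evaluate(rule, samples, now):
    """samples: list of (unix_seconds, value), oldest first. Returns 'OK', 'Pending' or 'Firing'."""
    if not samples or samples[-1][1] <= rule.threshold:
        return "OK"
    # Walk backwards to find where the current uninterrupted breach started.
    breach_start = samples[-1][0]
    for ts, value in reversed(samples):
        if value <= rule.threshold:
            break
        breach_start = ts
    return "Firing" if now - breach_start >= rule.for_seconds else "Pending"


def is_submission_spike(current_per_min, history_per_min, factor=3.0):
    """True when the current rate exceeds factor times the historical average (assumed 7d window)."""
    return current_per_min > factor * mean(history_per_min)


memory = Rule("Memory saturation", threshold=0.90, for_seconds=60)
samples = [(0, 0.85), (15, 0.91), (30, 0.93), (45, 0.94), (60, 0.95), (75, 0.96)]
print(evaluate(memory, samples[:4], now=45))  # Pending: breach began at t=15, 30s ago
print(evaluate(memory, samples, now=75))      # Firing: breach has lasted 60s
```

A rule without a hold time, such as "restarts (24h) > 3", uses `for_seconds=0` and fires as soon as the latest sample exceeds the threshold.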

Attribution

Who is driving API load — queried live from the platform_event_logs Iceberg table.
runs as a Spark query on the selected cluster
Generated SQL · composed from the filters above

        
By user
Principal | Requests | % of total | Success | Failed | Last seen
⚠ Today platform_event_logs captures user_id · service · action · success · occurred_at. Latency percentiles (P90/P99), internal-vs-external caller, and team/domain need schema enrichment before they can be attributed (see plan).
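
The generated SQL is not shown on this page. As an illustration only, a query producing the By user columns from the fields listed above might be composed as in the sketch below. Everything beyond the table name and the five column names is an assumption: `build_attribution_sql` is a hypothetical helper, `success` is assumed to be a boolean column, the table is referenced without a catalog or namespace, and the query has not been run against the real table.

```python
import re

# Assumption: filter values are simple names. Anything else is rejected
# rather than escaped, so no unchecked input is spliced into the SQL text.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_.-]+")


def build_attribution_sql(service=None, action=None, lookback_hours=24, limit=50):
    """Compose a per-user attribution query (Spark SQL dialect) from optional filters."""
    conditions = [
        f"occurred_at >= current_timestamp() - INTERVAL {int(lookback_hours)} HOURS"
    ]
    for column, value in (("service", service), ("action", action)):
        if value is None:
            continue
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsupported characters in {column} filter: {value!r}")
        conditions.append(f"{column} = '{value}'")
    where_clause = "\n  AND ".join(conditions)
    return (
        "SELECT user_id AS principal,\n"
        "       COUNT(*) AS requests,\n"
        "       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_total,\n"
        "       SUM(CASE WHEN success THEN 1 ELSE 0 END) AS succeeded,\n"
        "       SUM(CASE WHEN success THEN 0 ELSE 1 END) AS failed,\n"
        "       MAX(occurred_at) AS last_seen\n"
        "FROM platform_event_logs\n"
        f"WHERE {where_clause}\n"
        "GROUP BY user_id\n"
        "ORDER BY requests DESC\n"
        f"LIMIT {int(limit)}"
    )


print(build_attribution_sql(service="iom-core"))
```

Only `user_id`, `service`, `action`, `success`, and `occurred_at` are used, so the sketch stays within what the table captures today; latency percentiles and team or domain breakdowns would need the schema enrichment noted above.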

Guardrails

Caps per user, team, domain, or endpoint — a safety ceiling so one principal can't degrade the platform. Enforced at the gateway, backed by the ratelimiter service.
Rate limits
Limit | Service | Scope | Applies to | Cap | Action | Mode | State | Peak (1h)
⚠ Caps are only defensible once load testing establishes "with X resources we support Y req/min" — that's the hard prerequisite. user_id and endpoint scoping work against platform_event_logs today; team / domain / caller-type scoping needs the same schema enrichment as Attribution. Treat Monitor mode as the rollout path: watch would-be breaches before enforcing.
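
To make the Monitor-then-enforce rollout concrete, here is a minimal sketch of a per-principal cap with both modes. It is not the ratelimiter service's algorithm, which the source does not describe; it assumes a fixed one-minute window, and `RateLimit`, `allow`, and `would_be_breaches` are hypothetical names.

```python
from collections import defaultdict


class RateLimit:
    """Fixed-window request cap per principal, in 'monitor' or 'enforce' mode."""

    def __init__(self, cap_per_minute, mode="monitor"):
        if mode not in ("monitor", "enforce"):
            raise ValueError("mode must be 'monitor' or 'enforce'")
        self.cap = cap_per_minute
        self.mode = mode
        self.counts = defaultdict(int)  # (principal, minute window) -> requests seen
        self.would_be_breaches = []     # recorded in both modes for review

    def allow(self, principal, now_seconds):
        """Count the request and return whether it may proceed."""
        window = int(now_seconds // 60)
        key = (principal, window)
        self.counts[key] += 1
        if self.counts[key] <= self.cap:
            return True
        self.would_be_breaches.append((principal, window, self.counts[key]))
        # Monitor mode only records the breach; enforce mode rejects the request.
        return self.mode == "monitor"


monitor = RateLimit(cap_per_minute=3, mode="monitor")
enforce = RateLimit(cap_per_minute=3, mode="enforce")
for second in range(5):
    print(monitor.allow("abhishek", second), enforce.allow("abhishek", second))
```

In Monitor mode every request passes and the breaches list shows what enforcement would have blocked, which is the data needed before switching a limit to enforce. The sketch never evicts old windows from `counts`; a real limiter would expire them.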