Platform
Health, resources, logs and alerts for every platform service — one view.
Refresh rate
iom-cluster ⚠ Outage
Spark cluster service · namespace iomete-system
Overview
Metrics
Logs
Resources
Alerts
● Status: Outage, restarting
↻ Restarts (24h): 7
☷ Image: iom-cluster:4.2.1
↑ Uptime: 92.4%
◫ Replicas: 1 / 1 ready
💾 Memory: 3.6 GiB / 4.0 GiB
⚡ CPU: 1.7 / 2.0 cores
🔔 Active alerts: 2 firing
Recent events
14:32:08 · RESTART · Back-off restarting failed container (OOMKilled)
14:31:50 · SATURATION · Memory usage exceeded 90% of limit for 60s
14:28:11 · LATENCY · spark/jobs/{id}/runs p99 latency above 1500ms
14:19:02 · SCALE · Resources updated: memory limit 2.0 → 4.0 GiB (by abhishek)
CPU usage ⓘ
85.2% / 100% · 1.7 / 2.0 cores
Memory ⓘ
90.1% / 100% · 3.6 / 4.0 GiB
Requests / sec ⓘ
129.8 req/s peak
Restarts (24h) ⓘ
7 restarts
Throughput
spark app submissions · last 14d (Prometheus)
API latency
spark/jobs/{id}/runs · P50 / P90 / P99
streaming from kubernetes · iom-cluster-0
ℹ Live tail reads directly from the Kubernetes API. Historical search across restarts (Loki) — coming soon.
⚠ Saving changes will restart the pod. The new limits are stored in the platform config DB and re-applied on every deploy.
Last changed 14:19 today by abhishek · memory limit 2.0 → 4.0 GiB. Full history in Audit Logs.
Alert rules · iom-cluster
| Rule | Signal | Threshold | Channel | State |
|---|---|---|---|---|
| High restart count | restarts (24h) | > 3 | platform-oncall@ | Firing |
| Memory saturation | mem / limit | > 90% · 60s | platform-oncall@ | Firing |
| API latency p99 | p99 latency | > 1500ms | platform-oncall@ | OK |
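A threshold such as "> 90% · 60s" reads as "the signal stays above 90% for at least 60 seconds". The following is a minimal sketch of that kind of sustained-threshold check, not the platform service's actual evaluator. `SustainedRule`, `is_firing`, and the sample format are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SustainedRule:
    """Hypothetical rule shape: fire when the signal stays above a threshold."""
    name: str
    threshold: float     # fire when the value is strictly above this
    hold_seconds: float  # ...continuously for at least this long (0 = immediately)


def is_firing(rule: SustainedRule, samples: Iterable[Tuple[float, float]]) -> bool:
    """Return the rule state as of the last sample.

    `samples` is an assumed input format: (unix_seconds, value) pairs
    in ascending time order.
    """
    breach_start: Optional[float] = None
    firing = False
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            firing = (ts - breach_start) >= rule.hold_seconds
        else:
            breach_start = None    # any dip below the threshold resets the clock
            firing = False
    return firing


memory_rule = SustainedRule("Memory saturation", threshold=0.90, hold_seconds=60)
restart_rule = SustainedRule("High restart count", threshold=3, hold_seconds=0)
```

With this shape, the restart rule fires as soon as the 24h count exceeds 3, while the memory rule needs a full minute of uninterrupted saturation.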
Alerts
Rules we own and ship — evaluated by the platform service, not customer Grafana.
Rules
| Rule | Signal | Target | Threshold | Channel | State | Last fired |
|---|---|---|---|---|---|---|
| High restart count | restarts (24h) | iom-cluster | > 3 | platform-oncall@ | Firing | just now |
| Memory saturation | mem / limit | iom-cluster | > 90% · 60s | platform-oncall@ | Firing | 2m ago |
| Submission spike → | spark submits/min | all services | > 3× 7d avg | platform-oncall@ | Pending | 18m ago |
| API latency p99 | p99 latency | iom-core | > 1500ms | #platform-alerts | OK | 3h ago |
| Typesense disk | disk used | typesense | > 85% | #platform-alerts | OK | 1d ago |
| Pod not ready | ready replicas | all services | < desired · 2m | platform-oncall@ | OK | 2d ago |
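The "Submission spike" rule differs from the fixed thresholds above because its limit is relative to a trailing baseline ("> 3× 7d avg"). A rough sketch of that comparison follows, assuming the baseline is a plain mean of per-minute submission rates over the last seven days. The function name and input format are hypothetical, and the real rule may compute its average differently.

```python
from statistics import fmean
from typing import Sequence


def is_spike(current_per_min: float,
             history_per_min: Sequence[float],
             multiplier: float = 3.0) -> bool:
    """Return True when the current rate exceeds `multiplier` times the baseline.

    `history_per_min` is assumed to hold submits/min samples from the
    trailing 7 days. With no history there is no baseline, so nothing fires.
    """
    if not history_per_min:
        return False
    baseline = fmean(history_per_min)
    # Strict comparison, matching the ">" in the rule table.
    # Note: a zero baseline makes any positive rate count as a spike.
    return current_per_min > multiplier * baseline
```

A production version would likely want a minimum-baseline floor so that a quiet week does not turn the first few submissions into an alert.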
Attribution
Who is driving API load — queried live from the platform_event_logs Iceberg table.
Generated SQL · composed from the filters above
By user
| Principal | Requests | % of total | Success | Failed | Last seen |
|---|---|---|---|---|---|
⚠ Today platform_event_logs captures user_id · service · action · success · occurred_at. Latency percentiles (P90/P99), internal-vs-external caller, and team/domain need schema enrichment before they can be attributed (see plan).
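As an illustration of how the "By user" table could be composed from filters using only the columns captured today, here is a hedged sketch. The function name, the `:name` placeholder style, and the assumption that `success` is a boolean column are all hypothetical; the actual generated SQL and the query engine's parameter syntax may differ.

```python
from typing import Dict, Optional, Tuple


def build_by_user_sql(since: str,
                      service: Optional[str] = None,
                      action: Optional[str] = None,
                      limit: int = 20) -> Tuple[str, Dict[str, str]]:
    """Compose a parameterized per-user aggregation over platform_event_logs.

    Returns (sql, params). Filter values travel as named parameters rather
    than being interpolated into the SQL text.
    """
    where = ["occurred_at >= :since"]
    params: Dict[str, str] = {"since": since}
    if service is not None:
        where.append("service = :service")
        params["service"] = service
    if action is not None:
        where.append("action = :action")
        params["action"] = action

    sql = f"""
SELECT user_id AS principal,
       COUNT(*) AS requests,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_total,
       SUM(CASE WHEN success THEN 1 ELSE 0 END) AS succeeded,
       SUM(CASE WHEN NOT success THEN 1 ELSE 0 END) AS failed,
       MAX(occurred_at) AS last_seen
FROM platform_event_logs
WHERE {' AND '.join(where)}
GROUP BY user_id
ORDER BY requests DESC
LIMIT {int(limit)}
""".strip()
    return sql, params


sql, params = build_by_user_sql(since="2024-01-01 00:00:00", service="iom-cluster")
```

Latency percentiles and team or domain breakdowns are deliberately absent here, since the note above says those fields are not yet in the schema.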
Guardrails
Caps per user, team, domain, or endpoint — a safety ceiling so one principal can't degrade the platform. Enforced at the gateway, backed by the ratelimiter service.
Rate limits
| Limit | Service | Scope | Applies to | Cap | Action | Mode | State | Peak (1h) |
|---|---|---|---|---|---|---|---|---|
⚠ Caps are only defensible once load testing establishes "with X resources we support Y req/min" — that's the hard prerequisite. user_id and endpoint scoping work against platform_event_logs today; team / domain / caller-type scoping needs the same schema enrichment as Attribution. Treat Monitor mode as the rollout path: watch would-be breaches before enforcing.
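The difference between Monitor and Enforce mode can be sketched with a simple per-principal fixed-window counter. This is an illustrative stand-in, not the ratelimiter service's implementation: the class names, the one-minute window, and the fixed-window algorithm itself are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Tuple


@dataclass(frozen=True)
class RateLimit:
    """Hypothetical limit definition: a per-minute cap plus a rollout mode."""
    cap_per_min: int
    mode: str = "monitor"  # "monitor" records breaches; "enforce" rejects them


class FixedWindowLimiter:
    """Counts requests per (principal, minute) window. Sketch only:
    old windows are never evicted, so memory grows over time."""

    def __init__(self, limit: RateLimit) -> None:
        self.limit = limit
        self.counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)
        self.would_be_breaches: DefaultDict[str, int] = defaultdict(int)

    def allow(self, principal: str, now_seconds: float) -> bool:
        window = int(now_seconds // 60)
        key = (principal, window)
        self.counts[key] += 1
        if self.counts[key] <= self.limit.cap_per_min:
            return True
        # Over the cap: always record it, but only reject in enforce mode.
        self.would_be_breaches[principal] += 1
        return self.limit.mode != "enforce"


monitor = FixedWindowLimiter(RateLimit(cap_per_min=2, mode="monitor"))
enforce = FixedWindowLimiter(RateLimit(cap_per_min=2, mode="enforce"))
```

Running a cap in monitor mode first yields a `would_be_breaches` count per principal, which is the evidence needed to judge whether the cap is set sensibly before switching it to enforce.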