Observability System SLOs
Service Level Objectives
The observability system defines the following SLOs to ensure production readiness:
Ingest SLO
- Target: 99% of ingest requests complete in < 200ms
- Measurement: Time from HTTP request received to 202 response sent
- Monitoring: aiosx_observability_db_write_latency_seconds histogram
Query SLO
- Target: 99% of query requests complete in < 700ms
- Measurement: Time from query request to response returned
- Monitoring: Query endpoint response time metrics
Write SLO
- Target: 99% of events written to DB within 5 seconds of arrival
- Measurement: Time from event ingestion to successful DB write
- Monitoring: aiosx_observability_db_write_latency_seconds histogram
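As an illustration of how the three targets above could be checked against raw latency samples, here is a minimal sketch. The `Slo` class, the `SLOS` table, and the helper names are hypothetical and not part of the observability system. The sketch also assumes that "< 200ms" and "< 700ms" are strict bounds while "within 5 seconds" is inclusive.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for one latency SLO."""
    name: str
    threshold_seconds: float  # latency bound a request must meet to count as good
    target_ratio: float       # required fraction of good requests
    inclusive: bool = False   # True if a latency equal to the bound still counts


# Targets copied from the SLO definitions above.
SLOS = {
    "ingest": Slo("ingest", 0.200, 0.99),
    "query": Slo("query", 0.700, 0.99),
    "write": Slo("write", 5.0, 0.99, inclusive=True),
}


def good_ratio(slo: Slo, latencies: Sequence[float]) -> float:
    """Fraction of samples that meet the SLO bound (1.0 when there are no samples)."""
    if not latencies:
        return 1.0
    if slo.inclusive:
        good = sum(1 for x in latencies if x <= slo.threshold_seconds)
    else:
        good = sum(1 for x in latencies if x < slo.threshold_seconds)
    return good / len(latencies)


def meets_slo(slo: Slo, latencies: Sequence[float]) -> bool:
    """True when the share of good requests reaches the SLO target."""
    return good_ratio(slo, latencies) >= slo.target_ratio
```

For example, `meets_slo(SLOS["ingest"], samples)` returns `True` when at least 99% of the values in `samples` (in seconds) are below 0.2.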
SLO Validation
Unit Tests
- Schema validation tests
- Normalizer tests with various event types
- Redaction tests with different modes
Integration Tests
- Docker Compose test environment (Postgres + Kafka)
- End-to-end event flow tests
- Multi-tenant isolation tests
Load Tests
- Scenario: 2k events/sec for 10 minutes
- Metrics:
  - Ingest latency (p50, p95, p99)
  - Query latency (p50, p95, p99)
  - Error rate
  - Throughput
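The load-test metrics listed above can be derived from per-request samples roughly as follows. This is a sketch under stated assumptions: `percentile` and `summarize` are hypothetical helpers, percentiles use the nearest-rank method, and throughput counts both successful and failed requests. The actual load-test tooling may compute these differently.

```python
import math
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty, ascending sequence."""
    if not sorted_values:
        raise ValueError("no samples")
    # Rank is 1-based; clamp to 1 so that pct=0 returns the smallest sample.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies: Sequence[float], errors: int,
              duration_seconds: float) -> Dict[str, float]:
    """Summarize one load-test run.

    latencies: latencies (seconds) of successful requests
    errors: number of failed requests
    duration_seconds: wall-clock length of the run
    """
    ordered = sorted(latencies)
    total = len(ordered) + errors
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "error_rate": errors / total,
        "throughput_per_sec": total / duration_seconds,
    }
```

For the scenario above (2k events/sec for 10 minutes), `duration_seconds` would be 600 and a healthy run would report a throughput close to 2000.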
Monitoring
Key Metrics
- aiosx_observability_events_ingested_total - Total events ingested
- aiosx_observability_events_failed_total - Total failed events
- aiosx_observability_db_write_latency_seconds - DB write latency
- aiosx_observability_db_write_retry_total - Retry attempts
- aiosx_observability_dlq_events_total - Events in DLQ
Alerts
- SLO breach: > 1% of requests exceed SLO targets
- High error rate: > 1% of events fail
- DLQ growth: DLQ size > 1000 events
- High latency: p99 latency > 2x SLO target
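The four alert conditions above can be expressed as simple threshold checks. The sketch below is illustrative only: the `Snapshot` type, its field names, and the alert labels are hypothetical, and a real deployment would more likely encode these rules in its alerting system than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time view of the values the alerts depend on."""
    slow_request_ratio: float     # fraction of requests exceeding the SLO latency bound
    failed_event_ratio: float     # fraction of events that failed
    dlq_size: int                 # number of events currently in the DLQ
    p99_latency_seconds: float    # observed p99 latency
    slo_threshold_seconds: float  # latency bound of the SLO being checked


def firing_alerts(s: Snapshot) -> List[str]:
    """Return the names of all alert conditions that hold for this snapshot."""
    alerts = []
    if s.slow_request_ratio > 0.01:   # SLO breach: > 1% of requests exceed target
        alerts.append("slo_breach")
    if s.failed_event_ratio > 0.01:   # High error rate: > 1% of events fail
        alerts.append("high_error_rate")
    if s.dlq_size > 1000:             # DLQ growth: DLQ size > 1000 events
        alerts.append("dlq_growth")
    if s.p99_latency_seconds > 2 * s.slo_threshold_seconds:  # p99 > 2x SLO target
        alerts.append("high_latency")
    return alerts
```

All comparisons are strict, matching the ">" in the alert definitions, so a DLQ of exactly 1000 events does not fire.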