AIOS DNA

Observability System SLOs

Service Level Objectives

The observability system defines the following SLOs to ensure production readiness:

Ingest SLO

  • Target: 99% of ingest requests complete in < 200 ms
  • Measurement: Time from HTTP request received to 202 response sent
  • Monitoring: aiosx_observability_db_write_latency_seconds histogram

Query SLO

  • Target: 99% of query requests complete in < 700 ms
  • Measurement: Time from query request to response returned
  • Monitoring: Query endpoint response time metrics

Write SLO

  • Target: 99% of events written to DB within 5 seconds of arrival
  • Measurement: Time from event ingestion to successful DB write
  • Monitoring: aiosx_observability_db_write_latency_seconds histogram
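A minimal sketch of how the three targets above could be checked against cumulative histogram buckets of the kind a Prometheus-style histogram exposes. The names `Slo`, `SLOS`, `ratio_within`, and `meets_slo` are hypothetical and not part of the observability system. The sketch assumes each SLO threshold coincides with a bucket upper bound, and because buckets count observations at or below the bound, it approximates the strict "<" in the targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    threshold_seconds: float  # latency bound taken from the SLO target
    target_ratio: float       # fraction of requests that must meet the bound


# Targets copied from the Ingest, Query, and Write SLOs.
SLOS = {
    "ingest": Slo("ingest", 0.2, 0.99),
    "query": Slo("query", 0.7, 0.99),
    "write": Slo("write", 5.0, 0.99),
}


def ratio_within(buckets: dict[float, int], threshold: float) -> float:
    """Return the fraction of observations at or below `threshold`.

    `buckets` maps each cumulative upper bound to its count. The
    float("inf") bucket must be present and holds the total count.
    Assumption: `threshold` is itself a bucket bound, so no
    interpolation between buckets is needed.
    """
    total = buckets[float("inf")]
    if total == 0:
        return 1.0  # no traffic in the window: treated as compliant
    return buckets[threshold] / total


def meets_slo(slo: Slo, buckets: dict[float, int]) -> bool:
    """True when the observed ratio reaches the SLO's target ratio."""
    return ratio_within(buckets, slo.threshold_seconds) >= slo.target_ratio
```

For example, with 995 of 1,000 ingest requests at or below 0.2 s, `meets_slo(SLOS["ingest"], {0.2: 995, float("inf"): 1000})` evaluates to `True`.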

SLO Validation

Unit Tests

  • Schema validation tests
  • Normalizer tests with various event types
  • Redaction tests with different modes

Integration Tests

  • Docker Compose test environment (Postgres + Kafka)
  • End-to-end event flow tests
  • Multi-tenant isolation tests

Load Tests

  • Scenario: sustained 2,000 events/sec for 10 minutes
  • Metrics:
    • Ingest latency (p50, p95, p99)
    • Query latency (p50, p95, p99)
    • Error rate
    • Throughput
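At 2,000 events/sec, a 10-minute run produces 2,000 × 600 = 1,200,000 events. The sketch below shows one way the listed metrics could be derived from raw load-test samples. The helper names `percentile` and `summarize` are hypothetical, the percentile uses the nearest-rank method (one of several common definitions), and throughput is assumed to mean successful events per second.

```python
import math


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a list sorted in ascending order."""
    if not sorted_values:
        raise ValueError("no samples")
    # Rank is 1-based: the smallest value has rank 1.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s: list[float], failed: int, duration_s: float) -> dict[str, float]:
    """Summarize one load-test run.

    `latencies_s` holds the latency of each successful request in seconds,
    `failed` is the number of failed requests, and `duration_s` is the
    length of the run in seconds.
    """
    ordered = sorted(latencies_s)
    total = len(ordered) + failed
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "error_rate": failed / total,
        "throughput": len(ordered) / duration_s,  # successful events/sec
    }
```

Running `summarize` once on ingest samples and once on query samples yields the two latency rows in the list above. The ingest p99 can then be compared with the 200 ms target and the query p99 with the 700 ms target.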

Monitoring

Key Metrics

  • aiosx_observability_events_ingested_total - Total events ingested
  • aiosx_observability_events_failed_total - Total events that failed
  • aiosx_observability_db_write_latency_seconds - DB write latency, in seconds
  • aiosx_observability_db_write_retry_total - Total DB write retry attempts
  • aiosx_observability_dlq_events_total - Total events in the dead-letter queue (DLQ)

Alerts

  • SLO breach: > 1% of requests exceed the latency target of their SLO
  • High error rate: > 1% of events fail
  • DLQ growth: DLQ size > 1000 events
  • High latency: p99 latency > 2x the SLO target
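The four alert conditions can be expressed as simple threshold checks. The sketch below is illustrative only: `Snapshot`, its fields, the alert names, and `firing_alerts` are hypothetical, and a real deployment would more likely encode these conditions as rules in its alerting system than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Values observed over one evaluation window (hypothetical shape)."""
    total_requests: int
    slow_requests: int      # requests that exceeded their SLO latency target
    total_events: int
    failed_events: int
    dlq_size: int
    p99_seconds: float
    slo_target_seconds: float


def firing_alerts(s: Snapshot) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    # SLO breach: more than 1% of requests exceed the SLO target.
    if s.total_requests and s.slow_requests / s.total_requests > 0.01:
        alerts.append("slo_breach")
    # High error rate: more than 1% of events fail.
    if s.total_events and s.failed_events / s.total_events > 0.01:
        alerts.append("high_error_rate")
    # DLQ growth: more than 1000 events in the DLQ.
    if s.dlq_size > 1000:
        alerts.append("dlq_growth")
    # High latency: p99 above twice the SLO target.
    if s.p99_seconds > 2 * s.slo_target_seconds:
        alerts.append("high_latency")
    return alerts
```

All comparisons are strict, matching the ">" in each alert definition, so a DLQ holding exactly 1000 events does not fire the DLQ growth alert.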
