DNA
AIOSDNA

operations_manual-02_runbooks

Operational Runbooks

Runbook: Deploy New Customer Cluster

Prerequisites

  • Customer contract signed
  • Provisioning plan created
  • Resources allocated

Steps

  1. Create Provisioning Plan

    bash
    POST /commercial/provisioning/create
    {
    "tenant_id": "customer_123",
    "customer_name": "Acme Corp",
    "requested_domains": ["trading", "llm", "security"],
    "estimated_load": {"trading": 1000, "llm": 5000}
    }
  2. Review Plan

    • Verify resource allocation
    • Check quota limits
    • Confirm regions
  3. Execute Provisioning

    bash
    POST /commercial/provisioning/{plan_id}/execute
  4. Validate Deployment

    bash
    GET /commercial/production/health/{tenant_id}
  5. Run Onboarding Flow

    bash
    POST /onboarding/create
    POST /onboarding/{flow_id}/execute
  6. Post-Deployment Checks

    • All domains accessible
    • Health checks passing
    • Backups configured
    • Monitoring enabled
    • Customer notified

Rollback

If provisioning fails:

  1. Review error logs
  2. Clean up partial resources
  3. Notify customer
  4. Create new plan with fixes

Runbook: Handle Production Incident

Detection

  • Alert received via PagerDuty/Slack
  • Customer reports issue
  • System health dashboard shows degradation

Classification

Severity Levels:

  • Critical: System down, data loss, security breach
  • High: Major feature unavailable, SLA violation
  • Medium: Degraded performance, minor feature issues
  • Low: Cosmetic issues, non-critical bugs

Response Steps

  1. Acknowledge Incident

    bash
    POST /incidents/detect
    {
    "title": "LLM Kernel Latency Spike",
    "description": "P95 latency > 5s",
    "category": "latency",
    "affected_domains": ["llm"],
    "severity": "high"
    }
  2. Assess Impact

    • Check affected tenants
    • Review error rates
    • Check SLO violations
    • Estimate customer impact
  3. Mitigate

    bash
    POST /incidents/{incident_id}/mitigate
    {
    "mitigation_steps": [
    "Scale up LLM kernel nodes",
    "Enable circuit breaker",
    "Route traffic to backup region"
    ]
    }
  4. Verify Resolution

    • Monitor metrics for 15 minutes
    • Verify SLOs restored
    • Confirm customer impact resolved
  5. Close Incident

    bash
    POST /incidents/{incident_id}/close

Escalation

  • Critical: Escalate immediately to on-call engineer + manager
  • High: Escalate if not resolved in 30 minutes
  • Medium: Escalate if not resolved in 2 hours
  • Low: Escalate if not resolved in 24 hours

Runbook: Execute Customer Onboarding

Pre-Onboarding

  • Contract signed
  • Customer data collected
  • Technical requirements documented

Onboarding Steps

  1. Create Onboarding Flow

    bash
    POST /onboarding/create
    {
    "customer_type": "enterprise",
    "customer_name": "Acme Corp",
    "customer_id": "acme_001"
    }
  2. Execute Flow

    bash
    POST /onboarding/{flow_id}/execute
  3. Verify Health Checks

    • Connectivity: ✅
    • Permissions: ✅
    • Agents ready: ✅
    • Workflows validated: ✅
  4. Issue API Keys

    • Generate 3 API keys
    • Provide to customer
    • Document in customer portal
  5. Customer Training

    • Schedule training session
    • Provide documentation
    • Set up support channels
  6. Go-Live

    • Enable production traffic
    • Monitor for 24 hours
    • Daily check-ins for first week

Runbook: Run PoV for Prospect

Pre-PoV

  • Prospect qualified
  • Use case identified
  • Sample data collected

PoV Execution

  1. Generate PoV Kit

    bash
    POST /pov/generate
    {
    "prospect_name": "TechCorp",
    "prospect_id": "techcorp_pov",
    "capabilities": ["trading", "llm"],
    "sample_data": {
    "tickets": [...],
    "trading_history": [...]
    }
    }
  2. Deploy PoV

    bash
    POST /pov/{deployment_id}/deploy
  3. Execute Playbook

    bash
    POST /pov/playbooks/execute
    {
    "tenant_id": "techcorp_pov",
    "playbook_id": "trading_playbook"
    }
  4. Monitor Progress

    • Daily check-ins
    • Track KPI progress
    • Document improvements
  5. Generate Report

    bash
    GET /pov/report/{tenant_id}
  6. Present Results

    • Schedule demo
    • Present ROI findings
    • Answer questions

Runbook: Execute Major Upgrade

Pre-Upgrade

  1. Validate Upgrade

    bash
    POST /upgrade/validate
    {
    "upgrade_type": "kernel",
    "upgrade_target": "llm_kernel_v2",
    "current_version": "v1.0",
    "target_version": "v2.0"
    }
  2. Review Validation

    • Check compatibility issues
    • Review risk assessment
    • Get approval if risk > 0.5
  3. Create Upgrade Plan

    bash
    POST /upgrade/plan
    {
    "validation_id": "...",
    "strategy": "canary"
    }

Upgrade Execution

  1. Notify Customers (if required)

    • Send maintenance window notice
    • Update status page
  2. Execute Canary Deployment

    • Deploy to 10% of traffic
    • Monitor for 1 hour
    • Check metrics and errors
  3. Expand Deployment

    • Increase to 50% if canary passes
    • Monitor for 1 hour
    • Full rollout if stable
  4. Verify Upgrade

    • Check all health metrics
    • Verify SLO compliance
    • Confirm customer impact
  5. Rollback if Needed

    • Execute rollback plan
    • Restore previous version
    • Investigate issues

Runbook: Handle SLA Violation

Detection

  • SLA violation alert received
  • Customer reports SLA breach
  • Automated detection triggers

Response

  1. Acknowledge Violation

    bash
    GET /sla/report/{tenant_id}
  2. Assess Impact

    • Review violation details
    • Calculate service credits
    • Estimate customer impact
  3. Mitigate

    • Address root cause
    • Restore SLA compliance
    • Document resolution
  4. Compensate

    • Calculate service credits
    • Apply to next invoice
    • Notify customer
  5. Prevent Recurrence

    • Update monitoring
    • Adjust thresholds
    • Improve processes

Runbook: Generate Monthly Invoice

Pre-Invoicing

  1. Collect Usage Data

    • API calls
    • Workflow executions
    • Storage usage
    • Token consumption
  2. Calculate Charges

    • Subscription fees
    • Usage-based charges
    • Marketplace revenue
    • Discounts

Invoice Generation

  1. Generate Invoice

    bash
    GET /billing/invoice/{tenant_id}
    {
    "period_start": "2024-01-01",
    "period_end": "2024-01-31"
    }
  2. Review Invoice

    • Verify line items
    • Check calculations
    • Apply discounts
  3. Send to Customer

    • Email invoice
    • Update customer portal
    • Send payment reminder
  4. Sync to Payment Processor

    bash
    POST /billing/invoice/{invoice_id}/sync_stripe
  5. Track Payment

    • Monitor payment status
    • Follow up on overdue
    • Update accounting system

Runbook: Customer Expansion

Detection

  • High ROI detected
  • Usage approaching limits
  • Customer requests expansion

Expansion Process

  1. Identify Opportunity

    bash
    GET /customer/lifecycle/{tenant_id}
    # Check upsell_opportunities
  2. Prepare Proposal

    • Review current usage
    • Calculate expansion value
    • Prepare technical proposal
  3. Present to Customer

    • Schedule meeting
    • Present ROI data
    • Discuss expansion options
  4. Execute Expansion

    bash
    POST /customer/lifecycle/{lifecycle_id}/expand
    {
    "expansion_value": 50000.0
    }
  5. Update Configuration

    • Increase quotas
    • Enable new features
    • Update billing

Runbook: Disaster Recovery

Scenario: Region Failure

  1. Detect Failure

    • Region health check fails
    • Customer reports unavailable
    • Monitoring alerts
  2. Activate Failover

    bash
    POST /global/failover/execute
    {
    "source_region": "us-east-1",
    "target_region": "us-west-2"
    }
  3. Verify Failover

    • Check target region health
    • Verify traffic routing
    • Confirm customer access
  4. Monitor Recovery

    • Track metrics for 24 hours
    • Verify SLO compliance
    • Document downtime
  5. Post-Mortem

    • Root cause analysis
    • Update failover procedures
    • Improve monitoring

Runbook: Security Incident Response

Detection

  • Security alert from SentinelX
  • Customer reports security issue
  • Penetration test finds vulnerability

Response

  1. Classify Threat

    • Severity assessment
    • Impact analysis
    • Affected systems
  2. Contain Threat

    bash
    POST /security/incident/contain
    {
    "threat_id": "...",
    "actions": ["block_ip", "isolate_tenant"]
    }
  3. Investigate

    • Review security logs
    • Analyze attack vector
    • Identify root cause
  4. Remediate

    • Patch vulnerabilities
    • Update security rules
    • Rotate credentials
  5. Notify

    • Internal team
    • Affected customers (if required)
    • Compliance team
  6. Document

    • Incident report
    • Lessons learned
    • Process improvements

Was this helpful?