operations_manual-02_runbooks
Operational Runbooks
Runbook: Deploy New Customer Cluster
Prerequisites
- Customer contract signed
- Provisioning plan created
- Resources allocated
Steps
-
Create Provisioning Plan
bashPOST /commercial/provisioning/create{"tenant_id": "customer_123","customer_name": "Acme Corp","requested_domains": ["trading", "llm", "security"],"estimated_load": {"trading": 1000, "llm": 5000}} -
Review Plan
- Verify resource allocation
- Check quota limits
- Confirm regions
-
Execute Provisioning
bashPOST /commercial/provisioning/{plan_id}/execute -
Validate Deployment
bashGET /commercial/production/health/{tenant_id} -
Run Onboarding Flow
bashPOST /onboarding/createPOST /onboarding/{flow_id}/execute -
Post-Deployment Checks
- All domains accessible
- Health checks passing
- Backups configured
- Monitoring enabled
- Customer notified
Rollback
If provisioning fails:
- Review error logs
- Clean up partial resources
- Notify customer
- Create new plan with fixes
Runbook: Handle Production Incident
Detection
- Alert received via PagerDuty/Slack
- Customer reports issue
- System health dashboard shows degradation
Classification
Severity Levels:
- Critical: System down, data loss, security breach
- High: Major feature unavailable, SLA violation
- Medium: Degraded performance, minor feature issues
- Low: Cosmetic issues, non-critical bugs
Response Steps
-
Acknowledge Incident
bashPOST /incidents/detect{"title": "LLM Kernel Latency Spike","description": "P95 latency > 5s","category": "latency","affected_domains": ["llm"],"severity": "high"} -
Assess Impact
- Check affected tenants
- Review error rates
- Check SLO violations
- Estimate customer impact
-
Mitigate
bashPOST /incidents/{incident_id}/mitigate{"mitigation_steps": ["Scale up LLM kernel nodes","Enable circuit breaker","Route traffic to backup region"]} -
Verify Resolution
- Monitor metrics for 15 minutes
- Verify SLOs restored
- Confirm customer impact resolved
-
Close Incident
bashPOST /incidents/{incident_id}/close
Escalation
- Critical: Escalate immediately to on-call engineer + manager
- High: Escalate if not resolved in 30 minutes
- Medium: Escalate if not resolved in 2 hours
- Low: Escalate if not resolved in 24 hours
Runbook: Execute Customer Onboarding
Pre-Onboarding
- Contract signed
- Customer data collected
- Technical requirements documented
Onboarding Steps
-
Create Onboarding Flow
bashPOST /onboarding/create{"customer_type": "enterprise","customer_name": "Acme Corp","customer_id": "acme_001"} -
Execute Flow
bashPOST /onboarding/{flow_id}/execute -
Verify Health Checks
- Connectivity: ✅
- Permissions: ✅
- Agents ready: ✅
- Workflows validated: ✅
-
Issue API Keys
- Generate 3 API keys
- Provide to customer
- Document in customer portal
-
Customer Training
- Schedule training session
- Provide documentation
- Set up support channels
-
Go-Live
- Enable production traffic
- Monitor for 24 hours
- Daily check-ins for first week
Runbook: Run PoV for Prospect
Pre-PoV
- Prospect qualified
- Use case identified
- Sample data collected
PoV Execution
-
Generate PoV Kit
bashPOST /pov/generate{"prospect_name": "TechCorp","prospect_id": "techcorp_pov","capabilities": ["trading", "llm"],"sample_data": {"tickets": [...],"trading_history": [...]}} -
Deploy PoV
bashPOST /pov/{deployment_id}/deploy -
Execute Playbook
bashPOST /pov/playbooks/execute{"tenant_id": "techcorp_pov","playbook_id": "trading_playbook"} -
Monitor Progress
- Daily check-ins
- Track KPI progress
- Document improvements
-
Generate Report
bashGET /pov/report/{tenant_id} -
Present Results
- Schedule demo
- Present ROI findings
- Answer questions
Runbook: Execute Major Upgrade
Pre-Upgrade
-
Validate Upgrade
bashPOST /upgrade/validate{"upgrade_type": "kernel","upgrade_target": "llm_kernel_v2","current_version": "v1.0","target_version": "v2.0"} -
Review Validation
- Check compatibility issues
- Review risk assessment
- Get approval if risk > 0.5
-
Create Upgrade Plan
bashPOST /upgrade/plan{"validation_id": "...","strategy": "canary"}
Upgrade Execution
-
Notify Customers (if required)
- Send maintenance window notice
- Update status page
-
Execute Canary Deployment
- Deploy to 10% of traffic
- Monitor for 1 hour
- Check metrics and errors
-
Expand Deployment
- Increase to 50% if canary passes
- Monitor for 1 hour
- Full rollout if stable
-
Verify Upgrade
- Check all health metrics
- Verify SLO compliance
- Confirm customer impact
-
Rollback if Needed
- Execute rollback plan
- Restore previous version
- Investigate issues
Runbook: Handle SLA Violation
Detection
- SLA violation alert received
- Customer reports SLA breach
- Automated detection triggers
Response
-
Acknowledge Violation
bashGET /sla/report/{tenant_id} -
Assess Impact
- Review violation details
- Calculate service credits
- Estimate customer impact
-
Mitigate
- Address root cause
- Restore SLA compliance
- Document resolution
-
Compensate
- Calculate service credits
- Apply to next invoice
- Notify customer
-
Prevent Recurrence
- Update monitoring
- Adjust thresholds
- Improve processes
Runbook: Generate Monthly Invoice
Pre-Invoicing
-
Collect Usage Data
- API calls
- Workflow executions
- Storage usage
- Token consumption
-
Calculate Charges
- Subscription fees
- Usage-based charges
- Marketplace revenue
- Discounts
Invoice Generation
-
Generate Invoice
bashGET /billing/invoice/{tenant_id}{"period_start": "2024-01-01","period_end": "2024-01-31"} -
Review Invoice
- Verify line items
- Check calculations
- Apply discounts
-
Send to Customer
- Email invoice
- Update customer portal
- Send payment reminder
-
Sync to Payment Processor
bashPOST /billing/invoice/{invoice_id}/sync_stripe -
Track Payment
- Monitor payment status
- Follow up on overdue
- Update accounting system
Runbook: Customer Expansion
Detection
- High ROI detected
- Usage approaching limits
- Customer requests expansion
Expansion Process
-
Identify Opportunity
bashGET /customer/lifecycle/{tenant_id}# Check upsell_opportunities -
Prepare Proposal
- Review current usage
- Calculate expansion value
- Prepare technical proposal
-
Present to Customer
- Schedule meeting
- Present ROI data
- Discuss expansion options
-
Execute Expansion
bashPOST /customer/lifecycle/{lifecycle_id}/expand{"expansion_value": 50000.0} -
Update Configuration
- Increase quotas
- Enable new features
- Update billing
Runbook: Disaster Recovery
Scenario: Region Failure
-
Detect Failure
- Region health check fails
- Customer reports unavailable
- Monitoring alerts
-
Activate Failover
bashPOST /global/failover/execute{"source_region": "us-east-1","target_region": "us-west-2"} -
Verify Failover
- Check target region health
- Verify traffic routing
- Confirm customer access
-
Monitor Recovery
- Track metrics for 24 hours
- Verify SLO compliance
- Document downtime
-
Post-Mortem
- Root cause analysis
- Update failover procedures
- Improve monitoring
Runbook: Security Incident Response
Detection
- Security alert from SentinelX
- Customer reports security issue
- Penetration test finds vulnerability
Response
-
Classify Threat
- Severity assessment
- Impact analysis
- Affected systems
-
Contain Threat
bashPOST /security/incident/contain{"threat_id": "...","actions": ["block_ip", "isolate_tenant"]} -
Investigate
- Review security logs
- Analyze attack vector
- Identify root cause
-
Remediate
- Patch vulnerabilities
- Update security rules
- Rotate credentials
-
Notify
- Internal team
- Affected customers (if required)
- Compliance team
-
Document
- Incident report
- Lessons learned
- Process improvements