MPAC Platform - Disaster Recovery Runbook
Overview
This document covers disaster recovery (DR) procedures for all three MPAC platform systems:
- mpac-smartpos — ap-northeast-1 (primary), us-east-1 (DR standby)
- mpac-pgw — ap-northeast-1 (primary), us-east-1 (DR standby)
- mpac-obs — ap-northeast-1 (primary)
Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 15 minutes (RDS automated backups at 5-minute intervals)
1. Failure Scenarios
1.1 ECS Service Failure
Symptoms: Health checks failing, tasks crash-looping, 503s from ALB.
Detection: CloudWatch alarm ECS-TaskCount-Low triggers SNS → PagerDuty.
Recovery:
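# Check task failure reason (crashed tasks are STOPPED, so list stopped tasks rather than running ones)
aws ecs describe-tasks \
--cluster mpac-smartpos-{env} \
--tasks "$(aws ecs list-tasks --cluster mpac-smartpos-{env} --desired-status STOPPED --query 'taskArns[0]' --output text --region ap-northeast-1)" \
--region ap-northeast-1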
# Force new deployment (restarts tasks from the current task definition; this does not roll back the image)
aws ecs update-service \
--cluster mpac-smartpos-{env} \
--service svc-portal \
--force-new-deployment \
--region ap-northeast-1
# If image is bad, redeploy previous version via mpac-infra CD workflow
# GitHub Actions → deploy-production.yml → select service → enter previous image tag
Expected resolution time: 5-15 minutes
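The restart step can be paired with a wait so the on-call engineer knows when the service has settled. The following is a minimal sketch, not part of the existing tooling: `redeploy_and_wait` is a hypothetical helper name, and it assumes the AWS CLI is configured for the target account.

```shell
# Hypothetical helper: restart a service's tasks, then block until ECS reports it stable.
# Usage: redeploy_and_wait <cluster> <service> [region]
redeploy_and_wait() {
  cluster=$1
  service=$2
  region=${3:-ap-northeast-1}

  # Start fresh tasks from the current task definition.
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --force-new-deployment \
    --region "$region" >/dev/null || return 1

  # The waiter polls every 15 seconds and gives up after 40 checks (about 10 minutes).
  aws ecs wait services-stable \
    --cluster "$cluster" \
    --services "$service" \
    --region "$region"
}
```

Example: `redeploy_and_wait mpac-smartpos-prod svc-portal`. A non-zero exit means the service did not stabilize in time, which is the cue to redeploy the previous image tag instead.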
1.2 RDS Primary Failure
Symptoms: Database connection errors, 5xx from all services.
Detection: RDS DatabaseConnections drops to 0, ReadLatency alarm triggers.
Recovery (RDS Multi-AZ automatic failover):
- RDS Multi-AZ automatically promotes standby replica within 1-2 minutes.
- pgBouncer reconnects automatically after DNS propagation (~30 seconds).
- No manual action required for Multi-AZ failover.
Manual failover (if automatic fails):
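# Force failover to the Multi-AZ standby (reboot with failover; failover-db-cluster applies only to DB clusters)
aws rds reboot-db-instance \
--db-instance-identifier mpac-smartpos-{env} \
--force-failover \
--region ap-northeast-1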
# Monitor failover progress
aws rds describe-db-instances \
--db-instance-identifier mpac-smartpos-{env} \
--query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone,MultiAZ:MultiAZ}' \
--region ap-northeast-1
Expected resolution time: 2-5 minutes (automatic), 10 minutes (manual)
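Rather than re-running the status command by hand, the check can be looped until the instance is back. This is a sketch under assumptions: `wait_for_rds` is a hypothetical helper, and the attempt count and interval are illustrative defaults, not values taken from this runbook.

```shell
# Hypothetical helper: poll an RDS instance until it reports "available".
# Usage: wait_for_rds <db-instance-id> [region] [max-attempts] [sleep-seconds]
wait_for_rds() {
  db_id=$1
  region=${2:-ap-northeast-1}
  max_attempts=${3:-40}
  interval=${4:-15}
  attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # A failed API call is reported as "unknown" and retried.
    status=$(aws rds describe-db-instances \
      --db-instance-identifier "$db_id" \
      --query 'DBInstances[0].DBInstanceStatus' \
      --output text \
      --region "$region") || status=unknown
    echo "attempt $attempt: $db_id is $status"
    [ "$status" = "available" ] && return 0
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}
```

The built-in `aws rds wait db-instance-available` waiter does the same job silently; the loop above is only useful when the intermediate states should be visible in the incident log.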
1.3 Redis/ElastiCache Failure
Symptoms: Authentication failures (JWT blocklist unavailable), session errors.
Detection: CacheHits drops, EngineCPUUtilization alarm.
Impact Assessment:
- svc-portal: JWT revocation list unavailable → issued tokens cannot be revoked until Redis recovers. Security impact: low (tokens expire in 15 min).
- svc-smarttab: Idempotency cache down → duplicate requests possible. Handle at DB layer.
- mpac-pgw: Rate limiting down → rate limits not enforced. Monitor for abuse.
Recovery:
# Check ElastiCache status
aws elasticache describe-replication-groups \
--replication-group-id mpac-smartpos-{env} \
--region ap-northeast-1
# ElastiCache auto-recovers with Multi-AZ. If primary node fails,
# replica is promoted within 1-2 minutes automatically.
Expected resolution time: 1-3 minutes (automatic failover)
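To confirm that a replica was actually promoted, the node roles can be read from the same `describe-replication-groups` call. The sketch below assumes a single-shard (cluster mode disabled) replication group, where each member reports a `CurrentRole`; the function names are hypothetical.

```shell
# Hypothetical helper: print "<node-id> <role>" for each member of the first node group.
# Usage: cache_roles <replication-group-id> [region]
cache_roles() {
  aws elasticache describe-replication-groups \
    --replication-group-id "$1" \
    --query 'ReplicationGroups[0].NodeGroups[0].NodeGroupMembers[].[CacheClusterId,CurrentRole]' \
    --output text \
    --region "${2:-ap-northeast-1}"
}

# Succeeds only if exactly one member currently reports the "primary" role.
has_single_primary() {
  [ "$(cache_roles "$@" | awk '$2 == "primary"' | wc -l | tr -d ' ')" = "1" ]
}
```

Example: `has_single_primary mpac-smartpos-prod && echo "failover complete"`.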
1.4 Full Region Failure (ap-northeast-1)
Symptoms: All mpac-smartpos endpoints unreachable.
Detection: External health check against the ap-northeast-1 endpoints fails for >5 minutes.
Recovery Steps:
- Declare incident, notify team.
- Activate DR environment in us-east-1:
cd mpac-infra
make deploy-smartpos-prod AWS_PROFILE=prod ENV=dr
# Uses mpac-smartpos/parameters/production.json with DR VPC CIDRs
- Update Route53 record to point api.mpac-cloud.com to the us-east-1 ALB:
aws route53 change-resource-record-sets \
--hosted-zone-id Z0677506ALKL276Q2VVE \
--change-batch file://shared/scripts/failover-to-dr.json \
--region us-east-1
- Restore latest RDS snapshot in us-east-1:
# List recent snapshots available in the DR region
aws rds describe-db-snapshots \
--db-instance-identifier mpac-smartpos-prod \
--query 'DBSnapshots[-5:].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime}' \
--output table --region us-east-1
# Restore to us-east-1
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier mpac-smartpos-dr \
--db-snapshot-identifier <latest-snapshot-id> \
--db-instance-class db.r6g.large \
--region us-east-1
- Update DATABASE_URL secret in us-east-1 SSM:
aws ssm put-parameter \
--name /mpac-smartpos/prod/DATABASE_URL \
--value "postgresql://..." \
--type SecureString --overwrite \
--region us-east-1
- Verify services start and health checks pass.
- Communicate ETA to stakeholders.
Expected RTO: 2-4 hours
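The contents of `shared/scripts/failover-to-dr.json` are not reproduced in this runbook. As an illustration of the format `change-resource-record-sets` expects, the sketch below generates a change batch that repoints a record at a load balancer through an alias record. The helper name and the choice of an alias A record are assumptions, and the real file may differ.

```shell
# Hypothetical helper: emit a Route53 change batch that points a record at an ALB.
# Usage: make_failover_batch <record-name> <alb-dns-name> <alb-hosted-zone-id>
# Note: <alb-hosted-zone-id> is the load balancer's canonical hosted zone ID,
# not the hosted zone of the domain itself.
make_failover_batch() {
  cat <<EOF
{
  "Comment": "Fail over $1 to the DR load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "$3",
          "DNSName": "$2",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}
EOF
}
```

The ALB values can be read with `aws elbv2 describe-load-balancers` (fields `DNSName` and `CanonicalHostedZoneId`). Writing the output to a file and passing it as `--change-batch file://<path>` keeps the change reviewable before it is applied.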
1.5 mpac-pgw Region Failure (ap-northeast-1)
Impact: QR payment processing down. Credit card (SP-NET) and cash unaffected.
Recovery Steps:
- Activate PGW DR in us-east-1:
AWS_REGION=us-east-1 ENV=dr ./mpac-pgw/scripts/provision.sh mpac-pgw-dr dr
- Update PGW_BASE_URL in mpac-smartpos SSM to point to us-east-1 DR endpoint.
- Restore latest PGW database snapshot:
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier mpac-pgw-dr \
--db-snapshot-identifier <latest-pgw-snapshot> \
--region us-east-1
- Notify payment team: QR transactions during outage window may need manual reconciliation.
Expected RTO: 2-3 hours
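The restore step leaves the snapshot identifier to be looked up by hand. A lookup along the following lines can select the newest usable snapshot. This is a sketch: the function names are hypothetical, and it assumes snapshots of the source instance already exist in the region being queried.

```shell
# Hypothetical helper: print the identifier of the newest "available" snapshot.
# Usage: latest_snapshot_id <source-db-instance-id> [region]
latest_snapshot_id() {
  aws rds describe-db-snapshots \
    --db-instance-identifier "$1" \
    --query 'sort_by(DBSnapshots[?Status==`available`], &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
    --output text \
    --region "${2:-us-east-1}"
}

# Fail loudly instead of handing "None" to the restore command.
require_snapshot() {
  snap=$(latest_snapshot_id "$@") || return 1
  if [ -z "$snap" ] || [ "$snap" = "None" ]; then
    echo "no available snapshot found for $1" >&2
    return 1
  fi
  echo "$snap"
}
```

Sorting explicitly by `SnapshotCreateTime` avoids relying on the order in which the API happens to return snapshots.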
1.6 mpac-obs Failure
Impact: Observability only. No user-facing impact.
Recovery:
- Re-deploy the mpac-obs stack:
make deploy-obs-prod
- Historical data is in S3 (Loki + Tempo). The new EC2 OBS box will re-attach the same S3 buckets.
- Grafana dashboards are in git (mpac-obs/grafana/dashboards/) and are automatically provisioned on restart.
Expected RTO: 30-60 minutes
2. Database Backup & Restore
2.1 Automated Backups
| System | Backup Window | Retention | Point-in-Time Recovery |
|---|---|---|---|
| mpac-smartpos RDS | 02:00-03:00 UTC | 7 days | Yes (5-min intervals) |
| mpac-pgw RDS | 03:00-04:00 UTC | 7 days | Yes (5-min intervals) |
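Whether point-in-time recovery currently meets the 15-minute RPO can be checked from the instance's `LatestRestorableTime`. The sketch below keeps the comparison in a pure function so it can be reused in monitoring scripts; both function names are hypothetical.

```shell
# Succeeds if the newest restorable point is within the RPO.
# Usage: within_rpo <latest-restorable-epoch> <now-epoch> [rpo-minutes]
within_rpo() {
  latest=$1
  now=$2
  rpo_minutes=${3:-15}
  lag=$((now - latest))
  echo "restore lag: ${lag}s (limit $((rpo_minutes * 60))s)"
  [ "$lag" -le $((rpo_minutes * 60)) ]
}

# Fetch LatestRestorableTime for an instance as an ISO-8601 timestamp.
# Usage: latest_restorable_time <db-instance-id> [region]
latest_restorable_time() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text \
    --region "${2:-ap-northeast-1}"
}
```

With GNU `date`, the two can be combined as `within_rpo "$(date -u -d "$(latest_restorable_time mpac-smartpos-prod)" +%s)" "$(date -u +%s)"`. With a 5-minute backup interval, the reported lag should normally stay well under the 900-second limit.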
2.2 Manual Snapshot Before Risky Operations
# Before production deployments with schema changes
aws rds create-db-snapshot \
--db-instance-identifier mpac-smartpos-prod \
--db-snapshot-identifier mpac-smartpos-prod-pre-deploy-$(date +%Y%m%d) \
--region ap-northeast-1
2.3 Point-in-Time Recovery
# Restore to specific point in time (e.g., 30 minutes ago)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier mpac-smartpos-prod \
--target-db-instance-identifier mpac-smartpos-pitr-$(date +%Y%m%d%H%M) \
--restore-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
--region ap-northeast-1
3. Incident Communication
3.1 Severity Levels
| Severity | Impact | Response Time | Update Frequency |
|---|---|---|---|
| P0 | Full outage, all users | 5 min | Every 15 min |
| P1 | Partial outage (>25% users) | 15 min | Every 30 min |
| P2 | Degraded performance | 1 hour | Every 2 hours |
| P3 | Single feature down | 4 hours | Daily |
3.2 Incident Response Steps
- Detect: Alarm fires → PagerDuty → on-call engineer.
- Acknowledge: Engineer acknowledges within response time.
- Assess: Determine severity, affected components.
- Mitigate: Apply immediate mitigation (rollback, scale up, disable feature).
- Communicate: Post to #incidents Slack channel.
- Resolve: Full fix or workaround in place.
- Post-mortem: Write post-mortem within 48 hours for P0/P1.
4. Contact & Escalation
| Role | Escalation |
|---|---|
| On-call engineer | PagerDuty rotation |
| Platform lead | P0/P1 only |
| AWS support | For AWS infrastructure failures (Business/Enterprise support) |
5. DR Test Schedule
Perform DR drills quarterly:
- January: mpac-smartpos region failover simulation
- April: RDS point-in-time recovery drill
- July: mpac-pgw failover simulation
- October: Full DR simulation (all systems)
Document results in docs/dr-test-results/YYYY-QQ.md.
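A small helper can derive the results file name from the drill date. This sketch assumes the `QQ` placeholder is rendered as `Q1` through `Q4`; adjust the format if the repository uses a different convention. The function name is hypothetical.

```shell
# Hypothetical helper: build the results path for the quarter containing <month>.
# Usage: dr_results_path <year> <month 1-12, leading zero allowed>
dr_results_path() {
  year=$1
  month=${2#0}   # strip a leading zero so 08 and 09 are not parsed as octal
  quarter=$(( (month - 1) / 3 + 1 ))
  echo "docs/dr-test-results/${year}-Q${quarter}.md"
}
```

Example: `dr_results_path "$(date +%Y)" "$(date +%m)"` prints the path for the current quarter.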