MPAC Platform - Disaster Recovery Runbook

Overview

This document covers disaster recovery (DR) procedures for all three MPAC platform systems:

mpac-smartpos — ap-northeast-1 (primary), us-east-1 (DR standby)
mpac-pgw — ap-northeast-1 (primary), us-east-1 (DR standby)
mpac-obs — ap-northeast-1 (primary)

Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 15 minutes (RDS automated backups at 5-minute intervals)

1. Failure Scenarios

1.1 ECS Service Failure

Symptoms: Health checks failing, tasks crash-looping, 503s from ALB.

Detection: CloudWatch alarm ECS-TaskCount-Low triggers SNS → PagerDuty.

Recovery:

bash

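# Check why a task failed (crash-looping tasks are listed as STOPPED, not RUNNING)
aws ecs describe-tasks \
  --cluster mpac-smartpos-{env} \
  --tasks $(aws ecs list-tasks --cluster mpac-smartpos-{env} --desired-status STOPPED --query 'taskArns[0]' --output text --region ap-northeast-1) \
  --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[].{Name:name,ExitCode:exitCode,Reason:reason}}' \
  --region ap-northeast-1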

# Force new deployment (replaces tasks using the current task definition; this does not roll back the image)
aws ecs update-service \
  --cluster mpac-smartpos-{env} \
  --service svc-portal \
  --force-new-deployment \
  --region ap-northeast-1

# If image is bad, redeploy previous version via mpac-infra CD workflow
# GitHub Actions → deploy-production.yml → select service → enter previous image tag

Expected resolution time: 5-15 minutes
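
After forcing a new deployment, it can help to confirm that the service has converged before closing the incident. The following is a minimal sketch, assuming the cluster and service names used above; the helper name service_is_stable is hypothetical and not part of mpac-infra.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the ECS service runs as many tasks as it wants.
# Exit status: 0 = stable, non-zero = not stable or the AWS CLI call failed.
service_is_stable() {
  cluster="$1"
  service="$2"
  region="${3:-ap-northeast-1}"
  counts=$(aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text \
    --region "$region") || return 2
  # Text output is "<desired><TAB><running>"; split it into $1 and $2.
  set -- $counts
  [ "${1:-0}" -gt 0 ] && [ "$1" = "${2:-}" ]
}

# Usage (not run automatically):
#   service_is_stable mpac-smartpos-prod svc-portal && echo "svc-portal is stable"
```

The AWS CLI also provides aws ecs wait services-stable, which blocks until the same condition holds; the helper above is only useful when a single non-blocking check is preferred.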

1.2 RDS Primary Failure

Symptoms: Database connection errors, 5xx from all services.

Detection: RDS DatabaseConnections drops to 0, ReadLatency alarm triggers.

Recovery (RDS Multi-AZ automatic failover):

RDS Multi-AZ automatically promotes standby replica within 1-2 minutes.
pgBouncer reconnects automatically after DNS propagation (~30 seconds).
No manual action required for Multi-AZ failover.

Manual failover (if automatic fails):

bash

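# Force failover to the standby (Multi-AZ DB instance)
aws rds reboot-db-instance \
  --db-instance-identifier mpac-smartpos-{env} \
  --force-failover \
  --region ap-northeast-1
# If the database is an Aurora or Multi-AZ DB cluster instead, use:
#   aws rds failover-db-cluster --db-cluster-identifier mpac-smartpos-{env} --region ap-northeast-1

# Monitor failover progress (the primary AZ changes once failover completes)
aws rds describe-db-instances \
  --db-instance-identifier mpac-smartpos-{env} \
  --query 'DBInstances[0].{Status:DBInstanceStatus,PrimaryAZ:AvailabilityZone,StandbyAZ:SecondaryAvailabilityZone}' \
  --region ap-northeast-1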

Expected resolution time: 2-5 minutes (automatic), 10 minutes (manual)
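
To avoid re-running the status query by hand during a failover, the check can be wrapped in a polling loop. This is a sketch only: the function name wait_for_db_available, the POLL_INTERVAL variable, and the default attempt limit are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll the DB instance until its status is "available".
# Returns 0 once available, 1 if the attempt limit is reached first.
wait_for_db_available() {
  instance="$1"
  max_attempts="${2:-40}"
  region="${3:-ap-northeast-1}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(aws rds describe-db-instances \
      --db-instance-identifier "$instance" \
      --query 'DBInstances[0].DBInstanceStatus' \
      --output text \
      --region "$region" 2>/dev/null)
    echo "attempt $attempt: status=${status:-unknown}"
    [ "$status" = "available" ] && return 0
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-15}"
  done
  return 1
}

# Usage (not run automatically):
#   wait_for_db_available mpac-smartpos-prod
```

With the defaults, the loop gives up after roughly 10 minutes, which matches the manual failover estimate above. The built-in aws rds wait db-instance-available is an alternative when no progress output is needed.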

1.3 Redis/ElastiCache Failure

Symptoms: Authentication failures (JWT blocklist unavailable), session errors.

Detection: CacheHits drops, EngineCPUUtilization alarm.

Impact Assessment:

svc-portal: JWT revocation list unavailable → issued tokens cannot be revoked until Redis recovers. Security impact: low (tokens expire in 15 min).
svc-smarttab: Idempotency cache down → duplicate requests possible. Handle at DB layer.
mpac-pgw: Rate limiting down → rate limits not enforced. Monitor for abuse.

Recovery:

bash

# Check ElastiCache status
aws elasticache describe-replication-groups \
  --replication-group-id mpac-smartpos-{env} \
  --region ap-northeast-1

# ElastiCache auto-recovers with Multi-AZ. If primary node fails,
# replica is promoted within 1-2 minutes automatically.

Expected resolution time: 1-3 minutes (automatic failover)
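
Automatic recovery depends on the replication group having automatic failover and Multi-AZ enabled, so it is worth checking both alongside the status. A minimal sketch, assuming the replication group ID shown above; the function name redis_failover_ready is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when the replication group is available
# and both automatic failover and Multi-AZ are enabled.
redis_failover_ready() {
  group="$1"
  region="${2:-ap-northeast-1}"
  state=$(aws elasticache describe-replication-groups \
    --replication-group-id "$group" \
    --query 'ReplicationGroups[0].[Status,AutomaticFailover,MultiAZ]' \
    --output text \
    --region "$region") || return 2
  # Text output is "<status><TAB><automatic failover><TAB><multi-az>".
  set -- $state
  echo "status=${1:-unknown} automatic_failover=${2:-unknown} multi_az=${3:-unknown}"
  [ "${1:-}" = "available" ] && [ "${2:-}" = "enabled" ] && [ "${3:-}" = "enabled" ]
}

# Usage (not run automatically):
#   redis_failover_ready mpac-smartpos-prod || echo "manual intervention may be needed"
```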

1.4 Full Region Failure (ap-northeast-1)

Symptoms: All mpac-smartpos endpoints unreachable.

Detection: External health check against the ap-northeast-1 endpoints fails for >5 minutes.

Recovery Steps:

Declare incident, notify team.

Activate DR environment in us-east-1:

bash

cd mpac-infra
make deploy-smartpos-prod AWS_PROFILE=prod ENV=dr
# Uses mpac-smartpos/parameters/production.json with DR VPC CIDRs

Update Route53 record to point api.mpac-cloud.com to the us-east-1 ALB:

bash

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0677506ALKL276Q2VVE \
  --change-batch file://shared/scripts/failover-to-dr.json
# Route 53 is a global service, so no --region flag is needed
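
The contents of shared/scripts/failover-to-dr.json are not shown in this runbook. As an illustration of what a change batch for this step could look like, the sketch below renders one from two arguments. It assumes the record is a plain CNAME; if api.mpac-cloud.com is an alias record, the batch would need an AliasTarget block instead. The function name and the example ALB hostname are hypothetical.

```shell
#!/bin/sh
# Hypothetical generator for a Route 53 change batch.
# The real shared/scripts/failover-to-dr.json may differ.
render_failover_batch() {
  record_name="$1"   # record to repoint, e.g. api.mpac-cloud.com
  target_dns="$2"    # DNS name of the DR load balancer (placeholder in the usage below)
  ttl="${3:-60}"     # short TTL so clients pick up the change quickly
  cat <<EOF
{
  "Comment": "Fail over ${record_name} to the DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${target_dns}" }]
      }
    }
  ]
}
EOF
}

# Usage (not run automatically; the ALB hostname is a placeholder):
#   render_failover_batch api.mpac-cloud.com dr-alb.example.com > /tmp/failover.json
```

Rendering the batch at failover time keeps the DR target out of a static file, at the cost of one more step under pressure; a pre-reviewed static file, as used above, is the safer default.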

Restore latest RDS snapshot in us-east-1:

bash

# List recent snapshots available in the DR region.
# Snapshots must already have been copied to us-east-1; snapshots that exist
# only in ap-northeast-1 are unreachable during a regional outage.
aws rds describe-db-snapshots \
  --db-instance-identifier mpac-smartpos-prod \
  --query "sort_by(DBSnapshots[?Status=='available'],&SnapshotCreateTime)[-5:].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime}" \
  --output table --region us-east-1

# Restore to us-east-1
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mpac-smartpos-dr \
  --db-snapshot-identifier <latest-snapshot-id> \
  --db-instance-class db.r6g.large \
  --region us-east-1
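
Copying the snapshot ID from a table by hand is easy to get wrong during an incident. The sketch below selects the newest available snapshot in the DR region (us-east-1 per the Overview) and rejects an empty result. Both function names are hypothetical, and the approach assumes snapshots are already being copied to the DR region.

```shell
#!/bin/sh
# Hypothetical helper: print the identifier of the newest available snapshot.
latest_snapshot_id() {
  instance="$1"
  region="${2:-us-east-1}"
  aws rds describe-db-snapshots \
    --db-instance-identifier "$instance" \
    --query "sort_by(DBSnapshots[?Status=='available'],&SnapshotCreateTime)[-1].DBSnapshotIdentifier" \
    --output text \
    --region "$region"
}

# Hypothetical guard: the CLI prints "None" when the query matches nothing.
snapshot_id_is_valid() {
  case "$1" in
    ""|None|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage (not run automatically):
#   id=$(latest_snapshot_id mpac-smartpos-prod)
#   snapshot_id_is_valid "$id" || { echo "no snapshot found in DR region" >&2; exit 1; }
```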

Update DATABASE_URL secret in us-east-1 SSM:

bash

aws ssm put-parameter \
  --name /mpac-smartpos/prod/DATABASE_URL \
  --value "postgresql://..." \
  --type SecureString --overwrite \
  --region us-east-1

Verify services start and health checks pass.
Communicate ETA to stakeholders.
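
The verification step can be scripted as a simple polling loop against a health endpoint. This is a sketch under assumptions: the /health path, the function name wait_until_healthy, and the POLL_INTERVAL variable are hypothetical, and a 200 response is taken to mean healthy.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it returns HTTP 200.
# Returns 0 on the first 200, 1 if the attempt limit is reached first.
wait_until_healthy() {
  url="$1"
  max_attempts="${2:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -w prints only the status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
    echo "attempt $attempt: HTTP ${code:-000}"
    [ "$code" = "200" ] && return 0
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  return 1
}

# Usage (not run automatically; the /health path is an assumption):
#   wait_until_healthy https://api.mpac-cloud.com/health
```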

Expected RTO: 2-4 hours

1.5 mpac-pgw Region Failure (ap-northeast-1)

Impact: QR payment processing down. Credit card (SP-NET) and cash unaffected.

Recovery Steps:

Activate PGW DR in us-east-1:

bash

AWS_REGION=us-east-1 ENV=dr ./mpac-pgw/scripts/provision.sh mpac-pgw-dr dr

Update PGW_BASE_URL in mpac-smartpos SSM to point to us-east-1 DR endpoint.
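
A sketch of that parameter update follows. The parameter name /mpac-smartpos/prod/PGW_BASE_URL is an assumption modeled on the DATABASE_URL path used elsewhere in this runbook, and the function name and example endpoint are hypothetical. If the value is injected into ECS tasks at start-up, running tasks would also need a new deployment to pick it up.

```shell
#!/bin/sh
# Hypothetical helper: point mpac-smartpos at the DR payment gateway.
set_pgw_base_url() {
  dr_endpoint="$1"
  param_name="${2:-/mpac-smartpos/prod/PGW_BASE_URL}"   # assumed parameter name
  region="${3:-ap-northeast-1}"                         # region where mpac-smartpos reads SSM
  # Refuse anything that is not HTTPS so a typo cannot downgrade payment traffic.
  case "$dr_endpoint" in
    https://*) ;;
    *) echo "refusing non-HTTPS endpoint: $dr_endpoint" >&2; return 1 ;;
  esac
  aws ssm put-parameter \
    --name "$param_name" \
    --value "$dr_endpoint" \
    --type String --overwrite \
    --region "$region"
}

# Usage (not run automatically; the endpoint is a placeholder):
#   set_pgw_base_url https://pgw-dr.example.com
```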

Restore latest PGW database snapshot:

bash

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mpac-pgw-dr \
  --db-snapshot-identifier <latest-pgw-snapshot> \
  --region us-east-1

Notify payment team — QR transactions during outage window may need manual reconciliation.

Expected RTO: 2-3 hours

1.6 mpac-obs Failure

Impact: Observability only. No user-facing impact.

Recovery:

Re-deploy the OBS stack: make deploy-obs-prod
Historical data is in S3 (Loki + Tempo). New EC2 OBS box will re-attach same S3 buckets.
Grafana dashboards are in git (mpac-obs/grafana/dashboards/), automatically provisioned on restart.

Expected RTO: 30-60 minutes

2. Database Backup & Restore

2.1 Automated Backups

System	Backup Window	Retention	Point-in-Time Recovery
mpac-smartpos RDS	02:00-03:00 UTC	7 days	Yes (5-min intervals)
mpac-pgw RDS	03:00-04:00 UTC	7 days	Yes (5-min intervals)
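
Whether the 15-minute RPO is actually being met can be checked by comparing the instance's LatestRestorableTime with the current time. The sketch below keeps the comparison as plain arithmetic on epoch seconds; the function name rpo_within_target is hypothetical, and the usage comment relies on GNU date, as the point-in-time example in this runbook does.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the gap between the latest restorable
# time and now is within the RPO target (default 900 seconds = 15 minutes).
# Arguments: <latest restorable epoch> <current epoch> [target seconds]
rpo_within_target() {
  lag=$(( $2 - $1 ))
  target="${3:-900}"
  echo "restore lag: ${lag}s (target ${target}s)"
  [ "$lag" -ge 0 ] && [ "$lag" -le "$target" ]
}

# Usage (not run automatically; requires GNU date):
#   latest=$(aws rds describe-db-instances \
#     --db-instance-identifier mpac-smartpos-prod \
#     --query 'DBInstances[0].LatestRestorableTime' \
#     --output text --region ap-northeast-1)
#   rpo_within_target "$(date -u -d "$latest" +%s)" "$(date -u +%s)"
```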

2.2 Manual Snapshot Before Risky Operations

bash

# Before production deployments with schema changes
aws rds create-db-snapshot \
  --db-instance-identifier mpac-smartpos-prod \
  --db-snapshot-identifier mpac-smartpos-prod-pre-deploy-$(date +%Y%m%d) \
  --region ap-northeast-1

2.3 Point-in-Time Recovery

bash

# Restore to specific point in time (e.g., 30 minutes ago)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mpac-smartpos-prod \
  --target-db-instance-identifier mpac-smartpos-pitr-$(date +%Y%m%d%H%M) \
  --restore-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --region ap-northeast-1
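
When the goal is to lose as little data as possible rather than to rewind to a known-good moment, the restore can target the latest restorable time instead of a computed timestamp. A minimal sketch, with a hypothetical function name and a guard against reusing the source identifier:

```shell
#!/bin/sh
# Hypothetical helper: restore a new instance from the most recent restorable point.
restore_to_latest() {
  source_id="$1"
  target_id="$2"
  region="${3:-ap-northeast-1}"
  # The restore always creates a new instance, so the names must differ.
  if [ "$source_id" = "$target_id" ]; then
    echo "target identifier must differ from source" >&2
    return 1
  fi
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$source_id" \
    --target-db-instance-identifier "$target_id" \
    --use-latest-restorable-time \
    --region "$region"
}

# Usage (not run automatically):
#   restore_to_latest mpac-smartpos-prod "mpac-smartpos-pitr-$(date +%Y%m%d%H%M)"
```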

3. Incident Communication

3.1 Severity Levels

Severity	Impact	Response Time	Update Frequency
P0	Full outage, all users	5 min	Every 15 min
P1	Partial outage (>25% users)	15 min	Every 30 min
P2	Degraded performance	1 hour	Every 2 hours
P3	Single feature down	4 hours	Daily

3.2 Incident Response Steps

Detect: Alarm fires → PagerDuty → on-call engineer.
Acknowledge: Engineer acknowledges within response time.
Assess: Determine severity, affected components.
Mitigate: Apply immediate mitigation (rollback, scale up, disable feature).
Communicate: Post to #incidents Slack channel.
Resolve: Full fix or workaround in place.
Post-mortem: Write post-mortem within 48 hours for P0/P1.

4. Contact & Escalation

Role	Escalation
On-call engineer	PagerDuty rotation
Platform lead	P0/P1 only
AWS support	For AWS infrastructure failures (Business/Enterprise support)

5. DR Test Schedule

Perform DR drills quarterly:

January: mpac-smartpos region failover simulation
April: RDS point-in-time recovery drill
July: mpac-pgw failover simulation
October: Full DR simulation (all systems)

Document results in docs/dr-test-results/YYYY-QQ.md.
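
To keep result file names consistent across drills, the path can be derived from the drill date. This sketch assumes that YYYY-QQ means the calendar year followed by Q1 through Q4; the function name dr_results_path is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a year and month to the drill results path.
dr_results_path() {
  year="$1"
  month="${2#0}"   # strip one leading zero so "07" is treated as 7, not octal
  quarter=$(( (month - 1) / 3 + 1 ))
  echo "docs/dr-test-results/${year}-Q${quarter}.md"
}

# Usage (not run automatically):
#   dr_results_path "$(date +%Y)" "$(date +%m)"
```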
