MPAC Platform - Disaster Recovery Runbook

Overview

This document covers disaster recovery (DR) procedures for all three MPAC platform systems:

  • mpac-smartpos — ap-northeast-1 (primary), us-east-1 (DR standby)
  • mpac-pgw — ap-northeast-1 (primary), us-east-1 (DR standby)
  • mpac-obs — ap-northeast-1 (primary)

Recovery Time Objective (RTO): 4 hours

Recovery Point Objective (RPO): 15 minutes (RDS automated backups capture transaction logs at 5-minute intervals)
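As a rough way to check the RPO in practice, the sketch below compares the age of an instance's latest restorable time against the 15-minute objective. The function names (`rpo_met`, `latest_restorable_time`) and the `RPO_SECONDS` variable are illustrative and not part of the MPAC tooling; `LatestRestorableTime` is the field RDS reports in `describe-db-instances`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not existing MPAC scripts.

RPO_SECONDS=900   # 15 minutes, the RPO stated above

# rpo_met AGE_SECONDS [LIMIT_SECONDS]
# Succeeds (exit 0) when the newest restorable point is within the RPO.
rpo_met() {
  local age="$1" limit="${2:-$RPO_SECONDS}"
  [ "$age" -le "$limit" ]
}

# latest_restorable_time DB_INSTANCE_ID REGION
# Prints the latest point in time the instance can be restored to.
latest_restorable_time() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text --region "$2"
}

# Example (GNU date assumed for parsing the timestamp):
#   ts=$(latest_restorable_time mpac-smartpos-prod ap-northeast-1)
#   age=$(( $(date -u +%s) - $(date -u -d "$ts" +%s) ))
#   rpo_met "$age" || echo "RPO exceeded: last restorable point is ${age}s old"
```

Keeping the comparison in its own function lets it be reused in a scheduled check or a DR drill without calling AWS.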


1. Failure Scenarios

1.1 ECS Service Failure

Symptoms: Health checks failing, tasks crash-looping, 503s from ALB.

Detection: CloudWatch alarm ECS-TaskCount-Low triggers SNS → PagerDuty.

Recovery:

bash
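# Check why a recently stopped task failed.
# list-tasks returns only RUNNING tasks by default, so ask for STOPPED ones.
aws ecs describe-tasks \
  --cluster mpac-smartpos-{env} \
  --tasks $(aws ecs list-tasks --cluster mpac-smartpos-{env} --desired-status STOPPED --query 'taskArns[0]' --output text --region ap-northeast-1) \
  --query 'tasks[0].{Reason:stoppedReason,Containers:containers[].{Name:name,Exit:exitCode,Reason:reason}}' \
  --region ap-northeast-1

# Force a new deployment of the current task definition.
# This starts fresh tasks; it does NOT roll back to an earlier image.
aws ecs update-service \
  --cluster mpac-smartpos-{env} \
  --service svc-portal \
  --force-new-deployment \
  --region ap-northeast-1

# If the image is bad, redeploy the previous version via the mpac-infra CD workflow:
# GitHub Actions → deploy-production.yml → select service → enter previous image tag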

Expected resolution time: 5-15 minutes
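After forcing a deployment it helps to confirm that the service actually settles. The sketch below wraps the `aws ecs wait services-stable` waiter and then prints the running and desired task counts. The function name `wait_for_ecs_service` is hypothetical, and the default region is taken from the commands in this section.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_ecs_service is an illustrative helper, not an MPAC script.

# wait_for_ecs_service CLUSTER SERVICE [REGION]
wait_for_ecs_service() {
  local cluster="$1" service="$2" region="${3:-ap-northeast-1}"

  # Blocks until the service reaches a steady state or the waiter times out.
  if ! aws ecs wait services-stable \
        --cluster "$cluster" --services "$service" --region "$region"; then
    echo "service $service did not stabilise; check stopped task reasons" >&2
    return 1
  fi

  # Summarise the result: task counts and the state of the newest deployment.
  aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].{Running:runningCount,Desired:desiredCount,Rollout:deployments[0].rolloutState}' \
    --output table --region "$region"
}

# Example:
#   wait_for_ecs_service mpac-smartpos-prod svc-portal
```

A non-zero exit from the helper is a reasonable signal to move on to redeploying the previous image tag.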


1.2 RDS Primary Failure

Symptoms: Database connection errors, 5xx from all services.

Detection: RDS DatabaseConnections drops to 0, ReadLatency alarm triggers.

Recovery (RDS Multi-AZ automatic failover):

  • RDS Multi-AZ automatically promotes standby replica within 1-2 minutes.
  • pgBouncer reconnects automatically after DNS propagation (~30 seconds).
  • No manual action required for Multi-AZ failover.

Manual failover (if automatic fails):

bash
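# Force failover to standby.
# Assumes mpac-smartpos-{env} is a Multi-AZ DB instance, as the
# describe-db-instances call below implies. For an Aurora or Multi-AZ
# DB cluster, use "aws rds failover-db-cluster" instead.
aws rds reboot-db-instance \
  --db-instance-identifier mpac-smartpos-{env} \
  --force-failover \
  --region ap-northeast-1

# Monitor failover progress (the primary AZ changes once failover completes)
aws rds describe-db-instances \
  --db-instance-identifier mpac-smartpos-{env} \
  --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone,SecondaryAZ:SecondaryAvailabilityZone}' \
  --region ap-northeast-1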

Expected resolution time: 2-5 minutes (automatic), 10 minutes (manual)


1.3 Redis/ElastiCache Failure

Symptoms: Authentication failures (JWT blocklist unavailable), session errors.

Detection: CacheHits drops, EngineCPUUtilization alarm.

Impact Assessment:

  • svc-portal: JWT revocation list unavailable → issued tokens cannot be revoked until Redis recovers. Security impact: low (tokens expire in 15 min).
  • svc-smarttab: Idempotency cache down → duplicate requests possible. Handle at DB layer.
  • mpac-pgw: Rate limiting down → rate limits not enforced. Monitor for abuse.

Recovery:

bash
# Check ElastiCache status
aws elasticache describe-replication-groups \
  --replication-group-id mpac-smartpos-{env} \
  --region ap-northeast-1

# ElastiCache auto-recovers with Multi-AZ. If primary node fails,
# replica is promoted within 1-2 minutes automatically.

Expected resolution time: 1-3 minutes (automatic failover)
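To confirm that a replica was promoted, the status call above can be narrowed to the fields that matter during a failover. This is a minimal sketch: `redis_group_status` is a hypothetical helper name, and the per-node `CurrentRole` field is only reported for replication groups with cluster mode disabled, which is assumed here.

```shell
#!/usr/bin/env bash
# Sketch only: redis_group_status is an illustrative helper, not an MPAC script.

# redis_group_status REPLICATION_GROUP_ID [REGION]
# Prints the group status, the automatic failover setting, and each node's role.
redis_group_status() {
  local group="$1" region="${2:-ap-northeast-1}"
  aws elasticache describe-replication-groups \
    --replication-group-id "$group" \
    --query 'ReplicationGroups[0].{Status:Status,AutoFailover:AutomaticFailover,Members:NodeGroups[0].NodeGroupMembers[].{Node:CacheClusterId,Role:CurrentRole}}' \
    --output json --region "$region"
}

# Example:
#   redis_group_status mpac-smartpos-prod
```

After a failover, the node previously listed as a replica should appear with the primary role and the group status should return to available.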


1.4 Full Region Failure (ap-northeast-1)

Symptoms: All mpac-smartpos endpoints unreachable.

Detection: External health check against the ap-northeast-1 endpoints fails for more than 5 minutes.

Recovery Steps:

  1. Declare incident, notify team.
  2. Activate DR environment in us-east-1 (the DR standby region):
    bash
    cd mpac-infra
    make deploy-smartpos-prod AWS_PROFILE=prod ENV=dr
    # Uses mpac-smartpos/parameters/production.json with DR VPC CIDRs
  3. Update Route53 record to point api.mpac-cloud.com to the us-east-1 ALB:
    bash
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z0677506ALKL276Q2VVE \
      --change-batch file://shared/scripts/failover-to-dr.json \
      --region us-east-1
  4. Restore latest RDS snapshot in us-east-1:
    bash
    # List the five most recent available snapshots, oldest first.
    # Snapshots are regional: only copies already present in us-east-1
    # can be restored while ap-northeast-1 is down.
    aws rds describe-db-snapshots \
      --db-instance-identifier mpac-smartpos-prod \
      --query "sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime)[-5:].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime}" \
      --output table --region us-east-1
    
    # Restore to us-east-1
    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier mpac-smartpos-dr \
      --db-snapshot-identifier <latest-snapshot-id> \
      --db-instance-class db.r6g.large \
      --region us-east-1
  5. Update DATABASE_URL secret in us-east-1 SSM:
    bash
    aws ssm put-parameter \
      --name /mpac-smartpos/prod/DATABASE_URL \
      --value "postgresql://..." \
      --type SecureString --overwrite \
      --region us-east-1
  6. Verify services start and health checks pass.
  7. Communicate ETA to stakeholders.

Expected RTO: 2-4 hours
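The contents of failover-to-dr.json (step 3) are not shown in this runbook. As an illustration of the Route53 change-batch format, the sketch below generates an UPSERT for a single record. The record type (CNAME), the TTL of 60 seconds, and the helper name `failover_change_batch` are assumptions; the real file may use an alias record instead.

```shell
#!/usr/bin/env bash
# Sketch only: the record type, TTL, and helper name are assumptions.

# failover_change_batch RECORD_NAME TARGET_DNS_NAME [TTL_SECONDS]
# Prints a Route53 change batch that repoints RECORD_NAME at TARGET_DNS_NAME.
failover_change_batch() {
  local record="$1" target="$2" ttl="${3:-60}"
  cat <<EOF
{
  "Comment": "Fail over ${record} to the DR load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${target}" }]
      }
    }
  ]
}
EOF
}

# Example (replace the placeholder with the DR ALB's DNS name):
#   failover_change_batch api.mpac-cloud.com <dr-alb-dns-name> > /tmp/failover-to-dr.json
```

A short TTL keeps the cutover, and the later failback, from being delayed by cached DNS answers.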


1.5 mpac-pgw Region Failure (ap-northeast-1)

Impact: QR payment processing down. Credit card (SP-NET) and cash unaffected.

Recovery Steps:

  1. Activate PGW DR in us-east-1:
    bash
    AWS_REGION=us-east-1 ENV=dr ./mpac-pgw/scripts/provision.sh mpac-pgw-dr dr
  2. Update PGW_BASE_URL in mpac-smartpos SSM to point to us-east-1 DR endpoint.
  3. Restore latest PGW database snapshot:
    bash
    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier mpac-pgw-dr \
      --db-snapshot-identifier <latest-pgw-snapshot> \
      --region us-east-1
  4. Notify payment team — QR transactions during outage window may need manual reconciliation.

Expected RTO: 2-3 hours
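Step 2 has no command attached. A minimal sketch of the SSM update follows. The parameter path `/mpac-smartpos/prod/PGW_BASE_URL` mirrors the DATABASE_URL path used elsewhere in this runbook but is an assumption, as is the helper name `set_pgw_base_url`. No `--type` is passed because SSM does not require it when overwriting an existing parameter.

```shell
#!/usr/bin/env bash
# Sketch only: the parameter path and helper name are assumptions.

# set_pgw_base_url URL [PARAMETER_NAME] [REGION]
set_pgw_base_url() {
  local url="$1"
  local param="${2:-/mpac-smartpos/prod/PGW_BASE_URL}"
  local region="${3:-ap-northeast-1}"

  # Guard against pointing payment traffic at a non-TLS endpoint by mistake.
  case "$url" in
    https://*) ;;
    *) echo "refusing non-HTTPS URL: $url" >&2; return 1 ;;
  esac

  aws ssm put-parameter \
    --name "$param" --value "$url" \
    --overwrite --region "$region"
}

# Example (replace the placeholder with the DR endpoint):
#   set_pgw_base_url https://<pgw-dr-endpoint>
```

Services that read the parameter only at startup would need a restart or a forced deployment to pick up the new value.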


1.6 mpac-obs Failure

Impact: Observability only. No user-facing impact.

Recovery:

  • Re-deploy ECS stack: make deploy-obs-prod
  • Historical data is in S3 (Loki + Tempo). The new EC2 OBS box re-attaches the same S3 buckets.
  • Grafana dashboards are in git (mpac-obs/grafana/dashboards/) and are provisioned automatically on restart.

Expected RTO: 30-60 minutes


2. Database Backup & Restore

2.1 Automated Backups

| System | Backup Window | Retention | Point-in-Time Recovery |
|---|---|---|---|
| mpac-smartpos RDS | 02:00-03:00 UTC | 7 days | Yes (5-min intervals) |
| mpac-pgw RDS | 03:00-04:00 UTC | 7 days | Yes (5-min intervals) |

2.2 Manual Snapshot Before Risky Operations

bash
# Before production deployments with schema changes
aws rds create-db-snapshot \
  --db-instance-identifier mpac-smartpos-prod \
  --db-snapshot-identifier mpac-smartpos-prod-pre-deploy-$(date +%Y%m%d) \
  --region ap-northeast-1
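A snapshot is only useful once it has finished. The sketch below creates the snapshot and blocks until RDS reports it as available, so a deployment script can gate on it. The helper name `pre_deploy_snapshot` is hypothetical, and the identifier adds hours and minutes to the date so that two snapshots on the same day do not collide.

```shell
#!/usr/bin/env bash
# Sketch only: pre_deploy_snapshot is an illustrative helper, not an MPAC script.

# pre_deploy_snapshot DB_INSTANCE_ID [REGION]
# Prints the snapshot identifier once the snapshot is available.
pre_deploy_snapshot() {
  local instance="$1" region="${2:-ap-northeast-1}"
  local snapshot_id
  snapshot_id="${instance}-pre-deploy-$(date -u +%Y%m%d-%H%M)"

  aws rds create-db-snapshot \
    --db-instance-identifier "$instance" \
    --db-snapshot-identifier "$snapshot_id" \
    --region "$region" >/dev/null || return 1

  # Blocks until the snapshot status is "available" or the waiter times out.
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$snapshot_id" \
    --region "$region" || return 1

  echo "$snapshot_id"
}

# Example:
#   snap=$(pre_deploy_snapshot mpac-smartpos-prod) && echo "safe to deploy, snapshot: $snap"
```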

2.3 Point-in-Time Recovery

bash
# Restore to a specific point in time (e.g., 30 minutes ago).
# This creates a NEW DB instance; the source instance is left unchanged,
# so services must be repointed at the restored instance afterwards.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mpac-smartpos-prod \
  --target-db-instance-identifier "mpac-smartpos-pitr-$(date +%Y%m%d%H%M)" \
  --restore-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --region ap-northeast-1
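The `date -d '30 minutes ago'` form used for the restore time is GNU-specific and fails on macOS. A small sketch of a portable alternative follows; `minutes_ago_utc` is a hypothetical helper that tries the GNU syntax first and falls back to the BSD `-v` adjustment.

```shell
#!/usr/bin/env bash
# Sketch only: minutes_ago_utc is an illustrative helper, not an MPAC script.

# minutes_ago_utc N
# Prints the UTC time N minutes ago, formatted as RDS expects for --restore-time.
minutes_ago_utc() {
  local minutes="$1" fmt='+%Y-%m-%dT%H:%M:%SZ'
  # GNU date (Linux) understands relative date strings.
  if date -u -d "${minutes} minutes ago" "$fmt" 2>/dev/null; then
    return 0
  fi
  # BSD date (macOS) adjusts the current time with -v instead.
  date -u -v-"${minutes}"M "$fmt"
}

# Example:
#   --restore-time "$(minutes_ago_utc 30)"
```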

3. Incident Communication

3.1 Severity Levels

| Severity | Impact | Response Time | Update Frequency |
|---|---|---|---|
| P0 | Full outage, all users | 5 min | Every 15 min |
| P1 | Partial outage (>25% users) | 15 min | Every 30 min |
| P2 | Degraded performance | 1 hour | Every 2 hours |
| P3 | Single feature down | 4 hours | Daily |
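If incident tooling needs these targets, for example to compute an acknowledgement deadline, the severity levels can be encoded as a lookup. This is a sketch under that assumption: `severity_policy` is a hypothetical helper, and it expresses both values in minutes (daily = 1440).

```shell
#!/usr/bin/env bash
# Sketch only: severity_policy is an illustrative helper, not an MPAC script.

# severity_policy P0|P1|P2|P3
# Prints "RESPONSE_MINUTES UPDATE_INTERVAL_MINUTES" for the given severity.
severity_policy() {
  case "$1" in
    P0) echo "5 15" ;;       # full outage, all users
    P1) echo "15 30" ;;      # partial outage (>25% users)
    P2) echo "60 120" ;;     # degraded performance
    P3) echo "240 1440" ;;   # single feature down; daily updates
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example: split the two values into variables.
#   read -r response_min update_min <<< "$(severity_policy P1)"
```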

3.2 Incident Response Steps

  1. Detect: Alarm fires → PagerDuty → on-call engineer.
  2. Acknowledge: Engineer acknowledges within response time.
  3. Assess: Determine severity, affected components.
  4. Mitigate: Apply immediate mitigation (rollback, scale up, disable feature).
  5. Communicate: Post to #incidents Slack channel.
  6. Resolve: Full fix or workaround in place.
  7. Post-mortem: Write post-mortem within 48 hours for P0/P1.

4. Contact & Escalation

| Role | Escalation |
|---|---|
| On-call engineer | PagerDuty rotation |
| Platform lead | P0/P1 only |
| AWS support | For AWS infrastructure failures (Business/Enterprise support) |

5. DR Test Schedule

Perform DR drills quarterly:

  • January: mpac-smartpos region failover simulation
  • April: RDS point-in-time recovery drill
  • July: mpac-pgw failover simulation
  • October: Full DR simulation (all systems)

Document results in docs/dr-test-results/YYYY-QQ.md.
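To keep report names consistent across drills, the path can be derived from the drill date. The sketch below assumes that YYYY-QQ means a four-digit year followed by the quarter, for example 2025-Q2; `dr_report_path` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: the YYYY-Qn reading of "YYYY-QQ" is an assumption.

# dr_report_path YEAR MONTH
# Prints the results file path for the quarter that contains MONTH (1-12).
dr_report_path() {
  local year="$1" month="$2"
  # 10# forces base 10 so zero-padded months such as 08 are not read as octal.
  local quarter=$(( (10#$month - 1) / 3 + 1 ))
  printf 'docs/dr-test-results/%s-Q%s.md\n' "$year" "$quarter"
}

# Example:
#   dr_report_path "$(date +%Y)" "$(date +%m)"
```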

MPAC — MP-Solution Advanced Cloud Service