Operational Runbook
This runbook provides step-by-step instructions for common operational tasks across all three MPAC platform systems.
Table of Contents
- Initial Deployment
- Scaling Services
- Viewing Logs
- Database Access
- Secret Rotation
- Rollback Procedures
- Disaster Recovery
- Troubleshooting
Initial Deployment
mpac-smartpos (ap-northeast-1)
bash
# 1. Deploy infrastructure
make deploy-smartpos-dev
# 2. Update application secrets
cd mpac-smartpos/scripts
./update-secrets.sh
# 3. Build and push Docker images (from application repos)
# svc-portal
aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin <ECR_URI>
docker build -t <ECR_URI>/mpac-smartpos/svc-portal:latest .
docker push <ECR_URI>/mpac-smartpos/svc-portal:latest
# svc-smarttab
docker build -t <ECR_URI>/mpac-smartpos/svc-smarttab:latest .
docker push <ECR_URI>/mpac-smartpos/svc-smarttab:latest
# 4. Scale up services
aws ecs update-service --cluster mpac-smartpos-dev --service svc-portal --desired-count 1 --region ap-northeast-1
aws ecs update-service --cluster mpac-smartpos-dev --service svc-smarttab --desired-count 1 --region ap-northeast-1
# 5. Wait for stability
aws ecs wait services-stable --cluster mpac-smartpos-dev --services svc-portal svc-smarttab --region ap-northeast-1
mpac-pgw (ap-northeast-1)
bash
# 1. Deploy infrastructure
make deploy-pgw-dev
# 2. Build and push Docker image
docker build -t <ECR_URI>/mpac-pgw/pgw-backend:latest .
docker push <ECR_URI>/mpac-pgw/pgw-backend:latest
# 3. Scale up service
aws ecs update-service --cluster mpac-pgw-dev --service pgw-backend --desired-count 1 --region ap-northeast-1
# 4. Wait for stability
aws ecs wait services-stable --cluster mpac-pgw-dev --services pgw-backend --region ap-northeast-1
mpac-obs (ap-northeast-1)
bash
# 1. Mirror Docker images to ECR
cd mpac-obs/scripts
./mirror-images-to-ecr.sh
# 2. Deploy full stack
make deploy-obs-dev
# 3. Access Grafana
# URL will be shown in deployment output
Scaling Services
ECS Service Scaling
bash
# mpac-smartpos
aws ecs update-service --cluster mpac-smartpos-<ENV> --service svc-portal --desired-count <N> --region ap-northeast-1
aws ecs update-service --cluster mpac-smartpos-<ENV> --service svc-smarttab --desired-count <N> --region ap-northeast-1
# mpac-pgw
aws ecs update-service --cluster mpac-pgw-<ENV> --service pgw-backend --desired-count <N> --region ap-northeast-1
# mpac-obs
aws ecs update-service --cluster mpac-obs-<ENV> --service mpac-obs-alloy-<ENV> --desired-count <N> --region ap-northeast-1
aws ecs update-service --cluster mpac-obs-<ENV> --service mpac-obs-grafana-<ENV> --desired-count <N> --region ap-northeast-1
Viewing Logs
CloudWatch Logs
bash
# mpac-smartpos
aws logs tail /ecs/mpac-smartpos-dev/svc-portal --follow --region ap-northeast-1
aws logs tail /ecs/mpac-smartpos-dev/svc-smarttab --follow --region ap-northeast-1
# mpac-pgw
aws logs tail /ecs/mpac-pgw-dev --follow --region ap-northeast-1
# mpac-obs
aws logs tail /ecs/mpac-obs/alloy-dev --follow --region ap-northeast-1
aws logs tail /ecs/mpac-obs/grafana-dev --follow --region ap-northeast-1
OBS Box Logs (mpac-obs)
bash
# Connect via SSM
INSTANCE_ID=$(aws cloudformation describe-stacks \
--stack-name mpac-obs-obs-box-dev \
--query 'Stacks[0].Outputs[?OutputKey==`OBSBoxInstanceId`].OutputValue' \
--output text --region ap-northeast-1)
aws ssm start-session --target "$INSTANCE_ID" --region ap-northeast-1
# Once connected:
cd /opt/obs-box
docker compose logs --tail=100
docker compose logs prometheus --tail=50
docker compose logs loki --tail=50
docker compose logs tempo --tail=50
Database Access
mpac-smartpos RDS
Access is through ECS Exec or a bastion host (if one is configured).
bash
# Via ECS Exec
TASK_ARN=$(aws ecs list-tasks --cluster mpac-smartpos-dev --service-name svc-portal \
--query 'taskArns[0]' --output text --region ap-northeast-1)
aws ecs execute-command --cluster mpac-smartpos-dev --task $TASK_ARN \
--container svc-portal --interactive --command "/bin/sh" --region ap-northeast-1
mpac-pgw RDS
Access through the bastion host.
bash
# Get bastion IP and key
BASTION_IP=$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev \
--query 'Stacks[0].Outputs[?OutputKey==`BastionPublicIP`].OutputValue' \
--output text --region ap-northeast-1)
KEY_PAIR_ID=$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev \
--query 'Stacks[0].Outputs[?OutputKey==`BastionKeyPairId`].OutputValue' \
--output text --region ap-northeast-1)
# Download private key
aws ssm get-parameter --name /ec2/keypair/$KEY_PAIR_ID \
--with-decryption --query 'Parameter.Value' --output text --region ap-northeast-1 > bastion-key.pem
chmod 600 bastion-key.pem
# SSH tunnel to RDS
RDS_ENDPOINT=$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev \
--query 'Stacks[0].Outputs[?OutputKey==`RdsEndpoint`].OutputValue' \
--output text --region ap-northeast-1)
ssh -i bastion-key.pem -L 5433:$RDS_ENDPOINT:5432 ec2-user@$BASTION_IP
# Connect to DB (from another terminal)
psql -h localhost -p 5433 -U pgw_admin -d pgwdb
Secret Rotation
mpac-smartpos Secrets
bash
cd mpac-smartpos/scripts
STACK_NAME=mpac-smartpos-dev AWS_REGION=ap-northeast-1 ./update-secrets.sh
RDS Password Rotation
RDS passwords are auto-generated and managed by AWS Secrets Manager. To rotate one manually:
bash
aws secretsmanager rotate-secret \
--secret-id <SECRET_ARN> \
--region <REGION>
Rollback Procedures
CloudFormation Stack Rollback
If a deployment fails:
bash
# Check current status
aws cloudformation describe-stacks --stack-name <STACK_NAME> --region <REGION> \
--query 'Stacks[0].StackStatus'
# View failure events
aws cloudformation describe-stack-events --stack-name <STACK_NAME> --region <REGION> \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED` || ResourceStatus==`UPDATE_FAILED`]' \
--output table
# If stuck in UPDATE_ROLLBACK_FAILED
aws cloudformation continue-update-rollback --stack-name <STACK_NAME> --region <REGION>
# If stack needs to be deleted and recreated
aws cloudformation delete-stack --stack-name <STACK_NAME> --region <REGION>
aws cloudformation wait stack-delete-complete --stack-name <STACK_NAME> --region <REGION>
ECS Service Rollback
ECS services have deployment circuit breakers enabled, so failed deployments roll back automatically.
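To see whether the circuit breaker has acted on a deployment, the rollout state of the service's deployments can be inspected. The following is a minimal sketch rather than part of the established procedure: the function name `ecs_rollout_state` is hypothetical, and the default region is assumed from the rest of this runbook.

```bash
# Hypothetical helper (not part of the repo scripts): print the rollout state
# of every deployment currently attached to an ECS service.
# Usage: ecs_rollout_state <CLUSTER> <SERVICE> [REGION]
ecs_rollout_state() {
  local cluster="$1" service="$2" region="${3:-ap-northeast-1}"
  aws ecs describe-services \
    --cluster "$cluster" --services "$service" --region "$region" \
    --query 'services[0].deployments[].{status:status,rollout:rolloutState,reason:rolloutStateReason,taskDef:taskDefinition}' \
    --output table
}

# Example (requires AWS credentials):
#   ecs_rollout_state mpac-smartpos-dev svc-portal
```

A `rollout` value of `FAILED` on a recent deployment, together with its `reason`, is usually the quickest indication that an automatic rollback took place.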
To manually roll back to a previous task definition:
bash
# List recent task definitions
aws ecs list-task-definitions --family-prefix <FAMILY> --sort DESC --max-items 5 --region <REGION>
# Update service to use previous revision
aws ecs update-service --cluster <CLUSTER> --service <SERVICE> \
--task-definition <FAMILY>:<PREVIOUS_REVISION> --region <REGION>
Disaster Recovery
Data Backup Locations
| System | Data | Backup Strategy |
|---|---|---|
| mpac-smartpos | RDS PostgreSQL | Automated snapshots (7-day retention) |
| mpac-pgw | RDS PostgreSQL | Automated snapshots (7-day retention) |
| mpac-obs | Prometheus | Local TSDB (14-day retention) |
| mpac-obs | Loki | S3 (14-day lifecycle policy) |
| mpac-obs | Tempo | S3 (14-day lifecycle policy) |
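Before relying on the RDS retention figures in the table, it can help to confirm what a given instance actually reports. The sketch below is illustrative only: `rds_backup_window` is a hypothetical name, and the instance identifier must be supplied by the operator.

```bash
# Hypothetical helper: show the configured backup retention (in days) and the
# latest point in time to which the instance can currently be restored.
# Usage: rds_backup_window <DB_INSTANCE_ID> [REGION]
rds_backup_window() {
  local instance="$1" region="${2:-ap-northeast-1}"
  aws rds describe-db-instances \
    --db-instance-identifier "$instance" --region "$region" \
    --query 'DBInstances[0].{RetentionDays:BackupRetentionPeriod,LatestRestorableTime:LatestRestorableTime}' \
    --output table
}

# Example (requires AWS credentials):
#   rds_backup_window <SOURCE_INSTANCE>
```

The `LatestRestorableTime` value is also a reasonable upper bound when choosing a timestamp for a point-in-time restore.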
RDS Point-in-Time Recovery
bash
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier <SOURCE_INSTANCE> \
--target-db-instance-identifier <TARGET_INSTANCE> \
--restore-time <TIMESTAMP> \
--region <REGION>
Troubleshooting
Common Issues
ECS tasks fail to start
- Check CloudWatch logs for the task
- Verify secrets are populated correctly
- Check security group rules allow connectivity to RDS/Redis
- Verify ECR image exists and is accessible
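The checks above often start from the reason ECS recorded when it stopped the task. A minimal sketch of retrieving it follows; `ecs_stopped_reasons` is a hypothetical helper name, and it assumes the service has recently stopped tasks that ECS still lists.

```bash
# Hypothetical helper: print stop reasons and container exit details for
# recently stopped tasks of a service.
# Usage: ecs_stopped_reasons <CLUSTER> <SERVICE> [REGION]
ecs_stopped_reasons() {
  local cluster="$1" service="$2" region="${3:-ap-northeast-1}"
  local tasks
  tasks=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --desired-status STOPPED --query 'taskArns' --output text --region "$region")
  if [ -z "$tasks" ] || [ "$tasks" = "None" ]; then
    echo "No stopped tasks found for $service"
    return 0
  fi
  # $tasks is intentionally unquoted so each ARN becomes a separate argument.
  aws ecs describe-tasks --cluster "$cluster" --tasks $tasks --region "$region" \
    --query 'tasks[].{stopped:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}

# Example (requires AWS credentials):
#   ecs_stopped_reasons mpac-smartpos-dev svc-portal
```

Image pull failures and missing secrets typically surface in `stoppedReason` or in the per-container `reason` field, which narrows down which bullet to follow up first.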
RDS connection timeout
- Verify ECS security group has access to RDS security group
- Check RDS is in the correct subnet group
- Verify database credentials in Secrets Manager
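A quick way to separate network problems from credential problems is to test whether the database port is reachable at all. The following is a sketch under stated assumptions: `port_open` is a hypothetical helper, and it relies on `bash` (for `/dev/tcp`) and coreutils `timeout` being available wherever it runs, which may not hold inside a minimal container image.

```bash
# Hypothetical helper: succeed if a TCP connection to host:port opens within
# the timeout, fail otherwise. Uses bash's /dev/tcp pseudo-device.
# Usage: port_open <HOST> <PORT> [TIMEOUT_SECONDS]
port_open() {
  local host="$1" port="$2" wait="${3:-3}"
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, run from a shell inside the VPC (for instance via ECS Exec):
#   port_open <RDS_ENDPOINT> 5432 && echo reachable || echo blocked
```

If the port is unreachable, the security group and subnet checks are the likely culprits; if it is reachable, the credentials in Secrets Manager deserve a closer look.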
mpac-obs OBS Box not healthy
- SSM into the instance and check Docker logs
- Verify S3 bucket permissions for Loki/Tempo
- Check EBS volume has sufficient space: `df -h`
- Restart services: `cd /opt/obs-box && docker compose restart`
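The disk-space check can be turned into a pass/fail test, which is handy when running it repeatedly on the instance. This is a minimal sketch: the helper names are hypothetical, and the 85% threshold is an arbitrary example value, not a documented limit for the OBS Box.

```bash
# Hypothetical helper: print the used-space percentage of the filesystem
# holding the given path (default: /) as a bare integer.
disk_usage_pct() {
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Report OK or WARNING depending on whether usage has reached the threshold.
# Usage: check_disk [PATH] [THRESHOLD_PERCENT]
check_disk() {
  local pct
  pct=$(disk_usage_pct "${1:-/}")
  if [ "$pct" -ge "${2:-85}" ]; then
    echo "WARNING: ${1:-/} is ${pct}% full"
  else
    echo "OK: ${1:-/} is ${pct}% full"
  fi
}

# Example, once connected to the instance:
#   check_disk /opt/obs-box 85
```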
ALB returns 503
- Check ECS service has running tasks
- Verify target group health checks are passing
- Check ECS service events for deployment failures
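The first two checks above map onto two AWS CLI queries. The sketch below wraps them in hypothetical helpers (`ecs_task_counts`, `alb_target_health`); the target group ARN is not given in this runbook and has to be looked up by the operator.

```bash
# Hypothetical helper: compare desired, running, and pending task counts.
# Usage: ecs_task_counts <CLUSTER> <SERVICE> [REGION]
ecs_task_counts() {
  local cluster="$1" service="$2" region="${3:-ap-northeast-1}"
  aws ecs describe-services --cluster "$cluster" --services "$service" \
    --region "$region" \
    --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}' \
    --output table
}

# Hypothetical helper: list the health state of each target in a target group.
# Usage: alb_target_health <TARGET_GROUP_ARN> [REGION]
alb_target_health() {
  local tg_arn="$1" region="${2:-ap-northeast-1}"
  aws elbv2 describe-target-health --target-group-arn "$tg_arn" --region "$region" \
    --query 'TargetHealthDescriptions[].{target:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason}' \
    --output table
}

# Example (requires AWS credentials):
#   ecs_task_counts mpac-pgw-dev pgw-backend
#   alb_target_health <TARGET_GROUP_ARN>
```

A running count of zero points at the service itself, while running tasks with unhealthy targets point at the health check path or port.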
Health Check Endpoints
| System | Endpoint | Expected |
|---|---|---|
| mpac-smartpos | /health | 200 OK |
| mpac-smartpos | /api/v1/auth/health | 200 OK |
| mpac-pgw | /health | 200 OK |
| mpac-pgw | /health/ready | 200 OK |
| mpac-obs Grafana | /api/health | 200 OK |
| mpac-obs Alloy | /-/ready (port 12345) | 200 OK |
| mpac-obs Prometheus | /-/healthy (port 9090) | 200 OK |
| mpac-obs Loki | /ready (port 3100) | 200 OK |
| mpac-obs Tempo | /ready (port 3200) | 200 OK |
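The endpoints in the table can be polled in one pass. The following is a minimal sketch, assuming `curl` is installed and the base URL is reachable from where the script runs; the helper names are hypothetical, and the host placeholder is not defined anywhere in this runbook.

```bash
# Hypothetical helper: print the HTTP status code returned by a URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Check each path against a base URL and report PASS or FAIL per endpoint.
# Returns non-zero if any endpoint did not answer with 200.
# Usage: check_endpoints <BASE_URL> <PATH>...
check_endpoints() {
  local base="$1" path code failed=0
  shift
  for path in "$@"; do
    code=$(http_status "${base}${path}")
    if [ "$code" = "200" ]; then
      echo "PASS ${path}"
    else
      echo "FAIL ${path} (status ${code})"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   check_endpoints "https://<PGW_HOST>" /health /health/ready
```

Because the function returns a non-zero status on any failure, it can also gate a deployment step or a scripted rollback decision.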