Operational Runbook

This runbook provides step-by-step instructions for common operational tasks across all three MPAC platform systems.

Table of Contents

  1. Initial Deployment
  2. Scaling Services
  3. Viewing Logs
  4. Database Access
  5. Secret Rotation
  6. Rollback Procedures
  7. Disaster Recovery
  8. Troubleshooting

Initial Deployment

mpac-smartpos (ap-northeast-1)

bash
# 1. Deploy infrastructure
make deploy-smartpos-dev

# 2. Update application secrets
cd mpac-smartpos/scripts
./update-secrets.sh

# 3. Build and push Docker images (from application repos)
# svc-portal
aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin <ECR_URI>
docker build -t <ECR_URI>/mpac-smartpos/svc-portal:latest .
docker push <ECR_URI>/mpac-smartpos/svc-portal:latest

# svc-smarttab
docker build -t <ECR_URI>/mpac-smartpos/svc-smarttab:latest .
docker push <ECR_URI>/mpac-smartpos/svc-smarttab:latest

# 4. Scale up services
aws ecs update-service --cluster mpac-smartpos-dev --service svc-portal --desired-count 1 --region ap-northeast-1
aws ecs update-service --cluster mpac-smartpos-dev --service svc-smarttab --desired-count 1 --region ap-northeast-1

# 5. Wait for stability
aws ecs wait services-stable --cluster mpac-smartpos-dev --services svc-portal svc-smarttab --region ap-northeast-1

mpac-pgw (ap-northeast-1)

bash
# 1. Deploy infrastructure
make deploy-pgw-dev

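# 2. Build and push Docker image (from the application repo)
# Log in to ECR first if this shell is not already authenticated
aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin <ECR_URI>
docker build -t <ECR_URI>/mpac-pgw/pgw-backend:latest .
docker push <ECR_URI>/mpac-pgw/pgw-backend:latest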

# 3. Scale up service
aws ecs update-service --cluster mpac-pgw-dev --service pgw-backend --desired-count 1 --region ap-northeast-1

# 4. Wait for stability
aws ecs wait services-stable --cluster mpac-pgw-dev --services pgw-backend --region ap-northeast-1

mpac-obs (ap-northeast-1)

bash
# 1. Mirror Docker images to ECR
cd mpac-obs/scripts
./mirror-images-to-ecr.sh

# 2. Deploy full stack
make deploy-obs-dev

# 3. Access Grafana
# URL will be shown in deployment output

Scaling Services

ECS Service Scaling

bash
# mpac-smartpos
aws ecs update-service --cluster mpac-smartpos-<ENV> --service svc-portal --desired-count <N> --region ap-northeast-1
aws ecs update-service --cluster mpac-smartpos-<ENV> --service svc-smarttab --desired-count <N> --region ap-northeast-1

# mpac-pgw
aws ecs update-service --cluster mpac-pgw-<ENV> --service pgw-backend --desired-count <N> --region ap-northeast-1

# mpac-obs
aws ecs update-service --cluster mpac-obs-<ENV> --service mpac-obs-alloy-<ENV> --desired-count <N> --region ap-northeast-1
aws ecs update-service --cluster mpac-obs-<ENV> --service mpac-obs-grafana-<ENV> --desired-count <N> --region ap-northeast-1
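
Scaling is usually followed by a wait for stability, so the two steps can be wrapped together. The following is a minimal sketch, not part of the platform tooling: the function name `scale_service` and its argument order are hypothetical, and only the two `aws ecs` calls are taken from this runbook.

```bash
# Hypothetical helper (not part of the MPAC repos): scale one ECS service,
# then block until it is stable.
# Usage: scale_service CLUSTER SERVICE DESIRED_COUNT [REGION]
scale_service() {
    local cluster="$1" service="$2" count="$3" region="${4:-ap-northeast-1}"

    # Reject anything that is not a non-negative integer before calling AWS.
    case "$count" in
        ''|*[!0-9]*)
            echo "desired count must be a non-negative integer" >&2
            return 2
            ;;
    esac

    aws ecs update-service --cluster "$cluster" --service "$service" \
        --desired-count "$count" --region "$region" >/dev/null || return 1

    # Returns non-zero if the service does not stabilize before the waiter gives up.
    aws ecs wait services-stable --cluster "$cluster" --services "$service" \
        --region "$region"
}
```

For example, `scale_service mpac-pgw-dev pgw-backend 2` would scale the dev pgw backend to two tasks and return once the service reports stable.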

Viewing Logs

CloudWatch Logs

bash
# mpac-smartpos
aws logs tail /ecs/mpac-smartpos-dev/svc-portal --follow --region ap-northeast-1
aws logs tail /ecs/mpac-smartpos-dev/svc-smarttab --follow --region ap-northeast-1

# mpac-pgw
aws logs tail /ecs/mpac-pgw-dev --follow --region ap-northeast-1

# mpac-obs
aws logs tail /ecs/mpac-obs/alloy-dev --follow --region ap-northeast-1
aws logs tail /ecs/mpac-obs/grafana-dev --follow --region ap-northeast-1
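
When investigating an incident, it is often more useful to look back over a fixed window and filter for errors than to follow the live stream. The sketch below assumes the applications log the literal string `ERROR`, which this runbook does not confirm; the function name `recent_errors` and its defaults are hypothetical.

```bash
# Hypothetical helper: print recent log events that match a filter pattern.
# Usage: recent_errors LOG_GROUP [SINCE] [PATTERN] [REGION]
recent_errors() {
    local group="$1" since="${2:-30m}" pattern="${3:-ERROR}" region="${4:-ap-northeast-1}"

    # --since takes a relative time such as 30m or 2h;
    # --filter-pattern uses CloudWatch Logs filter syntax.
    aws logs tail "$group" --since "$since" --filter-pattern "$pattern" \
        --format short --region "$region"
}
```

For example, `recent_errors /ecs/mpac-smartpos-dev/svc-portal 2h` would print matching events from the last two hours and then exit.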

OBS Box Logs (mpac-obs)

bash
# Connect via SSM
INSTANCE_ID=$(aws cloudformation describe-stacks \
    --stack-name mpac-obs-obs-box-dev \
    --query 'Stacks[0].Outputs[?OutputKey==`OBSBoxInstanceId`].OutputValue' \
    --output text --region ap-northeast-1)

aws ssm start-session --target "$INSTANCE_ID" --region ap-northeast-1

# Once connected:
cd /opt/obs-box
docker compose logs --tail=100
docker compose logs prometheus --tail=50
docker compose logs loki --tail=50
docker compose logs tempo --tail=50

Database Access

mpac-smartpos RDS

Access is through ECS Exec or a bastion host (if configured).

bash
# Via ECS Exec: open a shell in a running svc-portal task
TASK_ARN=$(aws ecs list-tasks --cluster mpac-smartpos-dev --service-name svc-portal \
    --query 'taskArns[0]' --output text --region ap-northeast-1)
aws ecs execute-command --cluster mpac-smartpos-dev --task "$TASK_ARN" \
    --container svc-portal --interactive --command "/bin/sh" --region ap-northeast-1
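
If `execute-command` is rejected, one possible cause is that ECS Exec was not enabled when the task was launched. The sketch below checks the task's `enableExecuteCommand` flag; the function name `exec_enabled` is hypothetical, and whether the MPAC task definitions enable ECS Exec is not stated in this runbook.

```bash
# Hypothetical helper: succeed only if ECS Exec is enabled on the given task.
# Usage: exec_enabled CLUSTER TASK_ARN [REGION]
exec_enabled() {
    local cluster="$1" task="$2" region="${3:-ap-northeast-1}" flag

    flag=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
        --query 'tasks[0].enableExecuteCommand' --output text --region "$region") || return 2

    # Accept either capitalization of the boolean in the text output.
    case "$flag" in
        [Tt]rue) return 0 ;;
        *)
            echo "ECS Exec is not enabled on $task" >&2
            return 1
            ;;
    esac
}
```

For example, `exec_enabled mpac-smartpos-dev "$TASK_ARN" && echo ready` prints `ready` only when the flag is set.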

mpac-pgw RDS

Access through the bastion host.

bash
# Get bastion IP and key
BASTION_IP=$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev \
    --query 'Stacks[0].Outputs[?OutputKey==`BastionPublicIP`].OutputValue' \
    --output text --region ap-northeast-1)

KEY_PAIR_ID=$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev \
    --query 'Stacks[0].Outputs[?OutputKey==`BastionKeyPairId`].OutputValue' \
    --output text --region ap-northeast-1)

# Download private key
aws ssm get-parameter --name /ec2/keypair/$KEY_PAIR_ID \
    --with-decryption --query 'Parameter.Value' --output text --region ap-northeast-1 > bastion-key.pem
chmod 600 bastion-key.pem

# SSH tunnel to RDS
RDS_ENDPOINT=$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev \
    --query 'Stacks[0].Outputs[?OutputKey==`RdsEndpoint`].OutputValue' \
    --output text --region ap-northeast-1)

# Forward local port 5433 to the RDS endpoint through the bastion (keep this session open)
ssh -i bastion-key.pem -L "5433:$RDS_ENDPOINT:5432" "ec2-user@$BASTION_IP"

# Connect to DB (from another terminal)
psql -h localhost -p 5433 -U pgw_admin -d pgwdb

Secret Rotation

mpac-smartpos Secrets

bash
cd mpac-smartpos/scripts
STACK_NAME=mpac-smartpos-dev AWS_REGION=ap-northeast-1 ./update-secrets.sh

RDS Password Rotation

RDS passwords are auto-generated and stored in AWS Secrets Manager. To trigger a rotation manually (the secret must already have rotation configured):

bash
aws secretsmanager rotate-secret \
    --secret-id <SECRET_ARN> \
    --region <REGION>
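
Before and after a manual rotation it can help to confirm the secret's rotation state. This is a sketch only: the function name `rotation_status` is hypothetical, and whether rotation is configured on the MPAC secrets is an assumption to verify.

```bash
# Hypothetical helper: show whether rotation is enabled and when it last ran.
# Usage: rotation_status SECRET_ID_OR_ARN [REGION]
rotation_status() {
    local secret="$1" region="${2:-ap-northeast-1}"

    # Fields that are not set on the secret are printed as "None" in text output.
    aws secretsmanager describe-secret --secret-id "$secret" \
        --query '[RotationEnabled, LastRotatedDate, NextRotationDate]' \
        --output text --region "$region"
}
```

A first column of `None` or `False` suggests that rotation has not been configured for that secret, in which case `rotate-secret` is likely to fail until it is.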

Rollback Procedures

CloudFormation Stack Rollback

If a deployment fails:

bash
# Check current status
aws cloudformation describe-stacks --stack-name <STACK_NAME> --region <REGION> \
    --query 'Stacks[0].StackStatus'

# View failure events
aws cloudformation describe-stack-events --stack-name <STACK_NAME> --region <REGION> \
    --query 'StackEvents[?ResourceStatus==`CREATE_FAILED` || ResourceStatus==`UPDATE_FAILED`]' \
    --output table

# If stuck in UPDATE_ROLLBACK_FAILED
aws cloudformation continue-update-rollback --stack-name <STACK_NAME> --region <REGION>

# If the stack must be deleted and recreated
# (destructive: deletes the stack's resources unless their DeletionPolicy retains them)
aws cloudformation delete-stack --stack-name <STACK_NAME> --region <REGION>
aws cloudformation wait stack-delete-complete --stack-name <STACK_NAME> --region <REGION>
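
The choice between the steps above depends on the stack status returned by the first command. The mapping below is one possible reading of this procedure, not an official decision table: the function name and the action labels are hypothetical.

```bash
# Hypothetical helper: map a CloudFormation stack status to a runbook step.
# Usage: next_step_for_status STACK_STATUS
next_step_for_status() {
    case "$1" in
        UPDATE_ROLLBACK_FAILED)
            echo "continue-update-rollback" ;;
        # A stack that failed on creation cannot be updated; it has to be deleted.
        ROLLBACK_COMPLETE|ROLLBACK_FAILED|CREATE_FAILED)
            echo "delete-and-recreate" ;;
        *_IN_PROGRESS)
            echo "wait" ;;
        # Includes UPDATE_ROLLBACK_COMPLETE: the stack is back in its previous state.
        *_COMPLETE)
            echo "none" ;;
        *)
            echo "inspect-events" ;;
    esac
}
```

It could be fed from the status query, for example `next_step_for_status "$(aws cloudformation describe-stacks --stack-name mpac-pgw-dev --query 'Stacks[0].StackStatus' --output text --region ap-northeast-1)"`.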

ECS Service Rollback

ECS services have deployment circuit breakers enabled, so failed deployments roll back automatically.

To roll back manually to a previous task definition:

bash
# List recent task definitions
aws ecs list-task-definitions --family-prefix <FAMILY> --sort DESC --max-items 5 --region <REGION>

# Update service to use previous revision
aws ecs update-service --cluster <CLUSTER> --service <SERVICE> \
    --task-definition <FAMILY>:<PREVIOUS_REVISION> --region <REGION>
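
Picking the previous revision by hand is error-prone during an incident. The sketch below returns the second-newest active task definition for a family. It assumes the service is currently running the newest revision and that no other family shares the same prefix; the function name `previous_task_definition` is hypothetical.

```bash
# Hypothetical helper: print the ARN of the second-newest ACTIVE task definition.
# Usage: previous_task_definition FAMILY [REGION]
previous_task_definition() {
    local family="$1" region="${2:-ap-northeast-1}"

    # Text output separates ARNs with tabs; put one per line and take the second.
    # Caveat: --family-prefix also matches other families that start with FAMILY.
    aws ecs list-task-definitions --family-prefix "$family" --status ACTIVE \
        --sort DESC --query 'taskDefinitionArns[]' --output text --region "$region" \
        | tr '\t' '\n' | sed -n '2p'
}
```

The result can be passed directly to `--task-definition`, which accepts a full ARN as well as `family:revision`. The output is empty when fewer than two active revisions exist, so check it before updating the service.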

Disaster Recovery

Data Backup Locations

| System | Data | Backup Strategy |
|---|---|---|
| mpac-smartpos | RDS PostgreSQL | Automated snapshots (7-day retention) |
| mpac-pgw | RDS PostgreSQL | Automated snapshots (7-day retention) |
| mpac-obs | Prometheus | Local TSDB (14-day retention) |
| mpac-obs | Loki | S3 (14-day lifecycle policy) |
| mpac-obs | Tempo | S3 (14-day lifecycle policy) |

RDS Point-in-Time Recovery

bash
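# Restores into a new DB instance; the source instance is left unchanged.
# <TIMESTAMP> is in UTC, for example 2025-01-31T04:15:00Z.
aws rds restore-db-instance-to-point-in-time \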
    --source-db-instance-identifier <SOURCE_INSTANCE> \
    --target-db-instance-identifier <TARGET_INSTANCE> \
    --restore-time <TIMESTAMP> \
    --region <REGION>
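
A restore time must fall inside the instance's restorable window and be given in UTC. The sketch below shows one way to look up the latest restorable time and to sanity-check a timestamp's shape before starting a restore; both function names are hypothetical, and the shape check does not validate calendar dates.

```bash
# Hypothetical helper: print the latest time the instance can be restored to.
# Usage: latest_restorable_time DB_INSTANCE_ID [REGION]
latest_restorable_time() {
    local instance="$1" region="${2:-ap-northeast-1}"
    aws rds describe-db-instances --db-instance-identifier "$instance" \
        --query 'DBInstances[0].LatestRestorableTime' --output text --region "$region"
}

# Hypothetical helper: succeed if the argument looks like 2025-01-31T04:15:00Z.
# This checks the shape only; it does not reject impossible dates such as month 13.
is_utc_timestamp() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]Z)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

Checking the shape first catches common slips such as a space in place of the `T` or a missing trailing `Z`.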

Troubleshooting

Common Issues

ECS tasks fail to start

  1. Check CloudWatch logs for the task
  2. Verify secrets are populated correctly
  3. Check security group rules allow connectivity to RDS/Redis
  4. Verify ECR image exists and is accessible
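
The stop reason recorded by ECS often points to which of the checks above applies (for example, an image pull failure versus a secrets error). This is a sketch under the assumption that stopped tasks are still visible, since ECS keeps them only for a limited time; the function name `stopped_task_reasons` is hypothetical.

```bash
# Hypothetical helper: list recently stopped tasks of a service with their stop reasons.
# Usage: stopped_task_reasons CLUSTER SERVICE [REGION]
stopped_task_reasons() {
    local cluster="$1" service="$2" region="${3:-ap-northeast-1}" tasks

    tasks=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
        --desired-status STOPPED --query 'taskArns[]' --output text \
        --region "$region") || return 1

    if [ -z "$tasks" ] || [ "$tasks" = "None" ]; then
        echo "no stopped tasks found"
        return 0
    fi

    # $tasks is intentionally unquoted so that each ARN becomes its own argument.
    aws ecs describe-tasks --cluster "$cluster" --tasks $tasks \
        --query 'tasks[].[taskArn,stoppedReason]' --output text --region "$region"
}
```

For example, `stopped_task_reasons mpac-smartpos-dev svc-portal` prints one line per stopped task.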

RDS connection timeout

  1. Verify that the RDS security group allows inbound traffic from the ECS security group
  2. Check RDS is in the correct subnet group
  3. Verify database credentials in Secrets Manager

mpac-obs OBS Box not healthy

  1. SSM into the instance and check Docker logs
  2. Verify S3 bucket permissions for Loki/Tempo
  3. Check EBS volume has sufficient space: df -h
  4. Restart services: cd /opt/obs-box && docker compose restart

ALB returns 503

  1. Check ECS service has running tasks
  2. Verify target group health checks are passing
  3. Check ECS service events for deployment failures
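
Target health can be read directly from the load balancer's target group. The sketch below assumes the target group ARN is already known (for example from the stack outputs or the console), which this runbook does not document; the function name `target_health` is hypothetical.

```bash
# Hypothetical helper: print each target's ID, health state, and reason code.
# Usage: target_health TARGET_GROUP_ARN [REGION]
target_health() {
    local tg_arn="$1" region="${2:-ap-northeast-1}"

    # The reason column is "None" for healthy targets.
    aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
        --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
        --output text --region "$region"
}
```

An empty result means no targets are registered, which matches the first check above: the service has no running tasks.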

Health Check Endpoints

| System | Endpoint | Expected |
|---|---|---|
| mpac-smartpos | /health | 200 OK |
| mpac-smartpos | /api/v1/auth/health | 200 OK |
| mpac-pgw | /health | 200 OK |
| mpac-pgw | /health/ready | 200 OK |
| mpac-obs Grafana | /api/health | 200 OK |
| mpac-obs Alloy | /-/ready (port 12345) | 200 OK |
| mpac-obs Prometheus | /-/healthy (port 9090) | 200 OK |
| mpac-obs Loki | /ready (port 3100) | 200 OK |
| mpac-obs Tempo | /ready (port 3200) | 200 OK |
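
These endpoints can be probed from a shell with `curl`. The sketch below treats only HTTP 200 as healthy, matching the table; the function name `check_health` is hypothetical, and the base URL of each system is environment-specific and not given in this runbook.

```bash
# Hypothetical helper: report whether a URL answers with HTTP 200.
# Usage: check_health URL
check_health() {
    local url="$1" code

    # -w prints only the status code; curl reports 000 when no response arrives.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code="000"

    if [ "$code" = "200" ]; then
        echo "OK   $url"
    else
        echo "FAIL $url (HTTP $code)"
        return 1
    fi
}
```

For example, with a base URL stored in a variable, `check_health "$BASE_URL/health"` and `check_health "$BASE_URL/api/v1/auth/health"` would cover both mpac-smartpos endpoints.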

MPAC — MP-Solution Advanced Cloud Service