MPAC Platform - Cost Optimization Guide
Overview
This document tracks cost optimization strategies for the MPAC platform's AWS infrastructure across three systems, all deployed in ap-northeast-1:
- mpac-smartpos — ap-northeast-1 (~$400-600/month staging, ~$2,000-3,500/month production)
- mpac-pgw — ap-northeast-1 (~$200-400/month staging, ~$1,500-2,500/month production)
- mpac-obs — ap-northeast-1 (~$150-300/month staging, ~$600-1,000/month production)
Total estimated monthly cost: ~$4,000-7,000/month (production, all three systems)
1. Quick Wins (Implement Now)
1.1 Dev Environment Scheduling
The dev environment runs 24/7 but is only used during business hours (Mon-Fri, 09:00-18:00 JST). Save ~65% of dev compute costs by scheduling automatic start/stop.
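The ~65% figure can be sanity-checked against the schedule itself. The following is a minimal sketch (the function name `off_hours_percent` is illustrative, not part of any existing tooling) that computes the share of a 168-hour week during which the environment is stopped:

```shell
# Percentage of a 168-hour week that a resource is OFF when it runs
# only <hours-per-day> hours on <days-per-week> days.
off_hours_percent() {
  local hours_per_day="$1" days_per_week="$2"
  awk -v h="$hours_per_day" -v d="$days_per_week" \
    'BEGIN { printf "%.1f\n", (1 - (h * d) / 168) * 100 }'
}

off_hours_percent 9 5   # Mon-Fri, 09:00-18:00 -> prints 73.2
```

A 45-hour week gives an upper bound of about 73% on compute-hour savings, so ~65% is a conservative estimate.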
Implementation: Add EventBridge rules to scale ECS services to 0 outside business hours.
# EventBridge cron expressions are evaluated in UTC. Each rule only defines a schedule;
# attach a target that changes the ECS desired count with `aws events put-targets`.
# Stop dev ECS services at 18:00 JST (09:00 UTC) on weekdays
aws events put-rule \
  --name "mpac-dev-stop" \
  --schedule-expression "cron(0 9 ? * MON-FRI *)" \
  --state ENABLED \
  --region ap-northeast-1
# Start at 09:00 JST (00:00 UTC)
aws events put-rule \
  --name "mpac-dev-start" \
  --schedule-expression "cron(0 0 ? * MON-FRI *)" \
  --state ENABLED \
  --region ap-northeast-1
Savings: ~$180-250/month on ECS compute (dev environment).
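The scheduled rules still need something that actually changes the desired count. One possible shape for that step is sketched below as a bash function that a scheduled job could run. The function name and the cluster name `mpac-smartpos-dev` in the usage note are assumptions (the latter chosen by analogy with the production cluster name used elsewhere in this guide).

```shell
# Set the desired count of every service in an ECS cluster.
# Usage: scale_cluster_services <cluster> <desired-count> [region]
scale_cluster_services() {
  local cluster="$1" count="$2" region="${3:-ap-northeast-1}"
  local arns arn
  # list-services returns service ARNs; text output separates them with whitespace.
  arns="$(aws ecs list-services --cluster "$cluster" --region "$region" \
    --query 'serviceArns[]' --output text)" || return 1
  for arn in $arns; do
    aws ecs update-service --cluster "$cluster" --service "$arn" \
      --desired-count "$count" --region "$region" >/dev/null || return 1
    echo "set ${arn##*/} to $count"
  done
}
```

Usage would look like `scale_cluster_services mpac-smartpos-dev 0` in the evening and the same call with the normal task count in the morning. If a service has Application Auto Scaling attached, its minimum capacity may override a desired count of 0.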
1.2 RDS Dev Instance Scheduling
Stop RDS dev instances outside business hours. RDS keeps an instance stopped for at most 7 days and then starts it again automatically, so a nightly and weekend stop schedule stays well within that limit.
# Stop dev RDS
aws rds stop-db-instance --db-instance-identifier mpac-smartpos-dev --region ap-northeast-1
aws rds stop-db-instance --db-instance-identifier mpac-pgw-dev --region ap-northeast-1
# Start dev RDS (run before dev start time)
aws rds start-db-instance --db-instance-identifier mpac-smartpos-dev --region ap-northeast-1
aws rds start-db-instance --db-instance-identifier mpac-pgw-dev --region ap-northeast-1
Savings: ~$50-80/month (dev RDS instances).
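A scheduled stop fails if the instance is not in the `available` state (for example, if it is already stopped or still starting). The sketch below, with an illustrative function name, checks the status first so that the job can run unattended:

```shell
# Stop an RDS instance only if it is currently "available";
# stop-db-instance is rejected for instances in other states.
# Usage: stop_if_available <db-instance-identifier> [region]
stop_if_available() {
  local id="$1" region="${2:-ap-northeast-1}" status
  status="$(aws rds describe-db-instances --db-instance-identifier "$id" \
    --region "$region" --query 'DBInstances[0].DBInstanceStatus' \
    --output text)" || return 1
  if [ "$status" = "available" ]; then
    aws rds stop-db-instance --db-instance-identifier "$id" \
      --region "$region" >/dev/null && echo "stopping $id"
  else
    echo "skipping $id (status: $status)"
  fi
}
```

For example, `stop_if_available mpac-smartpos-dev` and `stop_if_available mpac-pgw-dev` could replace the two plain stop commands.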
1.3 Switch Dev ElastiCache to Single-Node
The dev environment does not need Multi-AZ ElastiCache; a single cache.t3.micro node is enough.
The current dev.json already uses cache.t3.micro. Verify that it is deployed as a single node (no replication group).
Savings: ~$20-30/month.
2. Compute Optimization
2.1 ECS Fargate Spot (Staging)
Use Fargate Spot for staging workloads — up to 70% savings, with interruption handling.
# In ecs-stack.yaml (staging capacity provider)
CapacityProviders:
  - FARGATE
  - FARGATE_SPOT
DefaultCapacityProviderStrategy:
  - CapacityProvider: FARGATE_SPOT
    Weight: 4
  - CapacityProvider: FARGATE
    Weight: 1  # 20% on-demand for stability
Savings: ~$300-500/month on staging ECS.
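The weights translate into a blended discount that is lower than the headline Spot discount, because part of the fleet stays on-demand. This is a rough sketch (illustrative function name; it assumes no `Base` values are set and treats the 70% Spot discount as a best case):

```shell
# Blended savings (%) for a strategy with the given Spot and on-demand
# weights, assuming Spot tasks are <spot-discount-%> cheaper.
# Usage: blended_savings_percent <spot-weight> <ondemand-weight> <spot-discount-%>
blended_savings_percent() {
  awk -v sw="$1" -v ow="$2" -v d="$3" \
    'BEGIN { printf "%.0f\n", (sw / (sw + ow)) * d }'
}

blended_savings_percent 4 1 70   # prints 56
```

With a 4:1 split, roughly 80% of tasks run on Spot, so the best-case blended saving is about 56% of staging Fargate compute.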
2.2 Production: Graviton2 (ARM) Instances
Switch RDS and ElastiCache to Graviton2 (r6g, m6g) instances — ~20% cheaper for same performance.
| Component | Current (x86) | Graviton2 | Savings |
|---|---|---|---|
| RDS prod (mpac-smartpos) | db.r5.large | db.r6g.large | ~20% |
| RDS prod (mpac-pgw) | db.r5.large | db.r6g.large | ~20% |
| ElastiCache prod | cache.r5.large | cache.r6g.large | ~20% |
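Production parameter files already specify r6g and cache.r6g. If ECS workloads are also moved to ARM, note that Fargate tasks have no AMI to check: verify that the task definitions set runtimePlatform.cpuArchitecture to ARM64 and that the container images are built for linux/arm64. Any services on EC2 capacity need the arm64 variant of the ECS-optimized AMI.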
Savings: ~$150-250/month on production DB/cache.
2.3 Right-size Production ECS Tasks
Current production sizing is based on peak capacity estimates. After 30 days of production traffic, review CloudWatch Container Insights and right-size:
# Check actual CPU utilization over the last 30 days (timestamps in UTC; `date -d` requires GNU date).
# For memory, repeat with --metric-name MemoryUtilization.
# To query p99 directly, replace --statistics with --extended-statistics p99 (the two cannot be combined).
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=svc-smarttab Name=ClusterName,Value=mpac-smartpos-prod \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average Maximum \
  --region ap-northeast-1
Target: Keep p99 CPU < 60% (leaves headroom for bursts without overprovisioning).
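Once the observed peak is known, the 60% target can be turned into a task size. The helper below is a sketch with an illustrative name; it assumes CPU utilization scales linearly with allocated CPU units and picks the smallest Fargate CPU size that keeps the peak at or below the target:

```shell
# Smallest Fargate CPU size (CPU units) that keeps the observed peak
# utilization at or below the target.
# Usage: suggest_cpu_units <current-units> <observed-peak-%> [target-%]
suggest_cpu_units() {
  local current="$1" peak="$2" target="${3:-60}"
  awk -v c="$current" -v p="$peak" -v t="$target" 'BEGIN {
    needed = c * p / t   # units required to hit the target exactly
    n = split("256 512 1024 2048 4096 8192 16384", sizes, " ")
    for (i = 1; i <= n; i++) if (sizes[i] >= needed) { print sizes[i]; exit }
    print sizes[n]       # demand exceeds the largest size
  }'
}

suggest_cpu_units 1024 25   # prints 512
suggest_cpu_units 1024 90   # prints 2048
```

Each Fargate CPU size only allows certain memory values, so check the memory setting whenever the CPU size changes.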
3. Database Cost Optimization
3.1 RDS Storage Auto-Scaling
Enable storage auto-scaling to avoid overprovisioning upfront:
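# In ecs-stack.yaml
StorageEncrypted: true
StorageType: gp3
AllocatedStorage: 100       # Start conservative
MaxAllocatedStorage: 1000   # Auto-scale up to 1 TB
gp3 vs gp2: gp3 includes a 3,000 IOPS baseline regardless of volume size, whereas gp2 baseline IOPS scale with size (3 IOPS per GiB, so a 100 GiB volume gets a 300 IOPS baseline and relies on burst credits to reach 3,000). Switch all RDS instances from gp2 to gp3 for an estimated ~20% storage cost savings; confirm the per-GB rates against current RDS pricing for ap-northeast-1.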
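The IOPS difference is easiest to see with numbers. The sketch below (illustrative function name) applies the published gp2 rule for a single EBS volume, 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000, for comparison against the flat 3,000 IOPS gp3 baseline:

```shell
# Baseline IOPS of a single gp2 volume of the given size in GiB.
gp2_baseline_iops() {
  local gib="$1"
  local iops=$(( gib * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

gp2_baseline_iops 100    # prints 300  (gp3 baseline: 3000)
gp2_baseline_iops 1000   # prints 3000 (gp2 matches gp3 only at 1000 GiB)
```

For the 100 GiB starting allocation, gp3 therefore provides ten times the sustained IOPS of gp2.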
3.2 Read Replicas for Reporting
CLAUDE.md calls for up to 3 read replicas (1-2 for API reads, 1 for reporting). Use db.r6g.large for the API read replicas and the cheaper db.t3.medium for the reporting replica.
3.3 Partition Pruning for pg_partman
Partitioned tables (payment_transactions, auth_logs) keep 12 months of data online. Ensure older partitions are archived to S3 as Parquet on schedule, because old data that stays in the database increases RDS storage costs.
-- Check partition sizes for both partitioned tables (run periodically)
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE tablename LIKE 'payment_transactions_%'
   OR tablename LIKE 'auth_logs_%'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;
4. Data Transfer Cost Optimization
4.1 VPC Endpoints for AWS Services
Avoid NAT Gateway costs for AWS service traffic by using VPC endpoints:
| Service | Endpoint Type | Monthly Savings |
|---|---|---|
| S3 (Loki/Tempo data) | Gateway (free) | ~$20-50/month |
| ECR (image pulls) | Interface | ~$10-20/month |
| SSM Parameter Store | Interface | ~$5-10/month |
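Interface endpoints carry an hourly charge per Availability Zone plus a per-GB fee, so the savings in the table only materialize above a break-even traffic volume (gateway endpoints for S3 have no such charge). The sketch below uses an illustrative function name, and the rates in the example call are assumptions for ap-northeast-1 that should be checked against current pricing:

```shell
# Monthly GB at which one interface endpoint costs the same as sending
# that traffic through the NAT Gateway.
# Usage: endpoint_break_even_gb <nat-$/GB> <endpoint-$/hour-per-AZ> <endpoint-$/GB> <az-count>
endpoint_break_even_gb() {
  awk -v nat="$1" -v hourly="$2" -v per_gb="$3" -v azs="$4" 'BEGIN {
    fixed = hourly * 730 * azs   # 730 = average hours per month
    printf "%.0f\n", fixed / (nat - per_gb)
  }'
}

# Assumed rates: NAT $0.062/GB, endpoint $0.014/hour per AZ and $0.01/GB, 2 AZs
endpoint_break_even_gb 0.062 0.014 0.01 2   # prints 393
```

Under these assumed rates, an interface endpoint deployed in two AZs pays for itself only once it carries roughly 390 GB/month. ECR image pulls also need more than one interface endpoint, which raises the threshold further.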
Add to CloudFormation stacks:
S3VpcEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VPC
    ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
    VpcEndpointType: Gateway
    RouteTableIds:
      - !Ref PrivateRouteTable
4.2 CloudFront for Static Assets
Serve portal frontend static assets via CloudFront rather than directly from ECS:
- Reduces ALB data transfer costs
- Improves global latency for SmartPOS terminals
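Current: Assets are served from ECS, an estimated ~500 GB/month out at $0.09/GB = ~$45/month. After CloudFront: transfer from an AWS origin to CloudFront is not charged, and 500 GB/month fits within CloudFront's 1 TB/month always-free data transfer allowance, leaving only a few dollars per month at most for the same traffic. Beyond the free allowance, CloudFront's per-GB rate is of the same order as direct data transfer out, so re-check this estimate if traffic grows past 1 TB/month.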
Savings: ~$40/month.
5. Observability Cost Optimization (mpac-obs)
5.1 S3 Storage Classes for Loki/Tempo Data
Loki and Tempo data in S3 defaults to Standard storage. Add lifecycle policies:
{
  "Rules": [
    {
      "Status": "Enabled",
      "Filter": {"Prefix": "loki/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 60, "StorageClass": "GLACIER_IR"}
      ],
      "Expiration": {"Days": 365}
    }
  ]
}
Savings: ~$30-60/month on S3. Loki logs older than 7 days are rarely queried, but S3 lifecycle rules cannot transition objects to STANDARD_IA before they are 30 days old, and the API value for Glacier Instant Retrieval is GLACIER_IR.
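Applying the policy from the CLI could look like the sketch below. The function name is illustrative, and the bucket and file names in the usage note are placeholders for the actual Loki bucket and a local copy of the JSON document.

```shell
# Apply a lifecycle configuration file to a bucket, then print the
# configuration S3 stored so it can be reviewed.
# Usage: apply_lifecycle <bucket> <lifecycle-json-file>
apply_lifecycle() {
  local bucket="$1" file="$2"
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$bucket" \
    --lifecycle-configuration "file://$file" || return 1
  aws s3api get-bucket-lifecycle-configuration --bucket "$bucket"
}
```

For example: `apply_lifecycle my-loki-bucket lifecycle.json`. Note that `put-bucket-lifecycle-configuration` replaces the bucket's entire existing lifecycle configuration, so the file must contain every rule that should remain in effect.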
5.2 Metric Cardinality Control
High-cardinality metrics are the #1 Prometheus cost driver. Enforce cardinality limits in Alloy:
// In alloy/config.alloy (already configured)
// Ensure label values don't include high-cardinality fields like:
// - individual transaction IDs
// - user IDs in labels (use log metadata instead)
// - full URL paths (use route template, not actual URL)
5.3 Tempo Trace Sampling
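At 80M transactions/day, storing all traces is cost-prohibitive. Use tail-based sampling: keep 100% of error traces and of requests slower than 1 s, plus 1% of the remaining traces.
# Tail sampling is applied in the collector before traces reach Tempo; Tempo itself does not sample.
# Shown in OpenTelemetry Collector syntax (tail_sampling processor); in Alloy the
# equivalent component is otelcol.processor.tail_sampling.
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
Savings: ~$100-200/month on Tempo S3 storage.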
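The effect of the sampling policy on stored volume can be estimated with simple arithmetic. The sketch below uses an illustrative function name and assumes one trace per transaction and a 0.5% error rate, neither of which comes from measured data; it also ignores the latency policy for simplicity.

```shell
# Traces kept per day = all error traces + <sample-%> of the remaining traces.
# Usage: kept_traces_per_day <total-traces> <error-rate-%> <sample-%>
kept_traces_per_day() {
  awk -v total="$1" -v err="$2" -v sample="$3" 'BEGIN {
    errors = total * err / 100
    printf "%.0f\n", errors + (total - errors) * sample / 100
  }'
}

kept_traces_per_day 80000000 0.5 1   # prints 1196000
```

Under these assumptions, about 1.2M of 80M daily traces are retained, roughly 1.5% of the unsampled volume.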
6. Cost Monitoring & Alerts
6.1 AWS Budgets
Set up monthly cost alerts:
# Create budget for total platform spend
aws budgets create-budget \
  --account-id "$(aws sts get-caller-identity --query Account --output text)" \
  --budget '{
    "BudgetName": "mpac-platform-monthly",
    "BudgetLimit": {"Amount": "8000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "platform-team@yourorg.com"}]
  }]'
6.2 Cost Allocation Tags
Tag all resources for per-system cost breakdown:
| Tag | Values |
|---|---|
| Project | mpac |
| System | mpac-smartpos, mpac-pgw, mpac-obs |
| Environment | dev, staging, production |
Verify tags are applied in CloudFormation stacks (check Tags: sections in ecs-stack.yaml files).
7. Reserved Instances / Savings Plans
After 3 months of production traffic data:
| Service | Recommendation | Savings vs On-Demand |
|---|---|---|
| ECS Fargate | Compute Savings Plan (1-year, no upfront) | ~20% |
| RDS | Reserved Instance (1-year, partial upfront) | ~35% |
| ElastiCache | Reserved Node (1-year, partial upfront) | ~35% |
Do not commit to Reserved Instances before 3 months of production data. Right-size first, then commit.
Estimated annual savings from RIs: ~$8,000-15,000/year at production scale.
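The annual figure follows from the committed monthly spend and the discount rate. The sketch below uses an illustrative function name, and the monthly amounts in the example calls are hypothetical placeholders rather than measured MPAC costs.

```shell
# Annual savings from committing a given monthly on-demand spend
# at a given discount rate.
# Usage: annual_commitment_savings <monthly-on-demand-$> <discount-%>
annual_commitment_savings() {
  awk -v m="$1" -v d="$2" 'BEGIN { printf "%.0f\n", m * 12 * d / 100 }'
}

annual_commitment_savings 1500 35   # e.g. RDS:     prints 6300
annual_commitment_savings 2000 20   # e.g. Fargate: prints 4800
```

Replacing the placeholder amounts with right-sized spend per service from Cost Explorer gives a per-service breakdown of the estimate.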
8. Cost Review Schedule
| Frequency | Action |
|---|---|
| Weekly | Review AWS Cost Explorer for anomalies |
| Monthly | Review per-service cost breakdown, identify outliers |
| Quarterly | Right-size based on utilization data |
| Annually | Evaluate Reserved Instance renewals |
Use the mpac-platform-monthly budget alert as the primary trigger for out-of-cycle cost reviews.