MPAC Platform - Cost Optimization Guide

Overview

This document tracks cost optimization strategies for the MPAC platform's AWS infrastructure across three systems, all deployed in ap-northeast-1:

mpac-smartpos — ap-northeast-1 (~$400-600/month staging, ~$2,000-3,500/month production)
mpac-pgw — ap-northeast-1 (~$200-400/month staging, ~$1,500-2,500/month production)
mpac-obs — ap-northeast-1 (~$150-300/month staging, ~$600-1,000/month production)

Total estimated monthly cost: ~$4,000-7,000/month (production, all three systems)
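As a quick consistency check, the total can be reproduced by summing the per-system production ranges listed above. This is only a sketch; the dictionary simply restates the figures from this overview.

```python
# Production cost ranges (USD/month) as listed in the overview
production_ranges = {
    "mpac-smartpos": (2000, 3500),
    "mpac-pgw": (1500, 2500),
    "mpac-obs": (600, 1000),
}

# Sum the low ends and the high ends separately
total_low = sum(low for low, _ in production_ranges.values())
total_high = sum(high for _, high in production_ranges.values())

print(f"${total_low:,}-{total_high:,}/month")  # $4,100-7,000/month
```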

1. Quick Wins (Implement Now)

1.1 Dev Environment Scheduling

The dev environment runs 24/7 but is only used during business hours (Mon-Fri, 09:00-18:00 JST). Save ~65% of dev compute costs by scheduling automatic start/stop.

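Implementation: Add EventBridge rules that scale ECS services to 0 outside business hours. The rules below only define the schedules; each rule also needs a target attached with aws events put-targets (for example, a Lambda function that sets the services' desired count).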

bash

# Stop dev ECS services at 18:00 JST (09:00 UTC) on weekdays
aws events put-rule \
  --name "mpac-dev-stop" \
  --schedule-expression "cron(0 9 ? * MON-FRI *)" \
  --state ENABLED \
  --region ap-northeast-1

# Start at 09:00 JST (00:00 UTC)
aws events put-rule \
  --name "mpac-dev-start" \
  --schedule-expression "cron(0 0 ? * MON-FRI *)" \
  --state ENABLED \
  --region ap-northeast-1

Savings: ~$180-250/month on ECS compute (dev environment).
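The ~65% figure can be sanity-checked from the schedule itself. The sketch below computes the share of the week that falls outside the Mon-Fri 09:00-18:00 window; the function name is illustrative, and the result is an upper bound because it ignores start-up buffers and any resources that keep running.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def off_hours_fraction(days_per_week: int, hours_per_day: int) -> float:
    """Fraction of the week during which scheduled services are stopped."""
    on_hours = days_per_week * hours_per_day
    return 1 - on_hours / HOURS_PER_WEEK

# Mon-Fri, 09:00-18:00 JST -> 5 days x 9 hours = 45 running hours per week
fraction = off_hours_fraction(days_per_week=5, hours_per_day=9)
print(f"{fraction:.1%}")  # 73.2%
```

The theoretical ceiling is about 73%, so an estimate of ~65% leaves room for early starts, late stops, and costs that do not scale to zero.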

1.2 RDS Dev Instance Scheduling

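Stop RDS dev instances outside business hours. RDS keeps an instance stopped for at most 7 days before starting it again automatically, so a daily stop/start schedule stays well within that limit.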

bash

# Stop dev RDS
aws rds stop-db-instance --db-instance-identifier mpac-smartpos-dev --region ap-northeast-1
aws rds stop-db-instance --db-instance-identifier mpac-pgw-dev --region ap-northeast-1

# Start dev RDS (run before dev start time)
aws rds start-db-instance --db-instance-identifier mpac-smartpos-dev --region ap-northeast-1
aws rds start-db-instance --db-instance-identifier mpac-pgw-dev --region ap-northeast-1

Savings: ~$50-80/month (dev RDS instances).

1.3 Switch Dev ElastiCache to Single-Node

The dev environment does not need Multi-AZ ElastiCache; a single cache.t3.micro node is enough.

The current dev.json already uses cache.t3.micro. Verify that it is deployed as a single node (no replication group).

Savings: ~$20-30/month.

2. Compute Optimization

2.1 ECS Fargate Spot (Staging)

Use Fargate Spot for staging workloads — up to 70% savings, with interruption handling.

yaml

# In ecs-stack.yaml (staging capacity provider)
CapacityProviders:
  - FARGATE
  - FARGATE_SPOT
DefaultCapacityProviderStrategy:
  - CapacityProvider: FARGATE_SPOT
    Weight: 4
  - CapacityProvider: FARGATE
    Weight: 1  # 20% on-demand for stability

Savings: ~$300-500/month on staging ECS.
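To see what the 4:1 weighting implies, the sketch below computes the expected task split and the blended cost relative to running everything on demand. It assumes the full "up to 70%" Spot discount, which is a best case; the function name is illustrative.

```python
def blended_cost_ratio(spot_weight: int, on_demand_weight: int, spot_discount: float) -> float:
    """Cost relative to 100% on-demand, given capacity provider weights."""
    total = spot_weight + on_demand_weight
    spot_share = spot_weight / total
    on_demand_share = on_demand_weight / total
    # Spot tasks cost (1 - discount) of the on-demand price
    return spot_share * (1 - spot_discount) + on_demand_share

spot_share = 4 / (4 + 1)
ratio = blended_cost_ratio(spot_weight=4, on_demand_weight=1, spot_discount=0.70)
print(f"spot share: {spot_share:.0%}, blended cost: {ratio:.0%} of on-demand")
# spot share: 80%, blended cost: 44% of on-demand
```

With these weights the best-case saving on staging ECS compute is about 56%; the actual figure depends on the Spot discount in effect and on how many tasks run at a time.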

2.2 Production: Graviton2 (ARM) Instances

Switch RDS and ElastiCache to Graviton2 (r6g, m6g) instances — ~20% cheaper for same performance.

Component	Current (x86)	Graviton2	Savings
RDS prod (mpac-smartpos)	db.r5.large	db.r6g.large	~20%
RDS prod (mpac-pgw)	db.r5.large	db.r6g.large	~20%
ElastiCache prod	cache.r5.large	cache.r6g.large	~20%

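Production parameter files already specify db.r6g and cache.r6g. RDS and ElastiCache are managed services, so this change needs no ARM builds on the application side. Moving the ECS tasks themselves to Graviton is a separate step: Fargate has no AMI to switch, so it requires ARM64 container images and an ARM64 CPU architecture in the task definition's runtime platform.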

Savings: ~$150-250/month on production DB/cache.

2.3 Right-size Production ECS Tasks

Current production sizing is based on peak capacity estimates. After 30 days of production traffic, review CloudWatch Container Insights and right-size:

bash

# Check actual CPU utilization over the last 30 days (GNU date; timestamps in UTC)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=svc-smarttab Name=ClusterName,Value=mpac-smartpos-prod \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average Maximum \
  --region ap-northeast-1

# For a p99 figure, rerun with --extended-statistics p99 in place of --statistics
# (the API accepts one or the other per call). Use MemoryUtilization for memory.

Target: Keep p99 CPU < 60% (leaves headroom for bursts without overprovisioning).
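One way to turn the 60% target into a sizing decision is to scale the current CPU allocation by the observed p99 and round up to the next valid Fargate size. The sketch below assumes that approach; the function name and the example inputs are hypothetical, and memory must still be chosen from the values Fargate allows for the selected CPU size.

```python
# CPU sizes (in CPU units, 1024 = 1 vCPU) that Fargate task definitions accept
FARGATE_CPU_UNITS = [256, 512, 1024, 2048, 4096, 8192, 16384]

def recommend_cpu_units(current_units: int, p99_cpu_percent: float,
                        target_percent: float = 60.0) -> int:
    """Smallest Fargate CPU size that keeps the observed p99 at or below target."""
    needed = current_units * p99_cpu_percent / target_percent
    for size in FARGATE_CPU_UNITS:
        if size >= needed:
            return size
    return FARGATE_CPU_UNITS[-1]

# Hypothetical service: 2 vCPU allocated, p99 CPU at 25% -> can drop to 1 vCPU
print(recommend_cpu_units(2048, 25.0))  # 1024
# Hypothetical service: 1 vCPU allocated, p99 CPU at 75% -> needs 2 vCPU
print(recommend_cpu_units(1024, 75.0))  # 2048
```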

3. Database Cost Optimization

3.1 RDS Storage Auto-Scaling

Enable storage auto-scaling to avoid overprovisioning upfront:

yaml

# In ecs-stack.yaml
StorageEncrypted: true
StorageType: gp3
AllocatedStorage: 100  # Start conservative
MaxAllocatedStorage: 1000  # Auto-scale up to 1TB

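gp3 vs gp2: gp3 includes a 3,000 IOPS baseline regardless of volume size, whereas gp2 IOPS scale with volume size at 3 IOPS per GiB, so small gp2 volumes depend on burst credits to reach 3,000 IOPS. Switch all RDS instances from gp2 to gp3 for an estimated ~20% storage cost savings.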
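The IOPS difference is easiest to see with numbers. The sketch below applies the gp2 baseline formula (3 IOPS per GiB, minimum 100, maximum 16,000) and compares it with the 3,000 IOPS gp3 baseline quoted above. Treat it as an approximation: RDS may apply a different gp3 baseline at larger allocated sizes, so check the RDS storage documentation for the size you use.

```python
GP3_BASELINE_IOPS = 3000  # baseline figure quoted in this section

def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floor of 100, ceiling of 16,000."""
    return min(max(100, 3 * size_gib), 16000)

for size_gib in (100, 400, 1000):
    print(f"{size_gib} GiB: gp2={gp2_baseline_iops(size_gib)} gp3={GP3_BASELINE_IOPS}")
# 100 GiB: gp2=300 gp3=3000
# 400 GiB: gp2=1200 gp3=3000
# 1000 GiB: gp2=3000 gp3=3000
```

At the 100 GiB starting allocation, gp2 provides a tenth of the gp3 baseline and only reaches parity at 1,000 GiB.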

3.2 Read Replicas for Reporting

CLAUDE.md specifies 3 read replicas (1-2 for API reads, 1 for reporting). Use db.r6g.large for the API read replicas and the smaller db.t3.medium for the reporting replica to lower cost.

3.3 Partition Pruning for pg_partman

The partitioned tables (payment_transactions, auth_logs) have a 12-month online retention. Ensure old partitions are archived to S3 Parquet on schedule, because unarchived old data increases RDS storage costs.

sql

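-- Check partition sizes for both partitioned tables (run periodically)
-- Underscores are escaped because "_" is a single-character wildcard in LIKE
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE tablename LIKE 'payment\_transactions\_%'
   OR tablename LIKE 'auth\_logs\_%'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;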
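The partition list from a query like the one above can be checked against the 12-month retention to find partitions that should already have been archived. The sketch below assumes monthly partitions named with a _pYYYY_MM suffix; that naming is an assumption for illustration (pg_partman suffixes vary by version and interval), and the function names are hypothetical.

```python
import re
from datetime import date

# Assumed naming: <parent>_pYYYY_MM (hypothetical; adjust to the real suffix format)
PARTITION_RE = re.compile(r"^(?P<parent>.+)_p(?P<year>\d{4})_(?P<month>\d{2})$")

def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from earlier to later, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def partitions_past_retention(names, today: date, retention_months: int = 12):
    """Return partitions whose month lies entirely outside the retention window."""
    expired = []
    for name in names:
        match = PARTITION_RE.match(name)
        if not match:
            continue  # skip tables that do not follow the assumed naming
        start = date(int(match["year"]), int(match["month"]), 1)
        if months_between(start, today) > retention_months:
            expired.append(name)
    return expired

names = [
    "payment_transactions_p2024_05",
    "payment_transactions_p2024_06",
    "payment_transactions_p2025_05",
]
print(partitions_past_retention(names, today=date(2025, 6, 15)))
# ['payment_transactions_p2024_05']
```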

4. Data Transfer Cost Optimization

4.1 VPC Endpoints for AWS Services

Avoid NAT Gateway costs for AWS service traffic by using VPC endpoints:

Service	Endpoint Type	Monthly Savings
S3 (Loki/Tempo data)	Gateway (free)	~$20-50/month
ECR (image pulls)	Interface	~$10-20/month
SSM Parameter Store	Interface	~$5-10/month

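Interface endpoints are billed per hour per Availability Zone plus per GB processed, so the ECR and SSM savings above are net only if the avoided NAT charges exceed those fees. The S3 gateway endpoint has no charge.

Add to CloudFormation stacks: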

yaml

S3VpcEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VPC
    ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
    VpcEndpointType: Gateway
    RouteTableIds:
      - !Ref PrivateRouteTable

4.2 CloudFront for Static Assets

Serve portal frontend static assets via CloudFront rather than directly from ECS:

Reduces ALB data transfer costs
Improves global latency for SmartPOS terminals

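Current: assets are served from ECS, estimated at ~500 GB/month out at $0.09/GB = ~$45/month. After CloudFront: at the assumed ~$0.0085/GB, the same traffic costs ~$4/month. Confirm that rate against current CloudFront pricing for Japan before relying on this estimate.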

Savings: ~$40/month.

5. Observability Cost Optimization (mpac-obs)

5.1 S3 Storage Classes for Loki/Tempo Data

Loki and Tempo data in S3 default to Standard storage. Add lifecycle policies:

json

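{
  "Rules": [
    {
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 60, "StorageClass": "GLACIER_IR"}
      ],
      "Filter": {"Prefix": "loki/"},
      "Expiration": {"Days": 365}
    }
  ]
}

S3 rejects lifecycle transitions to STANDARD_IA for objects younger than 30 days, so day 30 is the earliest possible first transition, and GLACIER_IR is the API name for Glacier Instant Retrieval.

Savings: ~$30-60/month on S3 (Loki logs older than 7 days are rarely queried).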

5.2 Metric Cardinality Control

High cardinality metrics are the #1 Prometheus cost driver. Enforce cardinality limits in Alloy:

alloy

// In alloy/config.alloy — already configured
// Ensure label values don't include high-cardinality fields like:
// - individual transaction IDs
// - user IDs in labels (use log metadata instead)
// - full URL paths (use route template, not actual URL)
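The reason these fields matter is that the number of time series for a metric is, in the worst case, the product of the distinct values of each label. The sketch below illustrates that multiplication with hypothetical label counts; none of the numbers come from the MPAC platform.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())

# Hypothetical counts: labelling by route template keeps the metric small
by_route = series_count({"service": 10, "route": 50, "status": 5})
# The same metric labelled by full URL path instead of route template
by_url = series_count({"service": 10, "url": 100_000, "status": 5})

print(by_route)  # 2500
print(by_url)    # 5000000
```

A single unbounded label is enough to multiply storage and query cost by several orders of magnitude.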

5.3 Tempo Trace Sampling

At 80M transactions/day, storing all traces is cost-prohibitive. Use tail-based sampling: keep 100% of error traces, 1% of success traces.

yaml

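# Tail sampling runs in the collector in front of Tempo (the OpenTelemetry
# Collector tail_sampling processor; Alloy exposes it as
# otelcol.processor.tail_sampling). A trace is kept if any policy matches.
tail_sampling:
  policies:
    - name: errors
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow_requests
      type: latency
      latency: {threshold_ms: 1000}
    - name: probabilistic
      type: probabilistic
      probabilistic: {sampling_percentage: 1}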

Savings: ~$100-200/month on Tempo S3 storage.
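To estimate how much trace volume remains after sampling, the sketch below applies the rule stated above (keep all error traces, 1% of successful ones) to 80M transactions/day. The 0.5% error rate is a hypothetical input, and the latency policy is ignored, so the real kept volume will be somewhat higher.

```python
def kept_traces_per_day(total: int, error_rate: float,
                        success_sample_rate: float = 0.01) -> int:
    """Traces kept per day: 100% of errors plus a sample of successes."""
    errors = round(total * error_rate)
    successes = total - errors
    return errors + round(successes * success_sample_rate)

TOTAL_PER_DAY = 80_000_000
kept = kept_traces_per_day(TOTAL_PER_DAY, error_rate=0.005)  # hypothetical 0.5% errors

print(kept)                           # 1196000
print(f"{kept / TOTAL_PER_DAY:.1%}")  # 1.5%
```

Under these assumptions roughly 1.5% of traces are stored, and the error rate, not the 1% sample, becomes the main driver once it rises above about 1%.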

6. Cost Monitoring & Alerts

6.1 AWS Budgets

Set up monthly cost alerts:

bash

# Create budget for total platform spend
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "mpac-platform-monthly",
    "BudgetLimit": {"Amount": "8000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "platform-team@yourorg.com"}]
  }]'

6.2 Cost Allocation Tags

Tag all resources for per-system cost breakdown:

Tag	Values
`Project`	`mpac`
`System`	`mpac-smartpos`, `mpac-pgw`, `mpac-obs`
`Environment`	`dev`, `staging`, `production`

Verify tags are applied in CloudFormation stacks (check Tags: sections in ecs-stack.yaml files).

7. Reserved Instances / Savings Plans

After 3 months of production traffic data:

Service	Recommendation	Savings vs On-Demand
ECS Fargate	Compute Savings Plan (1-year, no upfront)	~20%
RDS	Reserved Instance (1-year, partial upfront)	~35%
ElastiCache	Reserved Node (1-year, partial upfront)	~35%

Do not commit to Reserved Instances before 3 months of production data. Right-size first, then commit.

Estimated annual savings from Reserved Instances and Savings Plans: ~$8,000-15,000/year at production scale.
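The annual estimate depends on how production spend splits across the three services. The sketch below applies the discount rates from the table to a hypothetical monthly breakdown; the spend figures are placeholders to be replaced with Cost Explorer data after the 3-month observation period.

```python
# Discount percentages from the table above
DISCOUNT_PERCENT = {"fargate": 20, "rds": 35, "elasticache": 35}

def annual_commitment_savings(monthly_spend: dict[str, int]) -> float:
    """Yearly savings if each service's on-demand spend gets its table discount."""
    monthly = sum(spend * DISCOUNT_PERCENT[service] / 100
                  for service, spend in monthly_spend.items())
    return monthly * 12

# Hypothetical on-demand spend in USD/month (not measured values)
example_spend = {"fargate": 2000, "rds": 1500, "elasticache": 500}
savings = annual_commitment_savings(example_spend)
print(f"${savings:,.0f}/year")  # $13,200/year
```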

8. Cost Review Schedule

Frequency	Action
Weekly	Review AWS Cost Explorer for anomalies
Monthly	Review per-service cost breakdown, identify outliers
Quarterly	Right-size based on utilization data
Annually	Evaluate Reserved Instance renewals

Use the mpac-platform-monthly budget alarm as the primary trigger for cost reviews.
