MPAC Platform - Cost Optimization Guide

Overview

This document tracks cost optimization strategies for the MPAC platform's AWS infrastructure across three systems, all deployed in ap-northeast-1:

  • mpac-smartpos — ap-northeast-1 (~$400-600/month staging, ~$2,000-3,500/month production)
  • mpac-pgw — ap-northeast-1 (~$200-400/month staging, ~$1,500-2,500/month production)
  • mpac-obs — ap-northeast-1 (~$150-300/month staging, ~$600-1,000/month production)

Total estimated monthly cost: ~$4,000-7,000/month (production, all three systems)


1. Quick Wins (Implement Now)

1.1 Dev Environment Scheduling

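The dev environment runs 24/7 but is only used during business hours (Mon-Fri, 09:00-18:00 JST). Business hours cover 45 of the 168 hours in a week, so the environment sits idle about 73% of the time; scheduling automatic start/stop should save ~65% of dev compute costs.

Implementation: Add EventBridge rules to scale ECS services to 0 outside business hours. The commands below only create the schedules; each rule also needs a target attached with aws events put-targets (for example, a Lambda function that calls ecs:UpdateService).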

bash
# Stop dev ECS services at 18:00 JST (09:00 UTC) on weekdays
aws events put-rule \
  --name "mpac-dev-stop" \
  --schedule-expression "cron(0 9 ? * MON-FRI *)" \
  --state ENABLED \
  --region ap-northeast-1

# Start at 09:00 JST (00:00 UTC)
aws events put-rule \
  --name "mpac-dev-start" \
  --schedule-expression "cron(0 0 ? * MON-FRI *)" \
  --state ENABLED \
  --region ap-northeast-1
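
The rules above define when something should happen but not what. One possible target is a small Lambda function that sets the desired count on every service in the dev cluster. The sketch below is an assumption, not code from the MPAC repositories: the cluster name, the `action` and `desired_count` event fields, and the idea of giving every service the same count on start are all illustrative.

```python
"""Hypothetical Lambda target for the mpac-dev-stop / mpac-dev-start rules."""

# Assumed cluster name; replace with the real dev cluster.
DEV_CLUSTER = "mpac-smartpos-dev"


def scale_services(ecs, cluster, desired_count):
    """Set desiredCount on every service in the cluster; return the ARNs touched."""
    touched = []
    paginator = ecs.get_paginator("list_services")
    for page in paginator.paginate(cluster=cluster):
        for service_arn in page["serviceArns"]:
            ecs.update_service(
                cluster=cluster,
                service=service_arn,
                desiredCount=desired_count,
            )
            touched.append(service_arn)
    return touched


def handler(event, context):
    """Entry point; `event` is the constant JSON input configured on the rule target."""
    import boto3  # imported lazily so scale_services can be exercised without AWS

    # Assumed input shape: {"action": "stop"} or {"action": "start", "desired_count": 2}
    if event.get("action") == "stop":
        desired = 0
    else:
        desired = int(event.get("desired_count", 1))
    ecs = boto3.client("ecs", region_name="ap-northeast-1")
    return {"scaled": scale_services(ecs, DEV_CLUSTER, desired)}
```

Restoring every service to the same count is a simplification; services with different baseline sizes would need a per-service mapping.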

Savings: ~$180-250/month on ECS compute (dev environment).

1.2 RDS Dev Instance Scheduling

Stop RDS dev instances outside business hours. RDS keeps an instance stopped for at most 7 days and then starts it again automatically, and storage is still billed while the instance is stopped.

bash
# Stop dev RDS
aws rds stop-db-instance --db-instance-identifier mpac-smartpos-dev --region ap-northeast-1
aws rds stop-db-instance --db-instance-identifier mpac-pgw-dev --region ap-northeast-1

# Start dev RDS (run before dev start time)
aws rds start-db-instance --db-instance-identifier mpac-smartpos-dev --region ap-northeast-1
aws rds start-db-instance --db-instance-identifier mpac-pgw-dev --region ap-northeast-1

Savings: ~$50-80/month (dev RDS instances).

1.3 Switch Dev ElastiCache to Single-Node

Dev environment doesn't need Multi-AZ ElastiCache. Use a single cache.t3.micro node.

Current dev.json already uses cache.t3.micro — verify it's single-node (no replication group).

Savings: ~$20-30/month.


2. Compute Optimization

2.1 ECS Fargate Spot (Staging)

Use Fargate Spot for staging workloads for up to 70% savings. Spot tasks can be reclaimed with a two-minute warning, so staging services must tolerate interruption.

yaml
# In ecs-stack.yaml (staging capacity provider)
CapacityProviders:
  - FARGATE
  - FARGATE_SPOT
DefaultCapacityProviderStrategy:
  - CapacityProvider: FARGATE_SPOT
    Weight: 4
  - CapacityProvider: FARGATE
    Weight: 1  # 20% on-demand for stability

Savings: ~$300-500/month on staging ECS.
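
As a rough check on the weights above, the share of tasks placed on Spot follows directly from the weight ratio, and the blended discount is that share times the Spot discount. The sketch below assumes the "up to 70%" figure is realized in full, which is an upper bound rather than a measured value.

```python
def spot_share(spot_weight: int, on_demand_weight: int) -> float:
    """Fraction of tasks the strategy places on FARGATE_SPOT."""
    return spot_weight / (spot_weight + on_demand_weight)


def blended_discount(spot_weight: int, on_demand_weight: int, spot_discount: float) -> float:
    """Overall compute discount when only the Spot share is discounted."""
    return spot_share(spot_weight, on_demand_weight) * spot_discount


# Weights 4:1 put 80% of tasks on Spot; at a 70% Spot discount the
# blended saving is about 56% of the staging Fargate bill.
share = spot_share(4, 1)
saving = blended_discount(4, 1, 0.70)
```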

2.2 Production: Graviton2 (ARM) Instances

Switch RDS and ElastiCache to Graviton2 (r6g, m6g) instances — ~20% cheaper for same performance.

| Component | Current (x86) | Graviton2 | Savings |
| --- | --- | --- | --- |
| RDS prod (mpac-smartpos) | db.r5.large | db.r6g.large | ~20% |
| RDS prod (mpac-pgw) | db.r5.large | db.r6g.large | ~20% |
| ElastiCache prod | cache.r5.large | cache.r6g.large | ~20% |

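Production parameter files already specify r6g and cache.r6g. ECS runs on Fargate, which has no AMI to select: if the services should also move to ARM, verify that the task definitions set the ARM64 CPU architecture and that the container images are built for arm64.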

Savings: ~$150-250/month on production DB/cache.

2.3 Right-size Production ECS Tasks

Current production sizing is based on peak capacity estimates. After 30 days of production traffic, review CloudWatch Container Insights and right-size:

bash
# Check actual CPU utilization over the last 30 days (date -d requires GNU date)
# Repeat with --metric-name MemoryUtilization for memory
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=svc-smarttab Name=ClusterName,Value=mpac-smartpos-prod \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average Maximum \
  --region ap-northeast-1

Target: Keep p99 CPU < 60% (leaves headroom for bursts without overprovisioning). To read p99 directly, replace --statistics Average Maximum with --extended-statistics p99; a single call accepts one or the other, not both.
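
One way to turn the 60% target into a sizing decision is to scale the current CPU allocation by the ratio of observed p99 to the target and round up to the next Fargate CPU size. The sketch below is illustrative: the function names are hypothetical, the percentile uses the simple nearest-rank method, and memory would need the same treatment separately.

```python
import math

# CPU unit sizes that Fargate task definitions accept.
FARGATE_CPU_UNITS = [256, 512, 1024, 2048, 4096, 8192, 16384]


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of utilization samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_cpu_units(samples, current_units, target_pct=60.0):
    """Smallest Fargate CPU size that keeps p99 utilization at or below target_pct."""
    p99 = percentile(samples, 99)
    needed = current_units * p99 / target_pct
    for size in FARGATE_CPU_UNITS:
        if size >= needed:
            return size
    return FARGATE_CPU_UNITS[-1]


# A task with 1024 CPU units whose p99 is 20% needs about 341 units at the
# 60% target, so the next size up (512) is suggested.
suggestion = suggest_cpu_units([20.0] * 99 + [30.0], current_units=1024)
```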


3. Database Cost Optimization

3.1 RDS Storage Auto-Scaling

Enable storage auto-scaling to avoid overprovisioning upfront:

yaml
# In ecs-stack.yaml
StorageEncrypted: true
StorageType: gp3
AllocatedStorage: 100  # Start conservative
MaxAllocatedStorage: 1000  # Auto-scale up to 1TB

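gp3 vs gp2: gp3 includes a 3,000 IOPS baseline regardless of volume size, whereas gp2 ties baseline IOPS to size (3 IOPS per GiB, so a 100 GiB volume gets only 300). Switch all RDS instances from gp2 to gp3 for ~20% storage cost savings; confirm the percentage against current RDS pricing for ap-northeast-1.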
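
The IOPS difference can be made concrete with the gp2 sizing rule. The sketch below uses the general EBS figures (3 IOPS per GiB, floor of 100, cap of 16,000 for gp2; 3,000 included for gp3) and ignores gp2 burst credits; treat the constants as assumptions to check against the RDS documentation for the engine in use.

```python
import math

GP3_INCLUDED_IOPS = 3000  # baseline included with gp3 at any size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return min(max(3 * size_gib, 100), 16000)


def gp2_gib_for_iops(target_iops: int) -> int:
    """Volume size gp2 needs to reach target_iops as a baseline."""
    return math.ceil(target_iops / 3)


# At the 100 GiB starting size gp2 offers 300 baseline IOPS; matching
# gp3's included 3,000 would take a 1,000 GiB gp2 volume.
starting_iops = gp2_baseline_iops(100)
size_to_match_gp3 = gp2_gib_for_iops(GP3_INCLUDED_IOPS)
```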

3.2 Read Replicas for Reporting

CLAUDE.md mentions up to 3 read replicas (1-2 for API reads, 1 for reporting). Use db.r6g.large for the API read replicas and the cheaper db.t3.medium for the reporting replica.

3.3 Partition Pruning for pg_partman

The partitioned tables (payment_transactions, auth_logs) have a 12-month online retention period. Ensure older partitions are archived to S3 Parquet on schedule, because unarchived old data keeps adding to RDS storage costs.

sql
-- Check partition sizes for both partitioned tables (run periodically)
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS size
FROM pg_tables
WHERE tablename LIKE 'payment_transactions_%'
   OR tablename LIKE 'auth_logs_%'
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;

4. Data Transfer Cost Optimization

4.1 VPC Endpoints for AWS Services

Avoid NAT Gateway costs for AWS service traffic by using VPC endpoints:

| Service | Endpoint Type | Monthly Savings |
| --- | --- | --- |
| S3 (Loki/Tempo data) | Gateway (free) | ~$20-50/month |
| ECR (image pulls) | Interface | ~$10-20/month |
| SSM Parameter Store | Interface | ~$5-10/month |

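Interface endpoints are billed per hour per Availability Zone plus a per-GB charge, so the ECR and SSM figures hold only when the NAT traffic they replace costs more than the endpoint itself. ECR image pulls need both the ecr.api and ecr.dkr interface endpoints, plus the S3 gateway endpoint for image layers.

Add to CloudFormation stacks: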

yaml
S3VpcEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VPC
    ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
    VpcEndpointType: Gateway
    RouteTableIds:
      - !Ref PrivateRouteTable
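
Unlike the gateway endpoint above, an interface endpoint has a fixed monthly cost, so it only pays off above a certain traffic volume. The sketch below estimates that break-even point. All three prices are assumptions chosen for illustration and should be replaced with the current ap-northeast-1 rates.

```python
HOURS_PER_MONTH = 730

# Assumed prices (USD); verify against current ap-northeast-1 pricing.
NAT_PER_GB = 0.062        # NAT Gateway data processing
ENDPOINT_HOURLY = 0.014   # interface endpoint, per AZ
ENDPOINT_PER_GB = 0.01    # interface endpoint data processing


def net_monthly_saving(gb_per_month: float, az_count: int) -> float:
    """NAT processing cost avoided minus the endpoint's own cost."""
    nat_cost_avoided = gb_per_month * NAT_PER_GB
    endpoint_cost = (
        az_count * ENDPOINT_HOURLY * HOURS_PER_MONTH
        + gb_per_month * ENDPOINT_PER_GB
    )
    return nat_cost_avoided - endpoint_cost


def break_even_gb(az_count: int) -> float:
    """Monthly traffic at which one interface endpoint starts saving money."""
    fixed_cost = az_count * ENDPOINT_HOURLY * HOURS_PER_MONTH
    return fixed_cost / (NAT_PER_GB - ENDPOINT_PER_GB)


# With these assumed prices and two AZs, one endpoint needs roughly
# 390 GB/month of displaced NAT traffic before it saves anything.
threshold = break_even_gb(az_count=2)
```

If the measured ECR or SSM traffic falls below that threshold, leaving the traffic on the NAT Gateway is the cheaper option.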

4.2 CloudFront for Static Assets

Serve portal frontend static assets via CloudFront rather than directly from ECS:

  • Reduces ALB data transfer costs
  • Improves global latency for SmartPOS terminals

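Current: Assets served from ECS (estimated ~500 GB/month out at $0.09/GB = ~$45/month). After CloudFront: traffic from the origin to CloudFront is not charged, and viewer traffic is billed at CloudFront's per-GB rate instead. This guide's working figure of ~$0.0085/GB (~$4/month for the same traffic) is roughly a tenth of CloudFront's published on-demand rates, so the real saving depends mainly on how much of the traffic falls within CloudFront's free monthly data transfer allowance. Recheck both rates against current pricing before relying on this number.

Savings: up to ~$40/month.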


5. Observability Cost Optimization (mpac-obs)

5.1 S3 Storage Classes for Loki/Tempo Data

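Loki and Tempo data in S3 defaults to Standard storage. Add lifecycle policies to move older objects to cheaper classes. S3 does not accept a transition to STANDARD_IA earlier than 30 days after object creation, and the Glacier Instant Retrieval class is named GLACIER_IR in the API. The rule below covers the loki/ prefix; add a matching rule for the Tempo prefix.

json
{
  "Rules": [
    {
      "ID": "loki-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "loki/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER_IR"}
      ],
      "Expiration": {"Days": 365}
    }
  ]
}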

Savings: ~$30-60/month on S3 (Loki logs older than 7 days rarely queried).

5.2 Metric Cardinality Control

High-cardinality metrics are the #1 Prometheus cost driver, because every distinct combination of label values becomes its own time series. Enforce cardinality limits in Alloy:

alloy
// In alloy/config.alloy — already configured
// Ensure label values don't include high-cardinality fields like:
// - individual transaction IDs
// - user IDs in labels (use log metadata instead)
// - full URL paths (use route template, not actual URL)
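
To see why the labels listed above matter, note that the worst-case series count for a metric is the product of the number of distinct values of each label. The sketch below uses made-up label counts purely for illustration.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label counts for a request-duration metric.
bounded = max_series({"route": 40, "method": 4, "status": 5})

# Adding a user_id label with 10,000 distinct values multiplies the bound.
unbounded = max_series({"route": 40, "method": 4, "status": 5, "user_id": 10_000})
```

With these example numbers the bound grows from 800 series to 8,000,000, which is why identifiers belong in log metadata rather than in metric labels.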

5.3 Tempo Trace Sampling

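At 80M transactions/day, storing all traces is cost-prohibitive. Use tail-based sampling: keep 100% of error traces and of requests slower than 1 second, plus 1% of the remaining traces.

yaml
# Tail sampling runs in the collector in front of Tempo, not in Tempo itself.
# OpenTelemetry Collector tail_sampling processor syntax (Alloy exposes the
# same policies through otelcol.processor.tail_sampling):
tail_sampling:
  policies:
    - name: errors
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow_requests
      type: latency
      latency: {threshold_ms: 1000}
    - name: probabilistic
      type: probabilistic
      probabilistic: {sampling_percentage: 1}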

Savings: ~$100-200/month on Tempo S3 storage.
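
The volume of traces that survive sampling can be estimated from the error rate and the share of slow requests. The sketch below assumes one trace per transaction and treats error and slow traces as non-overlapping; the 0.5% error rate and 1% slow rate are placeholders, not measured MPAC figures.

```python
def traces_kept_per_day(total: int, error_rate: float, slow_rate: float,
                        sample_rate: float) -> int:
    """Traces retained per day: all error and slow traces plus a sample of the rest."""
    always_kept = total * (error_rate + slow_rate)
    sampled = total * (1 - error_rate - slow_rate) * sample_rate
    return round(always_kept + sampled)


# Placeholder rates: 0.5% errors, 1% slow requests, 1% probabilistic sample.
kept = traces_kept_per_day(80_000_000, error_rate=0.005, slow_rate=0.01,
                           sample_rate=0.01)
retained_fraction = kept / 80_000_000
```

Under these assumptions about 2 million of the 80 million daily traces are stored, roughly 2.5% of the total.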


6. Cost Monitoring & Alerts

6.1 AWS Budgets

Set up monthly cost alerts:

bash
# Create budget for total platform spend
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "mpac-platform-monthly",
    "BudgetLimit": {"Amount": "8000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "platform-team@yourorg.com"}]
  }]'

6.2 Cost Allocation Tags

Tag all resources for per-system cost breakdown:

| Tag | Values |
| --- | --- |
| Project | mpac |
| System | mpac-smartpos, mpac-pgw, mpac-obs |
| Environment | dev, staging, production |

Verify tags are applied in CloudFormation stacks (check Tags: sections in ecs-stack.yaml files).


7. Reserved Instances / Savings Plans

After 3 months of production traffic data:

| Service | Recommendation | Savings vs On-Demand |
| --- | --- | --- |
| ECS Fargate | Compute Savings Plan (1-year, no upfront) | ~20% |
| RDS | Reserved Instance (1-year, partial upfront) | ~35% |
| ElastiCache | Reserved Node (1-year, partial upfront) | ~35% |

Do not commit to Reserved Instances before 3 months of production data. Right-size first, then commit.

Estimated annual savings from RIs: ~$8,000-15,000/year at production scale.


8. Cost Review Schedule

| Frequency | Action |
| --- | --- |
| Weekly | Review AWS Cost Explorer for anomalies |
| Monthly | Review per-service cost breakdown, identify outliers |
| Quarterly | Right-size based on utilization data |
| Annually | Evaluate Reserved Instance renewals |

Use the mpac-platform-monthly budget alarm as the primary trigger for cost reviews.

MPAC — MP-Solution Advanced Cloud Service