AWS Infrastructure

Part of: MPAC SmartPOS Cloud Platform - Product Requirements
Version: 2.0
Last Updated: 2026-01-28


Overview

This document details the AWS infrastructure architecture for the MPAC SmartPOS Cloud Platform. The infrastructure is designed to support 400,000+ concurrent devices, handle 80M transactions per day, and maintain 15,000 RPS sustained at peak. The architecture leverages AWS managed services for high availability, scalability, and operational efficiency.
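
As a quick sanity check, the headline figures can be related with simple arithmetic. This is a minimal sketch that assumes transactions are spread evenly over the day when computing the average; it also treats one request as one transaction, which understates request volume if a transaction involves several API calls.

```python
# Back-of-the-envelope check relating the daily volume to the stated peak rate.
SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = 80_000_000  # from the overview
peak_rps = 15_000                  # sustained peak from the overview
devices = 400_000                  # concurrent devices from the overview

# Average rate if load were perfectly even across 24 hours (an assumption).
average_tps = transactions_per_day / SECONDS_PER_DAY

# How far above the daily average the stated peak sits.
peak_to_average = peak_rps / average_tps

# Average number of transactions each device produces per day.
per_device_per_day = transactions_per_day / devices

print(f"average: {average_tps:.0f} TPS")
print(f"peak/average: {peak_to_average:.1f}x")
print(f"per device: {per_device_per_day:.0f} transactions/day")
```

Under these assumptions the daily average is roughly 926 transactions per second, so the 15,000 RPS peak leaves about a 16x margin over the average, and each device contributes about 200 transactions per day.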


Compute

Purpose: Scalable container orchestration for all microservices.

ECS Fargate

  • Service: AWS ECS with Fargate launch type
  • Configuration: Serverless container execution
  • Benefits: No EC2 instance management, automatic scaling, pay-per-use

Service Auto Scaling

  • Per Service: Each microservice has independent scaling policies
  • Scaling Triggers:
    • CPU utilization > 70%
    • Memory utilization > 80%
    • Custom CloudWatch metrics (request count, latency)

Target Tracking Scaling Policies

  • CPU Target: Maintain 60% average CPU utilization
  • Memory Target: Maintain 70% average memory utilization
  • Custom Metrics: Request count per target (ALB), WebSocket connections

Example Configuration:

yaml
# ECS Service Auto Scaling
ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 60.0
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
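
To build intuition for how a 60% CPU target translates into task counts, the following sketch uses a simple proportional estimate. The exact algorithm is internal to Application Auto Scaling, so this is an approximation only; the function name `estimate_desired_tasks` and its parameters are hypothetical.

```python
import math


def estimate_desired_tasks(current_tasks, observed_pct, target_pct, min_tasks, max_tasks):
    """Proportional estimate of the task count that brings a metric back to target.

    This is a simplified model, not the algorithm AWS uses internally.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    # Scale the task count by how far the observed metric is from the target.
    raw = math.ceil(current_tasks * observed_pct / target_pct)
    # Clamp to the service's configured minimum and maximum capacity.
    return max(min_tasks, min(max_tasks, raw))


# Example: 4 tasks averaging 90% CPU against the 60% target, bounded to 2..20 tasks.
print(estimate_desired_tasks(4, 90, 60, min_tasks=2, max_tasks=20))
```

The asymmetric cooldowns in the policy (60 seconds out, 300 seconds in) mean a real service adds capacity quickly and removes it slowly, which this single-step estimate does not model.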

ECS Services Deployed

Application Services:

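| Service | Task Definition | Memory | CPU | Min | Max | Port |
| --- | --- | --- | --- | --- | --- | --- |
| svc-portal | svc-portal:latest | 2 GB | 1 vCPU | 2 | 20 | 8002 |
| svc-smarttab | svc-smarttab:latest | 4 GB | 2 vCPU | 5 | 50 | 8080 |
| mpac-pgw | mpac-pgw:latest | 2 GB | 1 vCPU | 3 | 30 | 8080 |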

Observability Services (mpac-obs):

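| Service | Task Definition | Memory | CPU | Replicas | Port | Purpose |
| --- | --- | --- | --- | --- | --- | --- |
| alloy | grafana/alloy:1.12.0 | 2 GB | 1 vCPU | 2 | 4317/4318 | OTLP collector |
| prometheus | prom/prometheus:3.8.0 | 4 GB | 2 vCPU | 2 | 9090 | Metrics storage |
| loki | grafana/loki:3.6.0 | 4 GB | 2 vCPU | 2 | 3100 | Log aggregation |
| tempo | grafana/tempo:2.9.0 | 4 GB | 2 vCPU | 2 | 3200 | Trace storage |
| grafana | grafana/grafana:12.3.0 | 2 GB | 1 vCPU | 2 | 3000 | Visualization |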

Key Features:

  • Service Discovery: Prometheus auto-discovers services via ECS tags (prometheus.io/scrape=true)
  • Persistent Storage: EFS volumes for Prometheus, Loki, and Tempo data
  • High Availability: All observability services run with 2+ replicas across multiple AZs
  • OTLP Ingestion: Alloy receives metrics, logs, and traces on ports 4317 (gRPC) and 4318 (HTTP)

📄 Read more: Observability Stack


Database

Purpose: High-performance, highly-available relational database for all services.

RDS PostgreSQL

  • Engine Version: PostgreSQL 15+
  • Deployment: Multi-AZ for automatic failover
  • Instance Class: db.r6g.4xlarge (production)
    • Memory: 128 GiB
    • vCPUs: 16
    • Network: Up to 10 Gbps
  • Storage Configuration:
    • Type: Provisioned IOPS SSD (io2)
    • Initial Size: 1TB
    • Auto-scaling: Enabled (max 5TB)
    • IOPS: 10,000 baseline

Backup Strategy

  • Automated Snapshots: Daily at 03:00 UTC
  • Retention Period: 30 days
  • Point-in-Time Recovery: Enabled (restorable to any second within the retention period; the latest restorable time is typically within the last 5 minutes)
  • Cross-Region Backup: Weekly snapshots to DR region

Read Replicas

  • Count: 2 read replicas
  • Purpose: Distribute reporting and analytics queries
  • Replication: Asynchronous replication with < 1 second lag
  • Failover: Can be promoted to primary in disaster scenarios

Connection Pooling:

yaml
# PgBouncer configuration
PgBouncer:
  MaxClientConnections: 10000
  DefaultPoolSize: 25
  PoolMode: transaction
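
The pooling figures can be turned into a rough connection budget. This sketch assumes the keys above correspond to the `max_client_conn`, `default_pool_size`, and `pool_mode` settings in `pgbouncer.ini`, where the pool size applies per database/user pair. The number of pools and PgBouncer instances are hypothetical values chosen for illustration.

```python
def server_connections(default_pool_size, pools, instances):
    """Upper bound on PostgreSQL connections opened by a PgBouncer fleet."""
    return default_pool_size * pools * instances


max_client_connections = 10_000  # from the configuration above
default_pool_size = 25           # from the configuration above
pools = 3                        # hypothetical: one database/user pair per service
instances = 2                    # hypothetical: number of PgBouncer instances

total = server_connections(default_pool_size, pools, instances)
per_instance = server_connections(default_pool_size, pools, 1)

# Client connections multiplexed onto each server connection, per instance.
clients_per_server_connection = max_client_connections / per_instance

print(f"PostgreSQL connections needed: {total}")
print(f"clients per server connection: {clients_per_server_connection:.0f}")
```

In transaction mode a server connection is held only for the duration of a transaction, which is what makes this degree of multiplexing workable; the PostgreSQL `max_connections` setting needs to exceed the computed total.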

Cache

Purpose: Low-latency caching layer for session data, rate limiting, and frequently accessed data.

ElastiCache Redis

  • Engine Version: Redis 7+
  • Deployment Mode: Cluster mode enabled
  • Instance Type: cache.r6g.large
    • Memory: 13.07 GiB
    • vCPUs: 2
    • Network Performance: Up to 10 Gbps

High Availability

  • Multi-AZ: Enabled with automatic failover
  • Replication: 2 read replicas per shard
  • Sharding: 3 shards for horizontal scaling
  • Failover Time: < 30 seconds
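
The topology above implies a fixed node count and raw memory footprint. This sketch works through the arithmetic; it assumes the 16,384 Redis Cluster hash slots are split evenly across shards, and it reports raw node memory, not usable capacity.

```python
SHARDS = 3               # from the sharding setting above
REPLICAS_PER_SHARD = 2   # from the replication setting above
NODE_MEMORY_GIB = 13.07  # cache.r6g.large
HASH_SLOTS = 16_384      # fixed by the Redis Cluster specification

# Each shard has one primary plus its replicas.
nodes_total = SHARDS * (1 + REPLICAS_PER_SHARD)

# Only primaries hold distinct data; replicas hold copies.
primary_memory_gib = SHARDS * NODE_MEMORY_GIB

# Even slot split (an assumption); leftover slots go to the first shards.
base, extra = divmod(HASH_SLOTS, SHARDS)
slots_per_shard = [base + (1 if i < extra else 0) for i in range(SHARDS)]

print(f"nodes: {nodes_total}")
print(f"raw primary memory: {primary_memory_gib:.2f} GiB")
print(f"slots per shard: {slots_per_shard}")
```

Usable capacity is lower than the raw figure once memory reserved for replication and snapshots is taken into account.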

Backup and Recovery

  • Snapshot Schedule: Daily at 04:00 UTC
  • Retention: 7 days
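  • AOF (Append-Only File): Not used; ElastiCache does not support AOF on Redis 2.8.22 and later, so durability relies on Multi-AZ replication and the daily snapshots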

Use Cases:

  • Session storage (JWT blacklist, user sessions)
  • Rate limiting counters
  • Idempotency key cache (24h TTL)
  • Frequently accessed merchant/store configs
  • Real-time device status
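
The idempotency key use case can be illustrated with an in-memory stand-in. This is a minimal sketch, not the platform's implementation: the class `IdempotencyCache` and its method names are hypothetical, and a dictionary replaces Redis so the example runs on its own. Against Redis, the equivalent atomic write would be a `SET key value NX EX 86400` command.

```python
import time


class IdempotencyCache:
    """In-memory stand-in for a Redis-backed idempotency key cache."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # key -> (expires_at, response)

    def get_or_store(self, key, compute):
        """Return (response, replayed). Runs compute() only for unseen or expired keys."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1], True
        response = compute()
        self._entries[key] = (now + self._ttl, response)
        return response, False


cache = IdempotencyCache()
first = cache.get_or_store("txn-123", lambda: {"status": "captured"})
second = cache.get_or_store("txn-123", lambda: {"status": "captured-again"})
print(first, second)
```

The second call returns the stored response with `replayed` set to true, so a retried request does not trigger a second capture within the 24-hour window.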

Load Balancing

Purpose: Distribute traffic across service instances with health monitoring.

Application Load Balancer (ALB)

  • Use Case: HTTP/HTTPS traffic for REST APIs
  • Listeners:
    • Port 443 (HTTPS) - Primary
    • Port 80 (HTTP) - Redirect to 443
  • SSL/TLS:
    • Certificate: AWS ACM managed
    • Security Policy: ELBSecurityPolicy-TLS13-1-2-2021-06

Target Groups

  • svc-portal: Port 8002
  • svc-smarttab: Port 8001
  • mpac-pgw: Port 8003
  • mpac-frontend: Port 3000

Network Load Balancer (NLB)

  • Use Case: WebSocket connections (if needed for persistent connections)
  • Listeners:
    • Port 443 (TLS termination)
  • Deregistration Delay (connection draining): 300 seconds

Health Checks

  • Endpoint: GET /health
  • Interval: 30 seconds
  • Timeout: 5 seconds
  • Healthy Threshold: 2 consecutive successes
  • Unhealthy Threshold: 3 consecutive failures
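
These settings determine roughly how long a target takes to change state. The sketch below multiplies the interval by the threshold, which is an approximation: the real window also depends on where in the interval the failure occurs and on the check timeout.

```python
def detection_window_seconds(interval_s, threshold):
    """Approximate time for a target to change state after consecutive check results."""
    return interval_s * threshold


INTERVAL_S = 30          # from the settings above
TIMEOUT_S = 5            # from the settings above
HEALTHY_THRESHOLD = 2    # from the settings above
UNHEALTHY_THRESHOLD = 3  # from the settings above

unhealthy_after = detection_window_seconds(INTERVAL_S, UNHEALTHY_THRESHOLD)
healthy_after = detection_window_seconds(INTERVAL_S, HEALTHY_THRESHOLD)

# The timeout must be shorter than the interval for checks not to overlap.
timeout_is_valid = TIMEOUT_S < INTERVAL_S

print(f"marked unhealthy after about {unhealthy_after} s")
print(f"marked healthy after about {healthy_after} s")
```

With the values above, a failing task keeps receiving traffic for up to about 90 seconds, and a new task starts receiving traffic about 60 seconds after it begins passing checks.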

Health Check Response:

json
{
  "status": "healthy",
  "service": "svc-portal",
  "version": "2.1.0",
  "timestamp": "2026-01-28T10:00:00Z",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "external_api": "ok"
  }
}
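
A handler producing this payload might look like the following sketch. The function name `build_health_response` is hypothetical, as is the rule that any failing dependency turns the overall status to "unhealthy" with HTTP 503; the document does not specify how the individual checks are aggregated.

```python
import json
from datetime import datetime, timezone


def build_health_response(service, version, checks, now=None):
    """Return (http_status, json_body) in the shape shown above."""
    now = now or datetime.now(timezone.utc)
    # Assumption: the service is healthy only if every dependency check is "ok".
    healthy = all(state == "ok" for state in checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "service": service,
        "version": version,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "checks": dict(checks),
    }
    # A non-200 status makes the load balancer count the check as a failure.
    return (200 if healthy else 503), json.dumps(body)


status, body = build_health_response(
    "svc-portal", "2.1.0",
    {"database": "ok", "redis": "ok", "external_api": "ok"},
)
print(status, body)
```

Whether an external API outage should fail the load balancer check is a design choice: failing it removes tasks from rotation even though restarting them would not help, so some teams report such checks without letting them affect the status code.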

Storage

Purpose: Durable storage for files, reports, static assets, and observability data.

Elastic File System (EFS)

Purpose: Persistent shared storage for mpac-obs observability stack.

Configuration:

  • Performance Mode: General Purpose
  • Throughput Mode: Bursting (250 MiB/s baseline, bursts to 600 MiB/s)
  • Encryption: At-rest encryption with AWS KMS
  • Backup: AWS Backup with daily snapshots (14-day retention)
  • Lifecycle Management: Transition to Infrequent Access after 30 days

Mount Targets:

  • Availability Zones: Multi-AZ (ap-southeast-1a, ap-southeast-1b)
  • Security Groups: Allow NFS port 2049 from ECS tasks
  • DNS: Auto-generated EFS DNS name for mounting

EFS File Systems:

| File System | Purpose | Size (Est.) | Mount Path | Lifecycle |
| --- | --- | --- | --- | --- |
| mpac-obs-prometheus | Prometheus TSDB | 100 GB | /mnt/prometheus | 14-day retention |
| mpac-obs-loki | Loki chunks & index | 200 GB | /mnt/loki | 14-day retention |
| mpac-obs-tempo | Tempo trace blocks | 400 GB | /mnt/tempo | 7-day retention |
| mpac-obs-grafana | Grafana dashboards | 10 GB | /var/lib/grafana | Permanent |

ECS Task Volume Configuration:

yaml
# ECS Task Definition for Prometheus
volumes:
  - name: prometheus-storage
    efsVolumeConfiguration:
      fileSystemId: fs-0abc123def456
      rootDirectory: /prometheus
      transitEncryption: ENABLED
      authorizationConfig:
        iam: ENABLED

containerDefinitions:
  - name: prometheus
    mountPoints:
      - sourceVolume: prometheus-storage
        containerPath: /prometheus
        readOnly: false

Cost Estimate (Monthly):

  • 710 GB Standard Storage: ~$210
  • Data Transfer: ~$50
  • Total: ~$260/month
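
The estimate can be cross-checked against the per-file-system sizes listed earlier. This sketch uses only figures from this section; the per-GB rate it prints is derived from them and is not a quoted AWS price.

```python
# Estimated sizes in GB, taken from the EFS file systems table.
estimated_sizes_gb = {
    "mpac-obs-prometheus": 100,
    "mpac-obs-loki": 200,
    "mpac-obs-tempo": 400,
    "mpac-obs-grafana": 10,
}

storage_cost_usd = 210   # monthly storage estimate from above
transfer_cost_usd = 50   # monthly data transfer estimate from above

total_gb = sum(estimated_sizes_gb.values())

# Per-GB monthly rate implied by the estimate (derived, not an AWS list price).
implied_rate = storage_cost_usd / total_gb

monthly_total = storage_cost_usd + transfer_cost_usd

print(f"total storage: {total_gb} GB")
print(f"implied rate: ${implied_rate:.3f} per GB-month")
print(f"monthly total: ${monthly_total}")
```

The sizes sum to the 710 GB used in the estimate, which implies a rate of just under $0.30 per GB-month before any savings from the Infrequent Access transition.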

S3 Buckets

mpac-app-apk-

  • Purpose: Android APK files for terminal updates
  • Versioning: Enabled
  • Lifecycle:
    • Keep latest 10 versions
    • Archive versions > 90 days to Glacier
  • Access: CloudFront distribution for fast downloads

mpac-receipts-

  • Purpose: Digital receipt and invoice PDFs
  • Retention: 7 years (regulatory requirement)
  • Encryption: SSE-S3 (AES-256)
  • Lifecycle:
    • Transition to Infrequent Access after 90 days
    • Archive to Glacier after 1 year

mpac-settlements-

  • Purpose: Daily settlement reports and CSVs
  • Retention: 10 years
  • Access Pattern: High read in first 30 days, archive after
  • Lifecycle:
    • Infrequent Access after 30 days
    • Glacier after 1 year

mpac-exports-

  • Purpose: User-generated export files (reports, data dumps)
  • Retention: 7 days
  • Lifecycle: Automatic deletion after 7 days

mpac-static-assets-

  • Purpose: Frontend static files (JS, CSS, images)
  • Versioning: Enabled
  • CloudFront: Distribution with edge caching
  • Cache-Control: max-age=31536000, immutable
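
The lifecycle rules for the receipts, settlements, and exports buckets can be expressed as a small lookup from object age to storage class. This is a sketch for reasoning about the rules, not the bucket configuration itself: the `LifecycleRule` type is hypothetical, ages are compared in whole days, and the 7-year and 10-year retention periods are not modeled as expirations because the document does not say objects are deleted at that point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LifecycleRule:
    ia_after_days: Optional[int]       # transition to Infrequent Access
    glacier_after_days: Optional[int]  # archive to Glacier
    expire_after_days: Optional[int]   # automatic deletion


RECEIPTS = LifecycleRule(ia_after_days=90, glacier_after_days=365, expire_after_days=None)
SETTLEMENTS = LifecycleRule(ia_after_days=30, glacier_after_days=365, expire_after_days=None)
EXPORTS = LifecycleRule(ia_after_days=None, glacier_after_days=None, expire_after_days=7)


def storage_class(rule, age_days):
    """Return the storage class an object of the given age would be in."""
    # Check the latest lifecycle stage first so the most advanced state wins.
    if rule.expire_after_days is not None and age_days >= rule.expire_after_days:
        return "EXPIRED"
    if rule.glacier_after_days is not None and age_days >= rule.glacier_after_days:
        return "GLACIER"
    if rule.ia_after_days is not None and age_days >= rule.ia_after_days:
        return "STANDARD_IA"
    return "STANDARD"


for age in (10, 120, 400):
    print(age, storage_class(RECEIPTS, age), storage_class(SETTLEMENTS, age))
```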

Bucket Policies

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceSSLOnly",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::mpac-receipts-prod",
        "arn:aws:s3:::mpac-receipts-prod/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}

Networking

Purpose: Secure network isolation and connectivity for all services.

VPC Architecture

  • CIDR Block: 10.0.0.0/16
  • Subnets:
    • Public Subnets: 3 AZs (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
      • Use: Load balancers, NAT gateways
    • Private Subnets: 3 AZs (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
      • Use: ECS tasks, application services
    • Database Subnets: 3 AZs (10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24)
      • Use: RDS, ElastiCache
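
The subnet plan can be validated with the standard library before it is committed to a template. This sketch checks that every subnet sits inside the VPC CIDR and that none overlap, then computes usable addresses per subnet given that AWS reserves five addresses in each subnet.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Subnet CIDRs per tier, as listed above.
TIERS = {
    "public": ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
    "private": ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"],
    "database": ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"],
}

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one

subnets = [ipaddress.ip_network(cidr) for cidrs in TIERS.values() for cidr in cidrs]

# Subnets that fall outside the VPC range (expected to be empty).
outside = [s for s in subnets if not s.subnet_of(VPC_CIDR)]

# Pairs of subnets whose ranges overlap (expected to be empty).
overlapping = [(a, b) for a, b in combinations(subnets, 2) if a.overlaps(b)]

usable = {str(s): s.num_addresses - AWS_RESERVED_PER_SUBNET for s in subnets}

print("outside VPC:", outside)
print("overlapping:", overlapping)
print("usable per subnet:", usable)
```

Each /24 provides 251 usable addresses. Because every Fargate task in awsvpc mode takes its own network interface, the three private subnets together bound the number of concurrently running tasks.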

NAT Gateway

  • Purpose: Outbound internet access for private subnets
  • Deployment: 1 per AZ for high availability
  • Bandwidth: Auto-scaling up to 45 Gbps

Security Groups

ALB Security Group

  • Inbound: 443 from 0.0.0.0/0, 80 from 0.0.0.0/0
  • Outbound: All traffic to ECS security group

ECS Security Group

  • Inbound: Ports 8001-8003 from ALB security group
  • Outbound: 443 to 0.0.0.0/0, 5432 to RDS security group, 6379 to Redis security group

RDS Security Group

  • Inbound: 5432 from ECS security group
  • Outbound: None

ElastiCache Security Group

  • Inbound: 6379 from ECS security group
  • Outbound: None
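
Security group rules are easy to get out of sync with target group ports, so a small cross-check can help. This sketch compares the ECS ingress range above with the target group ports listed in the Load Balancing section; the variable names are hypothetical, and it only models the single ALB-to-ECS rule documented here.

```python
# Ports the ECS security group accepts from the ALB security group (8001-8003 inclusive).
ECS_INGRESS_FROM_ALB = range(8001, 8004)

# Target group ports from the Load Balancing section.
TARGET_GROUP_PORTS = {
    "svc-portal": 8002,
    "svc-smarttab": 8001,
    "mpac-pgw": 8003,
    "mpac-frontend": 3000,
}

# Target groups whose port is not covered by the documented ingress rule.
uncovered = {
    name: port
    for name, port in TARGET_GROUP_PORTS.items()
    if port not in ECS_INGRESS_FROM_ALB
}

print("not covered by the ECS ingress rule:", uncovered)
```

The check flags mpac-frontend on port 3000, which may simply mean the frontend uses a security group that is not listed in this section.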

VPC Peering

  • Use Case: Cross-region connectivity for DR scenario
  • Configuration: Peering connection between production VPC and DR VPC
  • Route Tables: Updated to route DR region CIDR through peering connection

VPC Flow Logs

  • Destination: CloudWatch Logs
  • Traffic Type: ALL (accepted and rejected)
  • Retention: 30 days
  • Use Case: Security monitoring, troubleshooting

DNS

Purpose: Domain name resolution and traffic routing with health-based failover.

Route 53 Hosted Zones

Production Zone: mpac-cloud.com

  • Records:
    • portal.mpac-cloud.com → ALB alias record
    • api.mpac-cloud.com → ALB alias record (svc-smarttab, mpac-pgw)
    • cdn.mpac-cloud.com → CloudFront distribution

Staging Zone: mpac-cloud-stg.com

  • Purpose: Pre-production environment
  • Isolation: Separate hosted zone for staging

Alias Records

yaml
# Route 53 Alias Record for ALB
PortalDNSRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZoneId
    Name: portal.mpac-cloud.com
    Type: A
    AliasTarget:
      DNSName: !GetAtt ApplicationLoadBalancer.DNSName
      HostedZoneId: !GetAtt ApplicationLoadBalancer.CanonicalHostedZoneID
      EvaluateTargetHealth: true

Health Checks for Failover

  • Endpoint: https://portal.mpac-cloud.com/health
  • Protocol: HTTPS
  • Interval: 30 seconds
  • Failure Threshold: 3
  • Regions: 3 geographically distributed health checkers

Failover Routing Policy

yaml
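# Name, Type, HostedZoneId, and the alias target are omitted from both records
# for brevity; each record needs them, as in the alias record example.
# Primary Region (Tokyo)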
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    SetIdentifier: "primary-tokyo"
    Failover: PRIMARY
    HealthCheckId: !Ref HealthCheck

# Secondary Region (Osaka) - DR
SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    SetIdentifier: "secondary-osaka"
    Failover: SECONDARY

Failover Behavior:

  • Primary region healthy: All traffic to primary
  • Primary region unhealthy: Automatic failover to DR region within 60-90 seconds
  • Primary region recovered: Manual failback after verification


MPAC — MP-Solution Advanced Cloud Service