AWS Infrastructure
Part of: MPAC SmartPOS Cloud Platform - Product Requirements
Version: 2.0
Last Updated: 2026-01-28
Overview
This document details the AWS infrastructure architecture for the MPAC SmartPOS Cloud Platform. The infrastructure is designed to support 400,000+ concurrent devices, handle 80M transactions per day, and maintain 15,000 RPS sustained at peak. The architecture leverages AWS managed services for high availability, scalability, and operational efficiency.
Compute
Purpose: Scalable container orchestration for all microservices.
ECS Fargate
- Service: AWS ECS with Fargate launch type
- Configuration: Serverless container execution
- Benefits: No EC2 instance management, automatic scaling, pay-per-use
Auto Scaling Groups
- Per Service: Each microservice has independent scaling policies
- Scaling Triggers:
- CPU utilization > 70%
- Memory utilization > 80%
- Custom CloudWatch metrics (request count, latency)
Target Tracking Scaling Policies
- CPU Target: Maintain 60% average CPU utilization
- Memory Target: Maintain 70% average memory utilization
- Custom Metrics: Request count per target (ALB), WebSocket connections
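Conceptually, target tracking adjusts the desired task count in proportion to how far the metric is from its target. A minimal sketch of that calculation (the capacity bounds here are illustrative):

```python
import math

def target_tracking_desired(current_tasks: int, metric_value: float,
                            target_value: float,
                            min_tasks: int, max_tasks: int) -> int:
    """Proportional capacity calculation behind target-tracking scaling:
    desired = ceil(current * actual_metric / target_metric), clamped to bounds."""
    desired = math.ceil(current_tasks * metric_value / target_value)
    return max(min_tasks, min(max_tasks, desired))

# 10 tasks running at 90% average CPU against a 60% target -> scale out to 15
print(target_tracking_desired(10, 90.0, 60.0, min_tasks=2, max_tasks=20))  # 15
```

The scale-in/scale-out cooldowns in the policy below throttle how often this recalculation is allowed to change capacity.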
Example Configuration:
# ECS Service Auto Scaling
ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 60.0
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
ECS Services Deployed
Application Services:
| Service | Task Definition | Memory | CPU | Min | Max | Port |
|---|---|---|---|---|---|---|
| svc-portal | svc-portal:latest | 2 GB | 1 vCPU | 2 | 20 | 8002 |
| svc-smarttab | svc-smarttab:latest | 4 GB | 2 vCPU | 5 | 50 | 8080 |
| mpac-pgw | mpac-pgw:latest | 2 GB | 1 vCPU | 3 | 30 | 8080 |
Observability Services (mpac-obs):
| Service | Task Definition | Memory | CPU | Replicas | Port | Purpose |
|---|---|---|---|---|---|---|
| alloy | grafana/alloy:1.12.0 | 2 GB | 1 vCPU | 2 | 4317/4318 | OTLP collector |
| prometheus | prom/prometheus:3.8.0 | 4 GB | 2 vCPU | 2 | 9090 | Metrics storage |
| loki | grafana/loki:3.6.0 | 4 GB | 2 vCPU | 2 | 3100 | Log aggregation |
| tempo | grafana/tempo:2.9.0 | 4 GB | 2 vCPU | 2 | 3200 | Trace storage |
| grafana | grafana/grafana:12.3.0 | 2 GB | 1 vCPU | 2 | 3000 | Visualization |
Key Features:
- Service Discovery: Prometheus auto-discovers services via ECS tags (prometheus.io/scrape=true)
- Persistent Storage: EFS volumes for Prometheus, Loki, and Tempo data
- High Availability: All observability services run with 2+ replicas across multiple AZs
- OTLP Ingestion: Alloy receives metrics, logs, and traces on ports 4317 (gRPC) and 4318 (HTTP)
📄 Read more: Observability Stack
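Application services would point their OpenTelemetry SDK exporters at Alloy. A sketch of the exporter environment a task definition could carry, assuming the collector is reachable under the hostname alloy via service discovery (the variable names come from the OpenTelemetry specification; the service name is illustrative):

```python
# Standard OpenTelemetry exporter environment for an ECS task definition.
# The "alloy" hostname and svc-portal service name are assumptions; the
# endpoint targets Alloy's gRPC OTLP ingestion port (4317).
OTEL_ENV = {
    "OTEL_SERVICE_NAME": "svc-portal",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://alloy:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
}

# Rendered as ECS task-definition "environment" entries:
environment = [{"name": k, "value": v} for k, v in OTEL_ENV.items()]
print(environment)
```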
Database
Purpose: High-performance, highly available relational database for all services.
RDS PostgreSQL
- Engine Version: PostgreSQL 15+
- Deployment: Multi-AZ for automatic failover
- Instance Class: db.r6g.4xlarge (production)
- Memory: 128 GiB
- vCPUs: 16
- Network: Up to 10 Gbps
- Storage Configuration:
- Type: Provisioned IOPS SSD (io2)
- Initial Size: 1TB
- Auto-scaling: Enabled (max 5TB)
- IOPS: 10,000 baseline
Backup Strategy
- Automated Snapshots: Daily at 03:00 UTC
- Retention Period: 30 days
- Point-in-Time Recovery: Enabled (5-minute granularity)
- Cross-Region Backup: Weekly snapshots to DR region
Read Replicas
- Count: 2 read replicas
- Purpose: Distribute reporting and analytics queries
- Replication: Asynchronous replication with < 1 second lag
- Failover: Can be promoted to primary in disaster scenarios
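The replicas only relieve the primary if reporting traffic is actually routed to them. A minimal routing sketch with hypothetical endpoint names (note that the < 1 second replication lag means replica reads can be slightly stale):

```python
import itertools

# Hypothetical endpoints for illustration only
PRIMARY = "mpac-prod.cluster-xyz.rds.amazonaws.com"
REPLICAS = ["mpac-prod-ro-1.rds.amazonaws.com",
            "mpac-prod-ro-2.rds.amazonaws.com"]

_replica_cycle = itertools.cycle(REPLICAS)

def pick_endpoint(workload: str) -> str:
    """Writes and transactional reads go to the primary; reporting and
    analytics queries round-robin across the two read replicas."""
    if workload in ("write", "transactional"):
        return PRIMARY
    return next(_replica_cycle)

print(pick_endpoint("write"))      # primary endpoint
print(pick_endpoint("reporting"))  # a replica endpoint
print(pick_endpoint("analytics"))  # the other replica endpoint
```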
Connection Pooling:
# PgBouncer configuration
PgBouncer:
  MaxClientConnections: 10000
  DefaultPoolSize: 25
  PoolMode: transaction
Cache
Purpose: Low-latency caching layer for session data, rate limiting, and frequently accessed data.
ElastiCache Redis
- Engine Version: Redis 7+
- Deployment Mode: Cluster mode enabled
- Instance Type: cache.r6g.large
- Memory: 13.07 GiB
- vCPUs: 2
- Network Performance: Up to 10 Gbps
High Availability
- Multi-AZ: Enabled with automatic failover
- Replication: 2 read replicas per shard
- Sharding: 3 shards for horizontal scaling
- Failover Time: < 30 seconds
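With cluster mode enabled, each key maps to one of 16,384 hash slots (CRC16 of the key, or of its {...} hash tag if present), and the slots are divided across the 3 shards. A sketch of the slot calculation, useful for predicting which keys co-locate on a shard:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (poly 0x1021, init 0), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Slot = CRC16 of the hash tag (content of the first non-empty {...}),
    or of the whole key if no tag exists, modulo 16384."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Keys sharing a hash tag land in the same slot, hence the same shard:
print(key_slot("session:{merchant42}:a") == key_slot("cart:{merchant42}:b"))  # True
```

Hash tags matter operationally: multi-key operations only work on keys in the same slot, so related keys (e.g. per-merchant state) should share a tag.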
Backup and Recovery
- Snapshot Schedule: Daily at 04:00 UTC
- Retention: 7 days
- AOF (Append-Only File): Enabled for durability
Use Cases:
- Session storage (JWT blacklist, user sessions)
- Rate limiting counters
- Idempotency key cache (24h TTL)
- Frequently accessed merchant/store configs
- Real-time device status
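The 24h idempotency cache typically hinges on Redis SET-NX-EX semantics: only the first writer for a key wins until the TTL expires. A minimal in-memory sketch of that check (with redis-py the equivalent call would be r.set(key, value, nx=True, ex=86400)):

```python
import time
from typing import Dict, Optional, Tuple

_store: Dict[str, Tuple[str, float]] = {}

def set_nx_ex(key: str, value: str, ttl_seconds: int,
              now: Optional[float] = None) -> bool:
    """Mimic Redis SET NX EX: store only if the key is absent or expired.
    Returns True when this caller won (first request), False on a duplicate."""
    now = time.time() if now is None else now
    entry = _store.get(key)
    if entry is not None and entry[1] > now:
        return False
    _store[key] = (value, now + ttl_seconds)
    return True

# First attempt for an idempotency key is accepted, the retry is rejected:
print(set_nx_ex("idem:txn-123", "req-a", ttl_seconds=86400))  # True
print(set_nx_ex("idem:txn-123", "req-b", ttl_seconds=86400))  # False
```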
Load Balancing
Purpose: Distribute traffic across service instances with health monitoring.
Application Load Balancer (ALB)
- Use Case: HTTP/HTTPS traffic for REST APIs
- Listeners:
- Port 443 (HTTPS) - Primary
- Port 80 (HTTP) - Redirect to 443
- SSL/TLS:
- Certificate: AWS ACM managed
- Security Policy: ELBSecurityPolicy-TLS13-1-2-2021-06
Target Groups
- svc-portal: Port 8002
- svc-smarttab: Port 8001
- mpac-pgw: Port 8003
- mpac-frontend: Port 3000
Network Load Balancer (NLB)
- Use Case: WebSocket connections (if needed for persistent connections)
- Listeners:
- Port 443 (TLS termination)
- Connection Draining: 300 seconds
Health Checks
- Endpoint: GET /health
- Interval: 30 seconds
- Timeout: 5 seconds
- Healthy Threshold: 2 consecutive successes
- Unhealthy Threshold: 3 consecutive failures
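The 2-success/3-failure thresholds mean a target changes state only after consecutive results, which suppresses flapping on a single slow response. A sketch of that evaluation:

```python
class TargetHealth:
    """Track ALB-style target health: unhealthy after 3 consecutive failed
    checks, healthy again after 2 consecutive successes."""
    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results opposing the current state

    def record(self, check_ok: bool) -> bool:
        if check_ok == self.healthy:
            self._streak = 0  # result agrees with current state; reset streak
        else:
            self._streak += 1
            needed = (self.healthy_threshold if not self.healthy
                      else self.unhealthy_threshold)
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy

t = TargetHealth()
print([t.record(ok) for ok in [False, False, False, True, True]])
# [True, True, False, False, True] -> down after 3 failures, up after 2 successes
```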
Health Check Response:
{
  "status": "healthy",
  "service": "svc-portal",
  "version": "2.1.0",
  "timestamp": "2026-01-28T10:00:00Z",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "external_api": "ok"
  }
}
Storage
Purpose: Durable storage for files, reports, static assets, and observability data.
Elastic File System (EFS)
Purpose: Persistent shared storage for mpac-obs observability stack.
Configuration:
- Performance Mode: General Purpose
- Throughput Mode: Bursting (250 MiB/s baseline, bursts to 600 MiB/s)
- Encryption: At-rest encryption with AWS KMS
- Backup: AWS Backup with daily snapshots (14-day retention)
- Lifecycle Management: Transition to Infrequent Access after 30 days
Mount Targets:
- Availability Zones: Multi-AZ (ap-southeast-1a, ap-southeast-1b)
- Security Groups: Allow NFS port 2049 from ECS tasks
- DNS: Auto-generated EFS DNS name for mounting
EFS File Systems:
| File System | Purpose | Size (Est.) | Mount Path | Lifecycle |
|---|---|---|---|---|
| mpac-obs-prometheus | Prometheus TSDB | 100 GB | /mnt/prometheus | 14-day retention |
| mpac-obs-loki | Loki chunks & index | 200 GB | /mnt/loki | 14-day retention |
| mpac-obs-tempo | Tempo trace blocks | 400 GB | /mnt/tempo | 7-day retention |
| mpac-obs-grafana | Grafana dashboards | 10 GB | /var/lib/grafana | Permanent |
ECS Task Volume Configuration:
# ECS Task Definition for Prometheus
volumes:
  - name: prometheus-storage
    efsVolumeConfiguration:
      fileSystemId: fs-0abc123def456
      rootDirectory: /prometheus
      transitEncryption: ENABLED
      authorizationConfig:
        iam: ENABLED
containerDefinitions:
  - name: prometheus
    mountPoints:
      - sourceVolume: prometheus-storage
        containerPath: /prometheus
        readOnly: false
Cost Estimate (Monthly):
- 710 GB Standard Storage: ~$210
- Data Transfer: ~$50
- Total: ~$260/month
S3 Buckets
mpac-app-apk-
- Purpose: Android APK files for terminal updates
- Versioning: Enabled
- Lifecycle:
- Keep latest 10 versions
- Archive versions > 90 days to Glacier
- Access: CloudFront distribution for fast downloads
mpac-receipts-
- Purpose: Digital receipt and invoice PDFs
- Retention: 7 years (regulatory requirement)
- Encryption: SSE-S3 (AES-256)
- Lifecycle:
- Transition to Infrequent Access after 90 days
- Archive to Glacier after 1 year
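The receipt lifecycle above can be expressed as an S3 lifecycle configuration. A sketch in the dict shape boto3's put_bucket_lifecycle_configuration accepts (the expiry at the 7-year mark assumes objects should be deleted once the regulatory retention lapses; the bucket name is illustrative):

```python
# Lifecycle for the receipts bucket: Infrequent Access at 90 days,
# Glacier at 1 year, deletion at the 7-year retention boundary.
RECEIPTS_LIFECYCLE = {
    "Rules": [
        {
            "ID": "receipts-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 7 * 365},
        }
    ]
}

# Would be applied with (bucket name illustrative):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="mpac-receipts-prod",
#       LifecycleConfiguration=RECEIPTS_LIFECYCLE)
print(RECEIPTS_LIFECYCLE["Rules"][0]["ID"])
```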
mpac-settlements-
- Purpose: Daily settlement reports and CSVs
- Retention: 10 years
- Access Pattern: High read in first 30 days, archive after
- Lifecycle:
- Infrequent Access after 30 days
- Glacier after 1 year
mpac-exports-
- Purpose: User-generated export files (reports, data dumps)
- Retention: 7 days
- Lifecycle: Automatic deletion after 7 days
mpac-static-assets-
- Purpose: Frontend static files (JS, CSS, images)
- Versioning: Enabled
- CloudFront: Distribution with edge caching
- Cache-Control:
max-age=31536000, immutable
Bucket Policies
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceSSLOnly",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::mpac-receipts-prod/*",
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}
Networking
Purpose: Secure network isolation and connectivity for all services.
VPC Architecture
- CIDR Block: 10.0.0.0/16
- Subnets:
- Public Subnets: 3 AZs (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
- Use: Load balancers, NAT gateways
- Private Subnets: 3 AZs (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
- Use: ECS tasks, application services
- Database Subnets: 3 AZs (10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24)
- Use: RDS, ElastiCache
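A quick sanity check that the nine /24 subnets sit inside the /16 and do not overlap, using the standard-library ipaddress module:

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = [ipaddress.ip_network(f"10.0.{n}.0/24")
           for n in (1, 2, 3,       # public
                     10, 11, 12,    # private (application)
                     20, 21, 22)]   # database

# Every subnet must fall inside the VPC CIDR, and no two may overlap.
assert all(s.subnet_of(VPC) for s in SUBNETS)
assert not any(a.overlaps(b) for a, b in combinations(SUBNETS, 2))

print(f"{len(SUBNETS)} subnets, "
      f"{sum(s.num_addresses for s in SUBNETS)} of {VPC.num_addresses} "
      f"addresses allocated")
```

The gaps between the ranges (10.0.4-9.x, 10.0.13-19.x, and everything above 10.0.22.x) leave room to add subnets per tier without renumbering.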
NAT Gateway
- Purpose: Outbound internet access for private subnets
- Deployment: 1 per AZ for high availability
- Bandwidth: Auto-scaling up to 45 Gbps
Security Groups
ALB Security Group
- Inbound: 443 from 0.0.0.0/0, 80 from 0.0.0.0/0
- Outbound: All traffic to ECS security group
ECS Security Group
- Inbound: Ports 8001-8003 from ALB security group
- Outbound: 443 to 0.0.0.0/0, 5432 to RDS security group, 6379 to Redis security group
RDS Security Group
- Inbound: 5432 from ECS security group
- Outbound: None
ElastiCache Security Group
- Inbound: 6379 from ECS security group
- Outbound: None
VPC Peering
- Use Case: Cross-region connectivity for DR scenario
- Configuration: Peering connection between production VPC and DR VPC
- Route Tables: Updated to route DR region CIDR through peering connection
VPC Flow Logs
- Destination: CloudWatch Logs
- Traffic Type: ALL (accepted and rejected)
- Retention: 30 days
- Use Case: Security monitoring, troubleshooting
DNS
Purpose: Domain name resolution and traffic routing with health-based failover.
Route 53 Hosted Zones
Production Zone: mpac-cloud.com
- Records:
- portal.mpac-cloud.com → ALB alias record
- api.mpac-cloud.com → ALB alias record (svc-smarttab, mpac-pgw)
- cdn.mpac-cloud.com → CloudFront distribution
Staging Zone: mpac-cloud-stg.com
- Purpose: Pre-production environment
- Isolation: Separate hosted zone for staging
Alias Records
# Route 53 Alias Record for ALB
PortalDNSRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZoneId
    Name: portal.mpac-cloud.com
    Type: A
    AliasTarget:
      DNSName: !GetAtt ApplicationLoadBalancer.DNSName
      HostedZoneId: !GetAtt ApplicationLoadBalancer.CanonicalHostedZoneID
      EvaluateTargetHealth: true
Health Checks for Failover
- Endpoint: https://portal.mpac-cloud.com/health
- Protocol: HTTPS
- Interval: 30 seconds
- Failure Threshold: 3
- Regions: 3 geographically distributed health checkers
Failover Routing Policy
# Primary Region (Tokyo)
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    SetIdentifier: "primary-tokyo"
    Failover: PRIMARY
    HealthCheckId: !Ref HealthCheck

# Secondary Region (Osaka) - DR
SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    SetIdentifier: "secondary-osaka"
    Failover: SECONDARY
Failover Behavior:
- Primary region healthy: All traffic to primary
- Primary region unhealthy: Automatic failover to DR region within 60-90 seconds
- Primary region recovered: Manual failback after verification
Cross-References
Related Sections
- Deployment Strategy - Blue-green deployment process
- Environments - Environment-specific configurations
- CI/CD Pipeline - Automated deployment workflows
Related Technical Sections
- Security Architecture - Network security, encryption
- Performance & Scalability - Capacity planning
- Database Architecture - RDS configuration details
Navigation
Next: Deployment Strategy
Up: Deployment Index