Observability Stack (mpac-obs)
Part of: MPAC SmartPOS Cloud Platform - Product Requirements
Version: 2.0 | Last Updated: 2026-01-28
Overview
mpac-obs is the centralized observability stack for the MPAC SmartPOS platform, providing comprehensive monitoring, logging, tracing, and alerting capabilities. The stack collects telemetry data from all services (svc-portal, svc-smarttab, mpac-pgw) and infrastructure components, enabling real-time visibility into system health, performance, and business metrics.
Stack Components:
- Prometheus (v3.8) - Time-series metrics storage and querying
- Loki (v3.6) - Log aggregation and indexing
- Tempo (v2.9) - Distributed tracing backend
- Grafana (v12.3) - Unified visualization and dashboards
- Alloy (v1.12) - OTLP collector for metrics, logs, and traces
Deployment Models:
- Local Development: Docker Compose
- AWS Production: ECS Fargate with CloudFormation
Table of Contents
- Directory Structure
- Local Development Setup
- AWS Deployment
- Service Integration
- Configuration Files
- Access & Credentials
- Data Retention
- Performance Considerations
Directory Structure
mpac-obs/
├── docker-compose.yml          # Local development stack
├── .env.example                # Environment variables template
│
├── alloy/                      # OTLP collector configuration
│   ├── config.alloy            # Main configuration file
│   └── README.md               # Alloy setup guide
│
├── prometheus/                 # Metrics storage configuration
│   ├── prometheus.yml          # Scrape configs and rules
│   ├── alerts.yml              # Alert rules
│   └── recording_rules.yml     # Recording rules for aggregations
│
├── loki/                       # Log aggregation configuration
│   ├── loki-config.yml         # Storage and ingestion config
│   └── README.md               # Loki setup guide
│
├── tempo/                      # Distributed tracing configuration
│   ├── tempo.yml               # Trace storage and query config
│   └── README.md               # Tempo setup guide
│
├── grafana/                    # Dashboards and provisioning
│   ├── provisioning/
│   │   ├── datasources/        # Auto-configure data sources
│   │   │   ├── prometheus.yml
│   │   │   ├── loki.yml
│   │   │   └── tempo.yml
│   │   └── dashboards/         # Auto-import dashboards
│   │       ├── dashboards.yml
│   │       ├── service-health.json
│   │       ├── payment-processing.json
│   │       ├── device-fleet.json
│   │       ├── database-performance.json
│   │       └── business-metrics.json
│   └── README.md               # Grafana setup guide
│
└── cloudformation/             # AWS deployment (ECS Fargate)
    ├── Makefile                # Deployment automation
    ├── parameters.json         # CloudFormation parameters
    ├── mpac-obs-stack.yml      # Main CloudFormation template
    ├── networking.yml          # VPC, subnets, security groups
    ├── ecs-cluster.yml         # ECS cluster and services
    ├── storage.yml             # EFS for persistent storage
    └── README.md               # AWS deployment guide
Local Development Setup
Prerequisites
- Docker Desktop (or Docker Engine + Docker Compose)
- 8GB+ RAM available for Docker
- Ports available: 3000 (Grafana), 9090 (Prometheus), 4317/4318 (Alloy)
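Before starting the stack, the required host ports can be checked with a short script. This is a sketch using only the Python standard library; the port list combines the ports named above with the other local service ports (3100, 3200, 12345) used by this stack:

```python
import socket

# Host ports the local stack binds: Grafana, Prometheus, Loki, Tempo,
# Alloy OTLP gRPC/HTTP, and the Alloy UI.
REQUIRED_PORTS = [3000, 9090, 3100, 3200, 4317, 4318, 12345]

def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind the port, i.e. nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

if __name__ == "__main__":
    busy = [p for p in REQUIRED_PORTS if not is_port_free(p)]
    if busy:
        print(f"Ports already in use: {busy}")
    else:
        print("All required ports are free")
```

Run it before `docker compose up -d`; any port listed as busy will cause the corresponding container to fail to publish.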
Quick Start
cd mpac-obs
# Copy environment template
cp .env.example .env
# Start all services
docker compose up -d
# Verify services are running
docker compose ps
# View logs
docker compose logs -f
# Stop all services
docker compose down
# Stop and remove volumes (clean slate)
docker compose down -v
Service Endpoints (Local)
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin (change on first login) |
| Prometheus | http://localhost:9090 | - |
| Loki | http://localhost:3100 | - |
| Tempo | http://localhost:3200 | - |
| Alloy (OTLP) | grpc://localhost:4317, http://localhost:4318 | - |
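The endpoints above can be smoke-tested with a small script. This is a sketch using only the Python standard library; the health paths (`/api/health`, `/-/healthy`, `/ready`) are the standard readiness endpoints for these components, but verify them against the versions you run:

```python
import urllib.request
from urllib.error import URLError

# Health/readiness paths for each local service.
HEALTH_CHECKS = {
    "grafana":    "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
    "loki":       "http://localhost:3100/ready",
    "tempo":      "http://localhost:3200/ready",
}

def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        return False

if __name__ == "__main__":
    for name, url in HEALTH_CHECKS.items():
        print(f"{name:11s} {'UP' if check(url) else 'DOWN'} ({url})")
```

Note that Loki and Tempo report `/ready` only after their ingesters have joined the ring, which can take a few seconds after startup.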
Docker Compose Configuration
Services Defined:
version: '3.8'

services:
  # OTLP Collector
  alloy:
    image: grafana/alloy:1.12.0
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "12345:12345" # Alloy UI
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy
    command: run --server.http.listen-addr=0.0.0.0:12345 /etc/alloy/config.alloy
    networks:
      - mpac-obs

  # Metrics Storage
  prometheus:
    image: prom/prometheus:v3.8.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=14d'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'  # accept remote-write from Alloy
    networks:
      - mpac-obs

  # Log Aggregation
  loki:
    image: grafana/loki:3.6.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki-data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - mpac-obs

  # Distributed Tracing
  tempo:
    image: grafana/tempo:2.9.0
    ports:
      - "3200:3200" # Tempo HTTP
    # Tempo's OTLP receiver (4317) is reached by Alloy over the internal
    # network and is intentionally not published to the host, which would
    # conflict with Alloy's 4317 binding.
    volumes:
      - ./tempo/tempo.yml:/etc/tempo.yml
      - tempo-data:/tmp/tempo
    command: -config.file=/etc/tempo.yml
    networks:
      - mpac-obs

  # Visualization
  grafana:
    image: grafana/grafana:12.3.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    networks:
      - mpac-obs

volumes:
  prometheus-data:
  loki-data:
  tempo-data:
  grafana-data:

networks:
  mpac-obs:
    driver: bridge
Verifying Local Setup
# Check Prometheus is scraping targets
curl http://localhost:9090/api/v1/targets
# Send test log to Loki
curl -X POST http://localhost:3100/loki/api/v1/push \
-H "Content-Type: application/json" \
-d '{"streams":[{"stream":{"service":"test"},"values":[["'$(date +%s)000000000'","test log message"]]}]}'
# Query Loki logs
curl -G http://localhost:3100/loki/api/v1/query \
--data-urlencode 'query={service="test"}'
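The Loki push above can also be scripted; this is a sketch in Python using only the standard library, mirroring the payload shape of the curl call:

```python
import json
import time
import urllib.request

def loki_payload(service: str, message: str) -> dict:
    """Build a Loki push payload: one stream, one log line, ns-precision timestamp."""
    ts_ns = str(time.time_ns())
    return {"streams": [{"stream": {"service": service},
                         "values": [[ts_ns, message]]}]}

def push_log(base_url: str, service: str, message: str) -> int:
    """POST a single log line to Loki; returns the HTTP status (204 on success)."""
    req = urllib.request.Request(
        f"{base_url}/loki/api/v1/push",
        data=json.dumps(loki_payload(service, message)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

if __name__ == "__main__":
    print(push_log("http://localhost:3100", "test", "test log message"))
```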
# Send test trace to Tempo (via Alloy OTLP endpoint)
# Use OpenTelemetry SDK from application services
AWS Deployment
Architecture
┌─────────────────────────────────────────────────────────────┐
│ AWS Region │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ ECS Cluster: mpac-obs-cluster │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Alloy │ │Prometheus│ │ Loki │ │ │
│ │ │(Fargate) │ │(Fargate) │ │(Fargate) │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │
│ │ ┌────▼─────────────▼─────────────▼──────┐ │ │
│ │ │ Tempo (Fargate) │ │ │
│ │ └────────────────┬───────────────────────┘ │ │
│ │ │ │ │
│ │ ┌────────────────▼───────────────────────┐ │ │
│ │ │ Grafana (Fargate) │ │ │
│ │ └────────────────┬───────────────────────┘ │ │
│ └───────────────────┼───────────────────────────────────┘ │
│ │ │
│ ┌───────────────────▼───────────────────────┐ │
│ │ Application Load Balancer │ │
│ │ - grafana.obs.mpac-cloud.com │ │
│ │ - alloy.obs.mpac-cloud.com:4317/4318 │ │
│ └───────────────────┬───────────────────────┘ │
│ │ │
│ ┌───────────────────▼───────────────────────┐ │
│ │ Amazon EFS (Persistent Storage) │ │
│ │ - Prometheus TSDB │ │
│ │ - Loki chunks │ │
│ │ - Tempo blocks │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
CloudFormation Deployment
Prerequisites:
- AWS CLI configured with appropriate credentials
- Route 53 hosted zone for DNS (e.g., mpac-cloud.com)
- ACM certificate for *.obs.mpac-cloud.com
Deployment Steps:
cd mpac-obs/cloudformation
# 1. Configure parameters
cp parameters.json.example parameters.json
# Edit parameters.json with your AWS account details
# 2. Deploy full stack (networking + ECS + services)
make deploy-full PARENT_HOSTED_ZONE_ID=Z1234567890ABC
# 3. Check deployment status
make status
# 4. Show service endpoints
make show-endpoints
# 5. Update existing stack
make update
# 6. Delete stack (WARNING: destroys all data)
make delete-stack
CloudFormation Parameters:
{
  "Parameters": {
    "Environment": "production",
    "VpcCIDR": "10.20.0.0/16",
    "PublicSubnet1CIDR": "10.20.1.0/24",
    "PublicSubnet2CIDR": "10.20.2.0/24",
    "PrivateSubnet1CIDR": "10.20.10.0/24",
    "PrivateSubnet2CIDR": "10.20.11.0/24",
    "DomainName": "obs.mpac-cloud.com",
    "ParentHostedZoneId": "Z1234567890ABC",
    "CertificateArn": "arn:aws:acm:ap-southeast-1:123456789012:certificate/abc-def-123",
    "PrometheusRetentionDays": "14",
    "LokiRetentionDays": "14",
    "TempoRetentionDays": "7"
  }
}
Service Configuration:
| Service | Task Definition | Memory | CPU | Replicas |
|---|---|---|---|---|
| Alloy | alloy:1.12.0 | 2GB | 1 vCPU | 2 |
| Prometheus | prometheus:3.8.0 | 4GB | 2 vCPU | 2 |
| Loki | loki:3.6.0 | 4GB | 2 vCPU | 2 |
| Tempo | tempo:2.9.0 | 4GB | 2 vCPU | 2 |
| Grafana | grafana:12.3.0 | 2GB | 1 vCPU | 2 |
AWS Endpoints (Production)
| Service | URL | Access |
|---|---|---|
| Grafana | https://grafana.obs.mpac-cloud.com | SSO / IAM auth |
| Alloy (OTLP) | grpc://alloy.obs.mpac-cloud.com:4317 | VPC internal |
| Prometheus | http://prometheus.obs.mpac-cloud.internal:9090 | VPC internal |
| Loki | http://loki.obs.mpac-cloud.internal:3100 | VPC internal |
| Tempo | http://tempo.obs.mpac-cloud.internal:3200 | VPC internal |
Service Integration
Application Configuration
Environment Variables for Services:
# svc-portal, svc-smarttab, mpac-pgw
export OTLP_ENDPOINT="http://alloy.obs.mpac-cloud.internal:4318" # HTTP
export OTLP_ENDPOINT_GRPC="http://alloy.obs.mpac-cloud.internal:4317" # gRPC
export OTEL_SERVICE_NAME="svc-smarttab"
export OTEL_RESOURCE_ATTRIBUTES="environment=production,region=ap-southeast-1"
Service Auto-Discovery (AWS)
Alloy automatically discovers services from ECS task metadata:
// alloy/config.alloy
discovery.ecs "services" {
  region = "ap-southeast-1"

  // Discover tasks with the prometheus.io/scrape=true label
  filter {
    name   = "tag:prometheus.io/scrape"
    values = ["true"]
  }
}

// Scrape discovered services
prometheus.scrape "ecs_services" {
  targets    = discovery.ecs.services.targets
  forward_to = [prometheus.remote_write.default.receiver]
}
ECS Task Definition Labels:
{
  "containerDefinitions": [{
    "name": "svc-smarttab",
    "dockerLabels": {
      "prometheus.io/scrape": "true",
      "prometheus.io/port": "8080",
      "prometheus.io/path": "/metrics"
    }
  }]
}
Instrumentation Examples
Python (svc-portal):
# requirements.txt
opentelemetry-api==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-instrumentation-fastapi==0.41b0
opentelemetry-exporter-otlp==1.20.0
prometheus-client

# main.py
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTLP_ENDPOINT_GRPC", "http://localhost:4317"),
    insecure=True,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Auto-instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

# Expose Prometheus metrics
from prometheus_client import make_asgi_app
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
Go (svc-smarttab, mpac-pgw):
// go.mod
require (
    go.opentelemetry.io/otel v1.20.0
    go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.20.0
    go.opentelemetry.io/otel/sdk v1.20.0
    github.com/prometheus/client_golang v1.17.0
)

// main.go
package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func initTracing() error {
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("alloy.obs.mpac-cloud.internal:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
    return nil
}

func main() {
    if err := initTracing(); err != nil {
        log.Fatalf("init tracing: %v", err)
    }

    // Expose Prometheus metrics
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Configuration Files
Alloy Configuration (OTLP Collector)
File: alloy/config.alloy
// OTLP receiver for logs, metrics, traces
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch processor (improves throughput)
otelcol.processor.batch "default" {
  timeout         = "5s"
  send_batch_size = 1000
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Export metrics to Prometheus via remote write
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

// Export logs to Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Export traces to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls {
      insecure = true
    }
  }
}
Prometheus Configuration
File: prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'mpac-production'
    region: 'ap-southeast-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

# Load alert rules
rule_files:
  - 'alerts.yml'
  - 'recording_rules.yml'

# Scrape configurations
scrape_configs:
  # Self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # EC2-based discovery of service hosts (AWS)
  - job_name: 'ecs-services'
    ec2_sd_configs:
      - region: ap-southeast-1
        port: 8080
        filters:
          - name: tag:prometheus.io/scrape
            values: ['true']
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: service_name

  # Static targets (local development)
  - job_name: 'svc-portal'
    static_configs:
      - targets: ['svc-portal:8002']
        labels:
          service: 'svc-portal'

  - job_name: 'svc-smarttab'
    static_configs:
      - targets: ['svc-smarttab:8080']
        labels:
          service: 'svc-smarttab'

  - job_name: 'mpac-pgw'
    static_configs:
      - targets: ['mpac-pgw:8080']
        labels:
          service: 'mpac-pgw'
Loki Configuration
File: loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

ingester:
  chunk_idle_period: 5m
  chunk_retain_period: 30s

# Loki 3.x requires the TSDB index with schema v13; the legacy
# boltdb-shipper/v11 schema and table_manager config were removed.
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7 days
  retention_period: 336h            # 14 days

# Retention deletes are handled by the compactor in Loki 3.x
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
Tempo Configuration
File: tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h # 7 days

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    wal:
      path: /tmp/tempo/wal
    pool:
      max_workers: 100
      queue_depth: 10000
Access & Credentials
Local Development
| Service | Default Credentials | Change Method |
|---|---|---|
| Grafana | admin / admin | Change on first login |
| Prometheus | No auth | - |
| Loki | No auth | - |
| Tempo | No auth | - |
AWS Production
| Service | Authentication | Authorization |
|---|---|---|
| Grafana | AWS SSO / SAML | Role-based (Viewer, Editor, Admin) |
| Internal Services | VPC security groups | Network-level isolation |
Grafana SSO Configuration (AWS):
# Environment variables for Grafana container
GF_AUTH_GENERIC_OAUTH_ENABLED=true
GF_AUTH_GENERIC_OAUTH_NAME=AWS SSO
GF_AUTH_GENERIC_OAUTH_CLIENT_ID=${SSO_CLIENT_ID}
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET=${SSO_CLIENT_SECRET}
GF_AUTH_GENERIC_OAUTH_SCOPES=openid profile email
GF_AUTH_GENERIC_OAUTH_AUTH_URL=https://portal.sso.ap-southeast-1.amazonaws.com/oauth2/authorize
GF_AUTH_GENERIC_OAUTH_TOKEN_URL=https://portal.sso.ap-southeast-1.amazonaws.com/oauth2/token
GF_AUTH_GENERIC_OAUTH_API_URL=https://portal.sso.ap-southeast-1.amazonaws.com/oauth2/userInfo
Data Retention
| Component | Default Retention | Configurable | Storage Location |
|---|---|---|---|
| Prometheus | 14 days | Yes (via --storage.tsdb.retention.time) | EFS: /mnt/prometheus |
| Loki | 14 days | Yes (retention_period in config) | EFS: /mnt/loki/chunks |
| Tempo | 7 days | Yes (block_retention in config) | EFS: /mnt/tempo/blocks |
| Grafana Dashboards | Permanent | - | EFS: /var/lib/grafana |
Adjusting Retention (AWS):
# Update CloudFormation parameters
# (e.g., increase Prometheus and Loki to 30 days, Tempo to 14 days;
# JSON does not allow inline comments)
{
  "PrometheusRetentionDays": "30",
  "LokiRetentionDays": "30",
  "TempoRetentionDays": "14"
}

# Deploy update
make update
Estimated Storage Requirements:
| Retention Period | Prometheus | Loki | Tempo | Total |
|---|---|---|---|---|
| 7 days | 50 GB | 100 GB | 200 GB | 350 GB |
| 14 days | 100 GB | 200 GB | 400 GB | 700 GB |
| 30 days | 200 GB | 400 GB | 800 GB | 1.4 TB |
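The table scales roughly linearly with retention, so intermediate periods can be extrapolated. The sketch below derives per-component daily rates from the 7-day row; these are rough sizing assumptions only (real growth depends on ingest volume, and the 30-day row above is rounded, so it differs slightly from a strict linear estimate):

```python
# Approximate daily growth in GB, derived from the 7-day row
# (50, 100 and 200 GB over 7 days). Assumptions for sizing only.
DAILY_GB = {
    "prometheus": 50 / 7,
    "loki":       100 / 7,
    "tempo":      200 / 7,
}

def estimate_storage_gb(retention_days: int) -> dict:
    """Linear storage estimate per component plus total, in GB."""
    est = {name: rate * retention_days for name, rate in DAILY_GB.items()}
    est["total"] = sum(est.values())
    return est

if __name__ == "__main__":
    for days in (7, 14, 30):
        print(days, "days:", round(estimate_storage_gb(days)["total"]), "GB")
```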
Performance Considerations
Resource Sizing
Local Development (Docker):
- Minimum: 8 GB RAM, 4 CPU cores
- Recommended: 16 GB RAM, 8 CPU cores
- Storage: 50 GB SSD
AWS Production (Per Service):
| Service | Memory | CPU | Storage (EFS) | Cost/Month |
|---|---|---|---|---|
| Alloy | 2 GB | 1 vCPU | - | ~$30 |
| Prometheus | 4 GB | 2 vCPU | 100 GB | ~$90 |
| Loki | 4 GB | 2 vCPU | 200 GB | ~$110 |
| Tempo | 4 GB | 2 vCPU | 400 GB | ~$150 |
| Grafana | 2 GB | 1 vCPU | 10 GB | ~$35 |
| Total | 16 GB | 8 vCPU | 710 GB | ~$415/month |
Query Performance
Prometheus Query Best Practices:
# Good: Specific time range
rate(http_requests_total[5m])

# Bad: Large time range (slow)
rate(http_requests_total[1d])

# Good: Pre-aggregated recording rules
job:http_requests:rate5m

# Good: Limit cardinality by aggregating away high-cardinality labels
sum by (service, status) (rate(http_requests_total[5m]))
Loki Query Best Practices:
# Good: Use labels for filtering
{service="svc-smarttab", level="error"} |= "payment failed"
# Bad: Full text search without labels (very slow)
{} |= "payment failed"
# Good: Time-bounded queries
count_over_time({service="svc-smarttab"}[5m])
Scaling Considerations
Horizontal Scaling (AWS):
- Prometheus: Use federation or remote write to multiple instances
- Loki: Deploy multiple ingesters with consistent hashing
- Tempo: Scale distributors independently from ingesters
- Grafana: Use RDS PostgreSQL for dashboard/user storage
Vertical Scaling:
- Increase ECS task memory/CPU when query latency increases
- Monitor memory usage: scale up at 80% utilization
- Monitor CPU usage: scale up at 70% utilization
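The vertical-scaling thresholds above can be encoded in a small helper for dashboards or automation. This is an illustrative sketch; only the 80% memory and 70% CPU thresholds come from this section, and in production the equivalent logic would normally live in ECS/CloudWatch target-tracking policies:

```python
MEMORY_SCALE_UP_PCT = 80.0  # scale up when memory utilization reaches 80%
CPU_SCALE_UP_PCT = 70.0     # scale up when CPU utilization reaches 70%

def should_scale_up(memory_pct: float, cpu_pct: float) -> bool:
    """True if either resource has crossed its scale-up threshold."""
    return memory_pct >= MEMORY_SCALE_UP_PCT or cpu_pct >= CPU_SCALE_UP_PCT

if __name__ == "__main__":
    print(should_scale_up(85.0, 40.0))  # memory-driven scale-up
    print(should_scale_up(50.0, 30.0))  # within limits
```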
See Also
Related Deployment:
- AWS Infrastructure - ECS cluster and networking
- Deployment Strategy - Deployment procedures
- Environments - Environment configurations
- CI/CD Pipeline - Automated deployments
Related Technical:
- Performance & Scalability - Detailed observability implementation
- Communication Patterns - Service integration
- Security Architecture - Security monitoring
Related Domains:
- Reporting & Analytics - Business metrics
- Payment Gateway - Payment processing metrics
Navigation: ↑ Back to Deployment Index | ← Previous: Environments | Next: CI/CD Pipeline →