
Observability Stack (mpac-obs)

Part of: MPAC SmartPOS Cloud Platform - Product Requirements
Version: 2.0 | Last Updated: 2026-01-28


Overview

mpac-obs is the centralized observability stack for the MPAC SmartPOS platform, providing comprehensive monitoring, logging, tracing, and alerting capabilities. The stack collects telemetry data from all services (svc-portal, svc-smarttab, mpac-pgw) and infrastructure components, enabling real-time visibility into system health, performance, and business metrics.

Stack Components:

  • Prometheus (v3.8) - Time-series metrics storage and querying
  • Loki (v3.6) - Log aggregation and indexing
  • Tempo (v2.9) - Distributed tracing backend
  • Grafana (v12.3) - Unified visualization and dashboards
  • Alloy (v1.12) - OTLP collector for metrics, logs, and traces

Deployment Models:

  • Local Development: Docker Compose
  • AWS Production: ECS Fargate with CloudFormation

Directory Structure

mpac-obs/
├── docker-compose.yml          # Local development stack
├── .env.example                # Environment variables template

├── alloy/                      # OTLP collector configuration
│   ├── config.alloy            # Main configuration file
│   └── README.md               # Alloy setup guide

├── prometheus/                 # Metrics storage configuration
│   ├── prometheus.yml          # Scrape configs and rules
│   ├── alerts.yml              # Alert rules
│   └── recording_rules.yml    # Recording rules for aggregations

├── loki/                       # Log aggregation configuration
│   ├── loki-config.yml         # Storage and ingestion config
│   └── README.md               # Loki setup guide

├── tempo/                      # Distributed tracing configuration
│   ├── tempo.yml               # Trace storage and query config
│   └── README.md               # Tempo setup guide

├── grafana/                    # Dashboards and provisioning
│   ├── provisioning/
│   │   ├── datasources/        # Auto-configure data sources
│   │   │   ├── prometheus.yml
│   │   │   ├── loki.yml
│   │   │   └── tempo.yml
│   │   └── dashboards/         # Auto-import dashboards
│   │       ├── dashboards.yml
│   │       ├── service-health.json
│   │       ├── payment-processing.json
│   │       ├── device-fleet.json
│   │       ├── database-performance.json
│   │       └── business-metrics.json
│   └── README.md               # Grafana setup guide

└── cloudformation/             # AWS deployment (ECS Fargate)
    ├── Makefile                # Deployment automation
    ├── parameters.json         # CloudFormation parameters
    ├── mpac-obs-stack.yml     # Main CloudFormation template
    ├── networking.yml          # VPC, subnets, security groups
    ├── ecs-cluster.yml         # ECS cluster and services
    ├── storage.yml             # EFS for persistent storage
    └── README.md               # AWS deployment guide

Local Development Setup

Prerequisites

  • Docker Desktop (or Docker Engine + Docker Compose)
  • 8GB+ RAM available for Docker
  • Ports available: 3000 (Grafana), 9090 (Prometheus), 4317/4318 (Alloy)

Quick Start

bash
cd mpac-obs

# Copy environment template
cp .env.example .env

# Start all services
docker compose up -d

# Verify services are running
docker compose ps

# View logs
docker compose logs -f

# Stop all services
docker compose down

# Stop and remove volumes (clean slate)
docker compose down -v

Service Endpoints (Local)

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin (change on first login) |
| Prometheus | http://localhost:9090 | - |
| Loki | http://localhost:3100 | - |
| Tempo | http://localhost:3200 | - |
| Alloy (OTLP) | grpc://localhost:4317, http://localhost:4318 | - |

Docker Compose Configuration

Services Defined:

yaml

services:
  # OTLP Collector
  alloy:
    image: grafana/alloy:1.12.0
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "12345:12345" # Alloy UI
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy
    command: run --server.http.listen-addr=0.0.0.0:12345 /etc/alloy/config.alloy
    networks:
      - mpac-obs

  # Metrics Storage
  prometheus:
    image: prom/prometheus:v3.8.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=14d'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'  # required so Alloy can remote-write
    networks:
      - mpac-obs

  # Log Aggregation
  loki:
    image: grafana/loki:3.6.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki-data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - mpac-obs

  # Distributed Tracing
  tempo:
    image: grafana/tempo:2.9.0
    ports:
      - "3200:3200"   # Tempo HTTP
      # 4317 is intentionally not published on the host: Alloy already binds
      # host port 4317 and forwards traces to tempo:4317 over the compose network
    volumes:
      - ./tempo/tempo.yml:/etc/tempo.yml
      - tempo-data:/tmp/tempo
    command: -config.file=/etc/tempo.yml
    networks:
      - mpac-obs

  # Visualization
  grafana:
    image: grafana/grafana:12.3.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    networks:
      - mpac-obs

volumes:
  prometheus-data:
  loki-data:
  tempo-data:
  grafana-data:

networks:
  mpac-obs:
    driver: bridge

Verifying Local Setup

bash
# Check Prometheus is scraping targets
curl http://localhost:9090/api/v1/targets

# Send test log to Loki
curl -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d '{"streams":[{"stream":{"service":"test"},"values":[["'$(date +%s)000000000'","test log message"]]}]}'

# Query Loki logs
curl -G http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query={service="test"}'

# Send test trace to Tempo (via Alloy OTLP endpoint)
# Use OpenTelemetry SDK from application services
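The SDK route is the realistic test, but for a quick end-to-end smoke test you can also hand-roll a minimal OTLP/JSON payload and POST it to Alloy's HTTP endpoint (`/v1/traces` on port 4318). A sketch — field names follow the OTLP/JSON encoding; the service and span names are placeholders:

```python
import json
import secrets
import time


def build_test_trace(service_name: str = "test") -> dict:
    """Build a minimal OTLP/JSON payload containing one span."""
    now_ns = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}}
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 bytes -> 32 hex chars
                    "spanId": secrets.token_hex(8),    # 8 bytes -> 16 hex chars
                    "name": "smoke-test-span",
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns - 1_000_000),
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }]
    }


if __name__ == "__main__":
    # Pipe the payload into curl, e.g.:
    #   python make_trace.py | curl -X POST http://localhost:4318/v1/traces \
    #     -H "Content-Type: application/json" -d @-
    print(json.dumps(build_test_trace()))
```

After posting, the trace should be queryable in Grafana's Tempo data source by the printed `traceId`.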

AWS Deployment

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       AWS Region                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐ │
│  │  ECS Cluster: mpac-obs-cluster                       │ │
│  │                                                       │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐           │ │
│  │  │  Alloy   │  │Prometheus│  │   Loki   │           │ │
│  │  │(Fargate) │  │(Fargate) │  │(Fargate) │           │ │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘           │ │
│  │       │             │             │                  │ │
│  │  ┌────▼─────────────▼─────────────▼──────┐          │ │
│  │  │         Tempo (Fargate)                │          │ │
│  │  └────────────────┬───────────────────────┘          │ │
│  │                   │                                  │ │
│  │  ┌────────────────▼───────────────────────┐          │ │
│  │  │         Grafana (Fargate)              │          │ │
│  │  └────────────────┬───────────────────────┘          │ │
│  └───────────────────┼───────────────────────────────────┘ │
│                      │                                     │
│  ┌───────────────────▼───────────────────────┐            │
│  │         Application Load Balancer          │            │
│  │  - grafana.obs.mpac-cloud.com               │            │
│  │  - alloy.obs.mpac-cloud.com:4317/4318       │            │
│  └───────────────────┬───────────────────────┘            │
│                      │                                     │
│  ┌───────────────────▼───────────────────────┐            │
│  │         Amazon EFS (Persistent Storage)    │            │
│  │  - Prometheus TSDB                         │            │
│  │  - Loki chunks                             │            │
│  │  - Tempo blocks                            │            │
│  └────────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────┘

CloudFormation Deployment

Prerequisites:

  • AWS CLI configured with appropriate credentials
  • Route 53 hosted zone for DNS (e.g., mpac-cloud.com)
  • ACM certificate for *.obs.mpac-cloud.com

Deployment Steps:

bash
cd mpac-obs/cloudformation

# 1. Configure parameters
cp parameters.json.example parameters.json
# Edit parameters.json with your AWS account details

# 2. Deploy full stack (networking + ECS + services)
make deploy-full PARENT_HOSTED_ZONE_ID=Z1234567890ABC

# 3. Check deployment status
make status

# 4. Show service endpoints
make show-endpoints

# 5. Update existing stack
make update

# 6. Delete stack (WARNING: destroys all data)
make delete-stack

CloudFormation Parameters:

json
{
  "Parameters": {
    "Environment": "production",
    "VpcCIDR": "10.20.0.0/16",
    "PublicSubnet1CIDR": "10.20.1.0/24",
    "PublicSubnet2CIDR": "10.20.2.0/24",
    "PrivateSubnet1CIDR": "10.20.10.0/24",
    "PrivateSubnet2CIDR": "10.20.11.0/24",
    "DomainName": "obs.mpac-cloud.com",
    "ParentHostedZoneId": "Z1234567890ABC",
    "CertificateArn": "arn:aws:acm:ap-southeast-1:123456789012:certificate/abc-def-123",
    "PrometheusRetentionDays": "14",
    "LokiRetentionDays": "14",
    "TempoRetentionDays": "7"
  }
}

Service Configuration:

| Service | Task Definition | Memory | CPU | Replicas |
|---|---|---|---|---|
| Alloy | alloy:1.12.0 | 2 GB | 1 vCPU | 2 |
| Prometheus | prometheus:3.8.0 | 4 GB | 2 vCPU | 2 |
| Loki | loki:3.6.0 | 4 GB | 2 vCPU | 2 |
| Tempo | tempo:2.9.0 | 4 GB | 2 vCPU | 2 |
| Grafana | grafana:12.3.0 | 2 GB | 1 vCPU | 2 |

AWS Endpoints (Production)

| Service | URL | Access |
|---|---|---|
| Grafana | https://grafana.obs.mpac-cloud.com | SSO / IAM auth |
| Alloy (OTLP) | grpc://alloy.obs.mpac-cloud.com:4317 | VPC internal |
| Prometheus | http://prometheus.obs.mpac-cloud.internal:9090 | VPC internal |
| Loki | http://loki.obs.mpac-cloud.internal:3100 | VPC internal |
| Tempo | http://tempo.obs.mpac-cloud.internal:3200 | VPC internal |

Service Integration

Application Configuration

Environment Variables for Services:

bash
# svc-portal, svc-smarttab, mpac-pgw
export OTLP_ENDPOINT="http://alloy.obs.mpac-cloud.internal:4318"  # HTTP
export OTLP_ENDPOINT_GRPC="http://alloy.obs.mpac-cloud.internal:4317"  # gRPC
export OTEL_SERVICE_NAME="svc-smarttab"
export OTEL_RESOURCE_ATTRIBUTES="environment=production,region=ap-southeast-1"

Service Auto-Discovery (AWS)

Alloy automatically discovers services from ECS task metadata:

hcl
// alloy/config.alloy
discovery.ecs "services" {
  region = "ap-southeast-1"

  // Discover tasks with prometheus.io/scrape=true label
  filter {
    name = "tag:prometheus.io/scrape"
    values = ["true"]
  }
}

// Scrape discovered services
prometheus.scrape "ecs_services" {
  targets    = discovery.ecs.services.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

ECS Task Definition Labels:

json
{
  "containerDefinitions": [{
    "name": "svc-smarttab",
    "dockerLabels": {
      "prometheus.io/scrape": "true",
      "prometheus.io/port": "8080",
      "prometheus.io/path": "/metrics"
    }
  }]
}

Instrumentation Examples

Python (svc-portal):

python
# requirements.txt
opentelemetry-api==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-instrumentation-fastapi==0.41b0
opentelemetry-exporter-otlp==1.20.0

# main.py
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTLP_ENDPOINT_GRPC", "http://localhost:4317"),
    insecure=True
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Auto-instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

# Expose Prometheus metrics
from prometheus_client import make_asgi_app
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Go (svc-smarttab, mpac-pgw):

go
// go.mod
require (
    go.opentelemetry.io/otel v1.20.0
    go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.20.0
    go.opentelemetry.io/otel/sdk v1.20.0
    github.com/prometheus/client_golang v1.17.0
)

// main.go
package main

import (
    "context"
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing() error {
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("alloy.obs.mpac-cloud.internal:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
    return nil
}

func main() {
    if err := initTracing(); err != nil {
        log.Fatalf("init tracing: %v", err)
    }

    // Expose Prometheus metrics
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Configuration Files

Alloy Configuration (OTLP Collector)

File: alloy/config.alloy

hcl
// OTLP Receiver for logs, metrics, traces
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  http {
    endpoint = "0.0.0.0:4318"
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch processor (improve performance)
otelcol.processor.batch "default" {
  timeout = "5s"
  send_batch_size = 1000

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Export metrics to Prometheus via remote write
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

// Export logs to Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Export traces to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    insecure = true
  }
}

Prometheus Configuration

File: prometheus/prometheus.yml

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'mpac-production'
    region: 'ap-southeast-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

# Load alert rules
rule_files:
  - 'alerts.yml'
  - 'recording_rules.yml'

# Scrape configurations
scrape_configs:
  # Self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Service discovery for ECS tasks (AWS)
  - job_name: 'ecs-services'
    ec2_sd_configs:
      - region: ap-southeast-1
        port: 8080
        filters:
          - name: tag:prometheus.io/scrape
            values: ['true']
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: service_name

  # Static targets (local development)
  - job_name: 'svc-portal'
    static_configs:
      - targets: ['svc-portal:8002']
        labels:
          service: 'svc-portal'

  - job_name: 'svc-smarttab'
    static_configs:
      - targets: ['svc-smarttab:8080']
        labels:
          service: 'svc-smarttab'

  - job_name: 'mpac-pgw'
    static_configs:
      - targets: ['mpac-pgw:8080']
        labels:
          service: 'mpac-pgw'

Loki Configuration

File: loki/loki-config.yml

yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 0.0.0.0
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7 days
  retention_period: 336h  # 14 days

Tempo Configuration

File: tempo/tempo.yml

yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h  # 7 days

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    wal:
      path: /tmp/tempo/wal
    pool:
      max_workers: 100
      queue_depth: 10000

Access & Credentials

Local Development

| Service | Default Credentials | Change Method |
|---|---|---|
| Grafana | admin / admin | Change on first login |
| Prometheus | No auth | - |
| Loki | No auth | - |
| Tempo | No auth | - |

AWS Production

| Service | Authentication | Authorization |
|---|---|---|
| Grafana | AWS SSO / SAML | Role-based (Viewer, Editor, Admin) |
| Internal Services | VPC security groups | Network-level isolation |

Grafana SSO Configuration (AWS):

bash
# Environment variables for Grafana container
GF_AUTH_GENERIC_OAUTH_ENABLED=true
GF_AUTH_GENERIC_OAUTH_NAME=AWS SSO
GF_AUTH_GENERIC_OAUTH_CLIENT_ID=${SSO_CLIENT_ID}
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET=${SSO_CLIENT_SECRET}
GF_AUTH_GENERIC_OAUTH_SCOPES=openid profile email
GF_AUTH_GENERIC_OAUTH_AUTH_URL=https://portal.sso.ap-southeast-1.amazonaws.com/oauth2/authorize
GF_AUTH_GENERIC_OAUTH_TOKEN_URL=https://portal.sso.ap-southeast-1.amazonaws.com/oauth2/token
GF_AUTH_GENERIC_OAUTH_API_URL=https://portal.sso.ap-southeast-1.amazonaws.com/oauth2/userInfo

Data Retention

| Component | Default Retention | Configurable | Storage Location |
|---|---|---|---|
| Prometheus | 14 days | Yes (via --storage.tsdb.retention.time) | EFS: /mnt/prometheus |
| Loki | 14 days | Yes (retention_period in config) | EFS: /mnt/loki/chunks |
| Tempo | 7 days | Yes (block_retention in config) | EFS: /mnt/tempo/blocks |
| Grafana Dashboards | Permanent | - | EFS: /var/lib/grafana |

Adjusting Retention (AWS):

Edit parameters.json, e.g. to increase Prometheus and Loki to 30 days:

json
{
  "PrometheusRetentionDays": "30",
  "LokiRetentionDays": "30",
  "TempoRetentionDays": "14"
}

Then deploy the update:

bash
make update

Estimated Storage Requirements:

| Retention Period | Prometheus | Loki | Tempo | Total |
|---|---|---|---|---|
| 7 days | 50 GB | 100 GB | 200 GB | 350 GB |
| 14 days | 100 GB | 200 GB | 400 GB | 700 GB |
| 30 days | 200 GB | 400 GB | 800 GB | 1.4 TB |
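The Prometheus column can be sanity-checked against the usual rule of thumb: disk ≈ ingested samples/sec × retention seconds × bytes per sample (roughly 1-2 bytes after compression). A back-of-envelope helper — the example ingestion rate is an assumption, not a measured platform value:

```python
def prometheus_disk_gb(samples_per_sec: float,
                       retention_days: int,
                       bytes_per_sample: float = 1.5) -> float:
    """Approximate Prometheus TSDB disk usage in GB.

    bytes_per_sample defaults to 1.5, the middle of the commonly
    quoted 1-2 bytes/sample post-compression range.
    """
    retention_seconds = retention_days * 24 * 3600
    return samples_per_sec * retention_seconds * bytes_per_sample / 1e9


# e.g. a hypothetical ~40k samples/sec at 14-day retention:
# prometheus_disk_gb(40_000, 14) -> ~72.6 GB
```

Loki and Tempo sizing depends heavily on log line length and span volume, so the table's figures there are best treated as planning placeholders until real ingest rates are known.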

Performance Considerations

Resource Sizing

Local Development (Docker):

  • Minimum: 8 GB RAM, 4 CPU cores
  • Recommended: 16 GB RAM, 8 CPU cores
  • Storage: 50 GB SSD

AWS Production (Per Service):

| Service | Memory | CPU | Storage (EFS) | Cost/Month |
|---|---|---|---|---|
| Alloy | 2 GB | 1 vCPU | - | ~$30 |
| Prometheus | 4 GB | 2 vCPU | 100 GB | ~$90 |
| Loki | 4 GB | 2 vCPU | 200 GB | ~$110 |
| Tempo | 4 GB | 2 vCPU | 400 GB | ~$150 |
| Grafana | 2 GB | 1 vCPU | 10 GB | ~$35 |
| Total | 16 GB | 8 vCPU | 710 GB | ~$415/month |

Query Performance

Prometheus Query Best Practices:

promql
# Good: Specific time range
rate(http_requests_total[5m])

# Bad: Large time range (slow)
rate(http_requests_total[1d])

# Good: Pre-aggregated recording rules
job:http_requests:rate5m

# Good: Limit cardinality
sum by (service, status) (http_requests_total)
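The pre-aggregated `job:http_requests:rate5m` series referenced above would be defined in prometheus/recording_rules.yml. A sketch of such a rule — the underlying metric and label names are assumptions about the services' instrumentation:

```yaml
# prometheus/recording_rules.yml (sketch)
groups:
  - name: http_aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query the cheap pre-computed series instead of re-evaluating the `rate()` over raw samples on every refresh.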

Loki Query Best Practices:

logql
# Good: Use labels for filtering
{service="svc-smarttab", level="error"} |= "payment failed"

# Bad: Full text search without labels (very slow)
{} |= "payment failed"

# Good: Time-bounded metric queries (a bare range selector is only
# valid inside a metric function)
count_over_time({service="svc-smarttab"}[5m])

Scaling Considerations

Horizontal Scaling (AWS):

  • Prometheus: Use federation or remote write to multiple instances
  • Loki: Deploy multiple ingesters with consistent hashing
  • Tempo: Scale distributors independently from ingesters
  • Grafana: Use RDS PostgreSQL for dashboard/user storage
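For the Prometheus bullet, remote write to a second instance is a one-stanza addition to prometheus.yml. A sketch — the target URL is illustrative, and the receiving Prometheus must run with `--web.enable-remote-write-receiver`:

```yaml
# prometheus.yml (sketch): fan samples out to a second instance
remote_write:
  - url: http://prometheus-2:9090/api/v1/write
    queue_config:
      max_samples_per_send: 2000
```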

Vertical Scaling:

  • Increase ECS task memory/CPU when query latency increases
  • Monitor memory usage: scale up at 80% utilization
  • Monitor CPU usage: scale up at 70% utilization
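The 80% memory threshold above can be encoded as a rule in prometheus/alerts.yml so scaling decisions are alert-driven rather than manual. A hedged sketch — `container_memory_*` assumes a cAdvisor-style exporter is scraping the tasks, so treat the metric names as placeholders for whatever the cluster actually exports:

```yaml
# prometheus/alerts.yml (sketch)
groups:
  - name: capacity
    rules:
      - alert: HighMemoryUtilization
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container above 80% of its memory limit for 10 minutes"
```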


MPAC — MP-Solution Advanced Cloud Service