Day-to-day operational procedures, monitoring, alerting, backup, and disaster recovery.
All services expose metrics via Prometheus-compatible /metrics endpoints.
| Metric | Type | Target | Alert Threshold |
|---|---|---|---|
| API latency (p50) | Histogram | < 50 ms | > 100 ms |
| API latency (p99) | Histogram | < 500 ms | > 2,000 ms |
| API error rate (5xx) | Counter | < 0.1% | > 1% |
| DB query time (p95) | Histogram | < 50 ms | > 200 ms |
| DB connection pool utilization | Gauge | < 70% | > 85% |
| WebSocket active connections | Gauge | — | > 5,000 |
| Redis memory usage | Gauge | < 70% | > 85% |
| Redis hit rate | Gauge | > 90% | < 80% |
| Redis operations per second | Counter | — | > 40,000 |
| AI API response time (p95) | Histogram | < 5,000 ms | > 15,000 ms |
| AI API daily cost (USD) | Gauge | < $50/day | > $100/day |
| AI API error rate | Counter | < 1% | > 5% |
| Node.js heap usage | Gauge | < 70% | > 85% |
| CPU utilization (per pod) | Gauge | < 60% | > 80% |
| Memory utilization (per pod) | Gauge | < 70% | > 85% |
| Disk I/O (PostgreSQL) | Counter | — | > 80% of capacity |
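Most thresholds above fire when a value rises above its limit, but a few (e.g. Redis hit rate) fire when it falls below target. A minimal sketch of that two-direction check, with illustrative names that are not taken from the codebase:

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    """Alert threshold with a direction: 'above' fires when the sample
    exceeds the limit, 'below' when it drops under it."""
    limit: float
    direction: str = "above"

    def breached(self, value: float) -> bool:
        if self.direction == "above":
            return value > self.limit
        return value < self.limit

# Examples mirroring the table: p99 latency > 2,000 ms, hit rate < 80%
p99_latency_ms = Threshold(2000)
redis_hit_rate = Threshold(0.80, direction="below")
```

The same shape covers any row in the table: pick the limit and direction, then feed in the current sample.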
Recommended dashboards (import IDs for Grafana):
| Dashboard | Grafana ID | Description |
|---|---|---|
| Node.js Application | 11159 | Process metrics, event loop lag |
| PostgreSQL Overview | 9628 | Query performance, connections |
| Redis Overview | 11835 | Memory, hit rate, ops/sec |
| Nginx Overview | 12708 | Requests, upstream latency |
| Kubernetes Cluster | 6417 | Pod resources, node health |
```yaml
# prometheus/rules/api-alerts.yml
groups:
  - name: skillquest-api
    rules:
      - alert: HighAPILatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="api"}[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API p99 latency exceeds 2s"
          description: "API p99 latency is {{ $value }}s on {{ $labels.instance }}"
      - alert: HighAPIErrorRate
        expr: rate(http_requests_total{service="api", status=~"5.."}[5m]) / rate(http_requests_total{service="api"}[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate exceeds 1%"
          description: "Error rate is {{ $value }} on {{ $labels.instance }}"
      - alert: APIDown
        expr: up{job="skillquest-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
```
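The `HighAPIErrorRate` expression is a ratio of two counter rates over the same window. The same arithmetic can be sketched offline as a sanity check (the request counts below are hypothetical):

```python
def error_rate(delta_5xx: float, delta_total: float) -> float:
    """Fraction of requests in a window that returned 5xx.
    Mirrors rate(5xx)/rate(total); both deltas cover the same window."""
    if delta_total == 0:
        return 0.0  # no traffic: treat as zero error rate
    return delta_5xx / delta_total

# 120 errors out of 10,000 requests in 5 minutes -> 1.2%,
# which is above the 1% alert threshold
```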
```yaml
# prometheus/rules/db-alerts.yml
groups:
  - name: skillquest-database
    rules:
      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DB connection pool above 85%"
          description: "Active connections: {{ $value }} of max"
      - alert: SlowQueries
        expr: pg_stat_activity_max_tx_duration{state="active"} > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Long-running query detected (>30s)"
      - alert: DatabaseDiskSpaceLow
        expr: pg_database_size_bytes / node_filesystem_size_bytes * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database disk usage above 80%"
```
```yaml
# prometheus/rules/redis-alerts.yml
groups:
  - name: skillquest-redis
    rules:
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 85%"
      - alert: RedisHitRateLow
        expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) < 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis hit rate below 80%"
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is unreachable"
```
```yaml
# prometheus/rules/ai-alerts.yml
groups:
  - name: skillquest-ai
    rules:
      - alert: AIHighLatency
        expr: histogram_quantile(0.95, rate(ai_request_duration_seconds_bucket[5m])) > 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI API p95 latency exceeds 15s"
      - alert: AIHighCost
        expr: sum(increase(ai_api_cost_usd_total[24h])) > 100
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "AI API daily cost exceeds $100"
          description: "24h rolling cost: ${{ $value }}"
      - alert: AIRateLimited
        expr: rate(ai_api_rate_limited_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI API is being rate-limited"
```
All services emit structured JSON logs to stdout for container-native collection.
```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "info",
  "service": "api",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "method": "POST",
  "path": "/api/v1/courses",
  "statusCode": 201,
  "duration_ms": 45,
  "userId": "usr_abc123",
  "tenantId": "tenant_001",
  "message": "Course created successfully"
}
```
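A minimal emitter for lines of this shape, sketched in Python for illustration (the services themselves are Node.js, and this `log` helper is an assumption, not the actual logger):

```python
import json
import sys
from datetime import datetime, timezone

def log(level: str, service: str, message: str, **fields) -> str:
    """Render and print one structured JSON log line to stdout;
    extra fields (traceId, statusCode, ...) are merged in."""
    record = {
        "timestamp": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds")
            .replace("+00:00", "Z"),
        "level": level,
        "service": service,
        "message": message,
        **fields,
    }
    line = json.dumps(record)
    print(line, file=sys.stdout)
    return line
```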
| Level | Usage |
|---|---|
| `error` | Unrecoverable errors, exceptions, failures |
| `warn` | Degraded functionality, approaching limits |
| `info` | Business events, request completions |
| `debug` | Detailed diagnostic info (disabled in production) |
```yaml
# filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "skillquest-api-%{+yyyy.MM.dd}"
      when.contains:
        container.labels.service: "api"
    - index: "skillquest-web-%{+yyyy.MM.dd}"
      when.contains:
        container.labels.service: "web"
    - index: "skillquest-ai-%{+yyyy.MM.dd}"
      when.contains:
        container.labels.service: "ai"
```
```yaml
# promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /var/run/promtail/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: ["localhost"]
        labels:
          job: skillquest
          __path__: /var/lib/docker/containers/*/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:
          level:
          service:
```
```
# Kibana KQL: find all errors in the last hour
level: "error" AND @timestamp >= now-1h

# Kibana KQL: slow API requests (>1s)
service: "api" AND duration_ms > 1000

# Loki LogQL: error rate by service
sum(rate({job="skillquest"} | json | level="error" [5m])) by (service)

# Loki LogQL: trace a specific request
{job="skillquest"} | json | traceId="abc123def456"
```
| Method | Frequency | Retention | Description |
|---|---|---|---|
| `pg_dump` (full) | Daily 02:00 UTC | 30 days | Logical full backup |
| WAL archiving | Continuous | 7 days | Point-in-time recovery (PITR) |
| pg_basebackup | Weekly | 4 weeks | Physical full backup |
```bash
#!/bin/bash
# scripts/backup-postgres.sh
set -euo pipefail

BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/skillquest_${TIMESTAMP}.dump"

# Full compressed backup (custom format, restorable with pg_restore)
pg_dump -Fc -h localhost -U skillquest_admin skillquest > "${BACKUP_FILE}"

# Verify backup integrity by listing its table of contents
if pg_restore --list "${BACKUP_FILE}" > /dev/null 2>&1; then
  echo "Backup verified: ${BACKUP_FILE}"
else
  echo "ERROR: Backup verification failed!" >&2
  exit 1
fi

# Remove backups older than 30 days
find "${BACKUP_DIR}" -name "*.dump" -mtime +30 -delete

# Upload to object storage
aws s3 cp "${BACKUP_FILE}" s3://skillquest-backups/postgres/
```
```
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://skillquest-backups/wal/%f'
archive_timeout = 300
```
| Method | Frequency | Retention | Description |
|---|---|---|---|
| RDB | Every 15 minutes | 7 days | Point-in-time snapshot |
| AOF | Continuous (fsync) | 3 days | Append-only file |
```
# redis.conf
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec
```
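Each `save <seconds> <changes>` line triggers an RDB snapshot once at least that many writes have accumulated within the window. A sketch of the trigger logic under that reading (helper name is illustrative):

```python
# (seconds, min_changes) pairs from the redis.conf above
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]

def should_snapshot(elapsed_s: int, changes: int) -> bool:
    """True if any save rule is satisfied: enough elapsed time
    AND enough accumulated writes for that rule."""
    return any(elapsed_s >= seconds and changes >= min_changes
               for seconds, min_changes in SAVE_RULES)
```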
| Metric | Target | Description |
|---|---|---|
| RTO | 4 hours | Maximum time to restore full service |
| RPO | 1 hour | Maximum acceptable data loss window |
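With WAL archived at least every `archive_timeout = 300` seconds and RDB snapshots every 15 minutes, both datastores sit comfortably inside the 1-hour RPO. A toy check of that reasoning (a simplification that equates worst-case loss with the backup interval):

```python
RPO_SECONDS = 3600  # 1 hour, from the table above

def worst_case_loss_s(backup_interval_s: int) -> int:
    """Simplified model: worst-case data loss is the gap between
    consecutive backups or WAL archives."""
    return backup_interval_s

def within_rpo(backup_interval_s: int) -> bool:
    return worst_case_loss_s(backup_interval_s) <= RPO_SECONDS

# PostgreSQL WAL: every 300 s; Redis RDB: every 15 min (900 s)
```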
Trigger: `DatabaseDown` alert or health check failure.

```bash
# Promote the standby to primary
pg_ctl promote -D /var/lib/postgresql/data

# Then update DNS/connection strings to point at the new primary
```
```bash
# Stop the application to prevent writes during the restore
docker compose stop api web ai

# Restore from the latest pg_dump
pg_restore -d skillquest /backups/postgres/latest.dump

# For point-in-time recovery instead, replay WAL: configure restore_command
# (recovery.conf on PostgreSQL <= 11; postgresql.conf plus a recovery.signal
# file on 12+)

# Restart services
docker compose start api web ai
```
Symptoms: API returns 503 errors, logs show “connection pool exhausted”.
Diagnosis:
```bash
# Check active connections
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Find long-running queries
psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
         FROM pg_stat_activity
         WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds'
         ORDER BY duration DESC;"
```
Resolution:

```bash
# Terminate queries that have been active for more than 5 minutes
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
         WHERE state = 'active' AND now() - query_start > interval '5 minutes';"
```
Symptoms: Redis returns OOM errors, cache writes fail, application performance degrades.
Diagnosis:
```bash
# Check memory usage
redis-cli INFO memory

# Find large keys
redis-cli --bigkeys

# Check eviction policy
redis-cli CONFIG GET maxmemory-policy
```
Resolution:

```bash
# Delete stale cache keys (SCAN avoids blocking the server, unlike KEYS)
redis-cli --scan --pattern "cache:leaderboard:*" | xargs -r redis-cli DEL
```

- Raise the `maxmemory` limit or scale Redis.
- Switch to the `allkeys-lru` eviction policy.

Symptoms: AI-powered features return errors or timeouts; logs show 429 status codes from OpenAI.
Diagnosis:
```bash
# Check rate limit headers in recent responses
grep "x-ratelimit" /var/log/skillquest/ai-service.log | tail -20

# Check current queue depth
redis-cli LLEN ai:generation:queue
```
Resolution:
```bash
# Update environment variable
AI_MAX_CONCURRENT=3
```
Fall back to a cheaper model (`gpt-4o-mini`) for non-critical requests.

| Activity | Schedule | Duration | Impact |
|---|---|---|---|
| Database vacuum/analyze | Daily 03:00 UTC | ~10 min | None |
| Database migrations | As needed (announced) | 5–30 min | Brief downtime |
| Certificate renewal | Auto (Let’s Encrypt) | < 1 min | None |
| OS/security patches | Weekly Sunday 04:00 | ~15 min | Rolling update |
| Major version upgrades | Quarterly (announced) | 1–2 hours | Planned outage |
```bash
# Enable maintenance mode (returns 503 to all non-health requests)
kubectl -n skillquest set env deployment/skillquest-api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl -n skillquest set env deployment/skillquest-api MAINTENANCE_MODE=false
```
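The gate itself can be sketched as a tiny request filter: while `MAINTENANCE_MODE` is set, everything except health probes gets a 503. The probe paths below are assumptions, not confirmed routes:

```python
import os

HEALTH_PATHS = {"/healthz", "/readyz"}  # assumed probe routes

def gate(path: str) -> int:
    """Status code the maintenance gate would send: 503 for normal
    traffic while MAINTENANCE_MODE=true, 200 otherwise."""
    maintenance = os.environ.get("MAINTENANCE_MODE", "false").lower() == "true"
    if maintenance and path not in HEALTH_PATHS:
        return 503
    return 200
```

Keeping health endpoints open is what lets Kubernetes probes keep the pods "ready" during maintenance instead of restarting them.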
| Level | Response Time | Contact | Scope |
|---|---|---|---|
| P1 | 15 min | On-call engineer | Service down, data loss risk |
| P2 | 1 hour | On-call engineer | Degraded performance |
| P3 | 4 hours | Team lead | Non-critical issues |
| P4 | Next business day | Any engineer | Minor bugs, improvements |