SkillQuest Operations Manual

Day-to-day operational procedures, monitoring, alerting, backup, and disaster recovery.

Monitoring Metrics
Prometheus Alerting Rules
Log Collection and Analysis
Backup Strategy
Disaster Recovery Procedures
Common Issue Runbooks
Maintenance Windows
On-Call Procedures

Monitoring Metrics

Key Performance Indicators (KPIs)

All services expose metrics via Prometheus-compatible /metrics endpoints.

Metric	Type	Target	Alert Threshold
API latency (p50)	Histogram	< 50 ms	> 100 ms
API latency (p99)	Histogram	< 500 ms	> 2,000 ms
API error rate (5xx)	Counter	< 0.1%	> 1%
DB query time (p95)	Histogram	< 50 ms	> 200 ms
DB connection pool utilization	Gauge	< 70%	> 85%
WebSocket active connections	Gauge	—	> 5,000
Redis memory usage	Gauge	< 70%	> 85%
Redis hit rate	Gauge	> 90%	< 80%
Redis operations per second	Counter	—	> 40,000
AI API response time (p95)	Histogram	< 5,000 ms	> 15,000 ms
AI API daily cost (USD)	Counter	< $50/day	> $100/day
AI API error rate	Counter	< 1%	> 5%
Node.js heap usage	Gauge	< 70%	> 85%
CPU utilization (per pod)	Gauge	< 60%	> 80%
Memory utilization (per pod)	Gauge	< 70%	> 85%
Disk I/O (PostgreSQL)	Counter	—	> 80% of capacity
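
Each row of the table defines three zones: within target, between target and alert threshold, and past the alert threshold. The following is a minimal sketch of that reading, not part of the SkillQuest tooling; the function name `kpi_status` and the zone label "watch" are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# kpi_status VALUE TARGET ALERT
# Prints "ok", "watch", or "alert" for one row of the KPI table.
# Direction is inferred from the thresholds: if ALERT > TARGET, higher values
# are worse (latency, utilization); otherwise lower values are worse (hit rate).
kpi_status() {
  awk -v v="$1" -v t="$2" -v a="$3" 'BEGIN {
    if (a > t) {                 # upper-bound metric, e.g. API p99 latency
      if (v > a)      print "alert"
      else if (v < t) print "ok"
      else            print "watch"
    } else {                     # lower-bound metric, e.g. Redis hit rate
      if (v < a)      print "alert"
      else if (v > t) print "ok"
      else            print "watch"
    }
  }'
}
```

For example, `kpi_status 75 50 100` (a p50 latency of 75 ms) prints `watch`, and `kpi_status 75 90 80` (a Redis hit rate of 75%) prints `alert`.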

Grafana Dashboards

Recommended dashboards (import IDs for Grafana):

Dashboard	Grafana ID	Description
Node.js Application	11159	Process metrics, event loop lag
PostgreSQL Overview	9628	Query performance, connections
Redis Overview	11835	Memory, hit rate, ops/sec
Nginx Overview	12708	Requests, upstream latency
Kubernetes Cluster	6417	Pod resources, node health

Prometheus Alerting Rules

API Service Alerts

# prometheus/rules/api-alerts.yml
groups:
  - name: skillquest-api
    rules:
      - alert: HighAPILatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="api"}[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API p99 latency exceeds 2s"
          description: "API p99 latency is {{ $value }}s on {{ $labels.instance }}"

      - alert: HighAPIErrorRate
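        # Aggregate both sides per instance; dividing the raw series would match each
        # 5xx series only against itself and always yield a ratio of 1.
        expr: sum by (instance) (rate(http_requests_total{service="api", status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{service="api"}[5m])) > 0.01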
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate exceeds 1%"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      - alert: APIDown
        expr: up{job="skillquest-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

Database Alerts

# prometheus/rules/db-alerts.yml
groups:
  - name: skillquest-database
    rules:
      - alert: DatabaseConnectionPoolExhausted
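        # pg_stat_activity_count is reported per database and state, so sum it per
        # instance before comparing against max_connections.
        expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.85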
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DB connection pool above 85%"
          description: "Active connections: {{ $value | humanizePercentage }} of max"

      - alert: SlowQueries
        expr: pg_stat_activity_max_tx_duration{state="active"} > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Long-running query detected (>30s)"

      - alert: DatabaseDiskSpaceLow
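        # The two metrics come from different exporters, so their label sets never
        # match one-to-one. Aggregate both sides. The mountpoint is an assumption
        # and must be changed to match the volume that holds the data directory.
        expr: sum(pg_database_size_bytes) / scalar(sum(node_filesystem_size_bytes{mountpoint="/var/lib/postgresql/data"})) * 100 > 80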
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database disk usage above 80%"

Redis Alerts

# prometheus/rules/redis-alerts.yml
groups:
  - name: skillquest-redis
    rules:
      - alert: RedisMemoryHigh
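        # Skip instances without a maxmemory limit (max = 0), which would otherwise divide by zero
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and redis_memory_max_bytes > 0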
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 85%"

      - alert: RedisHitRateLow
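        # Use rates so the alert reflects the recent hit ratio, not the lifetime average since startup
        expr: rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.80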
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis hit rate below 80%"

      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is unreachable"

AI Service Alerts

# prometheus/rules/ai-alerts.yml
groups:
  - name: skillquest-ai
    rules:
      - alert: AIHighLatency
        expr: histogram_quantile(0.95, rate(ai_request_duration_seconds_bucket[5m])) > 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI API p95 latency exceeds 15s"

      - alert: AIHighCost
        expr: sum(increase(ai_api_cost_usd_total[24h])) > 100
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "AI API daily cost exceeds $100"
          description: "24h rolling cost: ${{ $value }}"

      - alert: AIRateLimited
        expr: rate(ai_api_rate_limited_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI API is being rate-limited"
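
Rule files such as the ones above can be validated with `promtool check rules` before Prometheus is reloaded. The wrapper below is a sketch only: the function name `check_rules` and the `PROMTOOL` override variable are assumptions for illustration, and the directory layout follows the file paths shown in the rule headers.

```bash
#!/usr/bin/env bash
# check_rules DIR: run "promtool check rules" on every *.yml file in DIR.
# PROMTOOL is a hypothetical override for the promtool binary path.
check_rules() {
  local dir="$1" tool="${PROMTOOL:-promtool}" failed=0 f
  for f in "$dir"/*.yml; do
    [ -e "$f" ] || continue   # skip the literal pattern when DIR has no .yml files
    if ! "$tool" check rules "$f"; then
      echo "Invalid rule file: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A typical call would be `check_rules prometheus/rules`, which returns non-zero if any file fails validation, so it can gate a deployment step.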

Log Collection and Analysis

Log Format

All services emit structured JSON logs to stdout for container-native collection.

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "info",
  "service": "api",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "method": "POST",
  "path": "/api/v1/courses",
  "statusCode": 201,
  "duration_ms": 45,
  "userId": "usr_abc123",
  "tenantId": "tenant_001",
  "message": "Course created successfully"
}
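
Operational scripts (for example the backup script) can emit the same structure so that their output lands in the same indices. The helper below is a minimal sketch, assuming only the `timestamp`, `level`, `service`, and `message` fields are needed; the function name `log_json` is hypothetical, and it escapes only backslashes and double quotes, not control characters.

```bash
#!/usr/bin/env bash
# log_json LEVEL SERVICE MESSAGE
# Writes one JSON log line to stdout using the field names from the log format above.
log_json() {
  local level="$1" service="$2" message="$3" ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # second precision, UTC
  # Escape backslashes first, then double quotes (newlines are not handled)
  message=${message//\\/\\\\}
  message=${message//\"/\\\"}
  printf '{"timestamp":"%s","level":"%s","service":"%s","message":"%s"}\n' \
    "$ts" "$level" "$service" "$message"
}
```

Usage: `log_json info backup "Backup verified"`.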

Log Levels

Level	Usage
`error`	Unrecoverable errors, exceptions, failures
`warn`	Degraded functionality, approaching limits
`info`	Business events, request completions
`debug`	Detailed diagnostic info (disabled in production)

ELK Stack Configuration

# filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "skillquest-api-%{+yyyy.MM.dd}"
      when.contains:
        container.labels.service: "api"
    - index: "skillquest-web-%{+yyyy.MM.dd}"
      when.contains:
        container.labels.service: "web"
    - index: "skillquest-ai-%{+yyyy.MM.dd}"
      when.contains:
        container.labels.service: "ai"

Grafana Loki Alternative

# promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /var/run/promtail/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: ["localhost"]
        labels:
          job: skillquest
          __path__: /var/lib/docker/containers/*/*.log
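    pipeline_stages:
      # Unwrap Docker's json-file envelope so the next stage sees the application's JSON line
      - docker: {}
      - json: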
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:
          level:
          service:

Useful Log Queries

# Kibana KQL: Find all errors in the last hour
level: "error" AND @timestamp >= now-1h

# Kibana KQL: Slow API requests (>1s)
service: "api" AND duration_ms > 1000

# Loki LogQL: Error rate by service
sum(rate({job="skillquest"} | json | level="error" [5m])) by (service)

# Loki LogQL: Trace a specific request
{job="skillquest"} | json | traceId="abc123def456"

Backup Strategy

PostgreSQL Backup

Method	Frequency	Retention	Description
`pg_dump` (full)	Daily 02:00 UTC	30 days	Logical full backup
WAL archiving	Continuous	7 days	Point-in-time recovery (PITR)
pg_basebackup	Weekly	4 weeks	Physical full backup

Automated pg_dump Script

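#!/bin/bash
# scripts/backup-postgres.sh
# Abort on errors, unset variables, and failed pipeline stages
set -euo pipefail

BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/skillquest_${TIMESTAMP}.dump"

mkdir -p "${BACKUP_DIR}"

# Full compressed backup (custom format); remove the partial file if pg_dump fails
if ! pg_dump -Fc -h localhost -U skillquest_admin -f "${BACKUP_FILE}" skillquest; then
  echo "ERROR: pg_dump failed!" >&2
  rm -f "${BACKUP_FILE}"
  exit 1
fi

# Verify backup integrity by reading the archive's table of contents
if pg_restore --list "${BACKUP_FILE}" > /dev/null 2>&1; then
  echo "Backup verified: ${BACKUP_FILE}"
else
  echo "ERROR: Backup verification failed!" >&2
  exit 1
fi

# Upload to object storage before pruning, so a failed upload never reduces the number of copies
aws s3 cp "${BACKUP_FILE}" s3://skillquest-backups/postgres/

# Point latest.dump at the newest verified backup (the path used by the restore procedure)
ln -sfn "${BACKUP_FILE}" "${BACKUP_DIR}/latest.dump"

# Remove local backups older than 30 days
find "${BACKUP_DIR}" -name "skillquest_*.dump" -mtime +30 -delete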

WAL Archiving Configuration

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://skillquest-backups/wal/%f'
archive_timeout = 300

Redis Backup

Method	Frequency	Retention	Description
RDB	Every 15 minutes	7 days	Point-in-time snapshot
AOF	Continuous (fsync)	3 days	Append-only file

# redis.conf
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec

Recovery Objectives

Metric	Target	Description
RTO	4 hours	Maximum time to restore full service
RPO	1 hour	Maximum acceptable data loss window
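
The daily `pg_dump` alone cannot satisfy a 1-hour RPO; that target depends on WAL archiving, where `archive_timeout = 300` bounds the unarchived window to roughly five minutes as long as `archive_command` keeps succeeding. A freshness check like the sketch below can confirm that the newest backup or WAL file is recent enough. The function name `is_fresh` is an assumption, and `stat -c %Y` requires GNU coreutils.

```bash
#!/usr/bin/env bash
# is_fresh FILE MAX_AGE_SECONDS
# Succeeds if FILE was modified within the last MAX_AGE_SECONDS.
# Returns 2 if the file cannot be read.
is_fresh() {
  local file="$1" max_age="$2" mtime now
  mtime=$(stat -c %Y "$file" 2>/dev/null) || return 2   # GNU stat: mtime as epoch seconds
  now=$(date +%s)
  [ $(( now - mtime )) -le "$max_age" ]
}
```

For example, `is_fresh /backups/postgres/latest.dump 90000` checks that the last dump is at most 25 hours old, and a 3600-second limit on the newest archived WAL segment corresponds to the RPO above.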

Disaster Recovery Procedures

Scenario 1: Database Failure

Detect: Automated alert from DatabaseDown or health check failure.
Assess: Check PostgreSQL logs and connection status.

Failover (if using replication):

# Promote standby to primary
pg_ctl promote -D /var/lib/postgresql/data
# Update DNS/connection strings to point to new primary

Restore from backup (if no standby):

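# Stop the services that write to the database
docker compose stop api web ai

# Option A: restore the latest logical backup (data as of the last pg_dump run)
pg_restore --clean --if-exists -d skillquest /backups/postgres/latest.dump

# Option B: point-in-time recovery. WAL can only be replayed on top of a
# physical base backup (pg_basebackup), not on top of a pg_dump restore.
# Restore the base backup into the data directory, set restore_command in
# postgresql.conf, and create recovery.signal (PostgreSQL 12+; older
# versions use recovery.conf instead).

# Restart services
docker compose start api web ai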

Verify: Run health checks and spot-check recent data.
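
After a failover or restore, services may need some time before health checks pass. A small retry helper can poll until they do. This is a sketch: the function name `retry` and the health URL in the usage line are assumptions, not documented SkillQuest endpoints.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

Usage, with a hypothetical health endpoint: `retry 30 2 curl -fsS http://localhost:3000/health`.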

Scenario 2: Complete Infrastructure Loss

1. Provision new infrastructure (Terraform or manual).
2. Restore PostgreSQL from S3 backup.
3. Restore Redis (or allow cache warming from DB).
4. Deploy application via Helm or Docker Compose.
5. Update DNS records.
6. Verify all health checks and run smoke tests.

Scenario 3: Data Corruption

1. Identify the corruption timestamp from audit logs.
2. Stop write traffic (enable maintenance mode).
3. Restore to the point in time just before the corruption using WAL PITR.
4. Verify data integrity.
5. Resume traffic.
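
For the PITR step, PostgreSQL's `recovery_target_time` setting needs a timestamp slightly before the corruption. The helper below is a sketch that subtracts a safety margin from the timestamp found in the audit logs. The function name `pitr_target` and the 60-second default margin are assumptions, and `date -d` requires GNU coreutils.

```bash
#!/usr/bin/env bash
# pitr_target CORRUPTED_AT [MARGIN_SECONDS]
# Prints a UTC timestamp MARGIN_SECONDS (default 60) before CORRUPTED_AT,
# in a form accepted by recovery_target_time.
pitr_target() {
  local corrupted_at="$1" margin="${2:-60}" epoch
  epoch=$(date -u -d "$corrupted_at" +%s) || return 1   # parse the input as epoch seconds
  date -u -d "@$(( epoch - margin ))" '+%Y-%m-%d %H:%M:%S+00'
}
```

Usage: `echo "recovery_target_time = '$(pitr_target '2024-01-15 10:30:00 UTC')'"` produces a line that can be placed in postgresql.conf before starting recovery.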

Common Issue Runbooks

Runbook 1: Database Connection Pool Exhaustion

Symptoms: API returns 503 errors, logs show “connection pool exhausted”.

Diagnosis:

# Check active connections
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Find long-running queries
psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
         FROM pg_stat_activity
         WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds'
         ORDER BY duration DESC;"

Resolution:

Kill long-running queries:

psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
         WHERE state = 'active' AND now() - query_start > interval '5 minutes';"

Increase pool size temporarily in environment variables.
Restart API pods to release stale connections.
Investigate and fix the root cause (missing indexes, N+1 queries).

Runbook 2: Redis Out-of-Memory (OOM)

Symptoms: Redis returns OOM errors, cache writes fail, application performance degrades.

Diagnosis:

# Check memory usage
redis-cli INFO memory

# Find large keys
redis-cli --bigkeys

# Check eviction policy
redis-cli CONFIG GET maxmemory-policy

Resolution:

Immediate: Flush non-critical cache namespaces:

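# Use SCAN instead of KEYS (KEYS blocks the server while it walks the whole keyspace)
# and UNLINK instead of DEL so memory is reclaimed in the background
redis-cli --scan --pattern "cache:leaderboard:*" | xargs -r -n 100 redis-cli UNLINK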

Short-term: Increase maxmemory limit or scale Redis.
Long-term:
- Set appropriate TTLs on all cache keys.
- Switch to allkeys-lru eviction policy.
- Audit cache key sizes and optimize serialization.
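
To compare the diagnosis output with the 85% alert threshold, memory utilization can be computed from the `used_memory` and `maxmemory` fields of `INFO memory`. This is a minimal sketch; the function name `redis_mem_pct` is an assumption.

```bash
#!/usr/bin/env bash
# redis_mem_pct: reads "redis-cli INFO memory" output on stdin and prints
# used_memory as a percentage of maxmemory, or "unlimited" if maxmemory is 0.
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%.1f\n", used / max * 100
    }'
}
```

Usage: `redis-cli INFO memory | redis_mem_pct`.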

Runbook 3: AI API Rate Limiting

Symptoms: AI-powered features return errors or timeouts, logs show 429 status codes from OpenAI.

Diagnosis:

# Check rate limit headers in recent responses
grep "x-ratelimit" /var/log/skillquest/ai-service.log | tail -20

# Check current queue depth
redis-cli LLEN ai:generation:queue

Resolution:

Immediate: Enable request queuing with exponential backoff (already built-in).

Short-term: Reduce concurrent AI requests:

# Update environment variable
AI_MAX_CONCURRENT=3

Long-term:
- Implement aggressive caching for similar prompts.
- Pre-generate common question types during off-peak hours.
- Request higher rate limits from OpenAI.
- Consider a fallback model (e.g., gpt-4o-mini) for non-critical requests.
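
The retry schedule produced by exponential backoff can be illustrated as follows. This is a sketch of the general technique, not the service's built-in implementation; the function name `backoff_delay`, the 1-second base, and the 60-second cap are assumptions. Production implementations usually add random jitter, which is omitted here.

```bash
#!/usr/bin/env bash
# backoff_delay ATTEMPT [BASE_SECONDS] [CAP_SECONDS]
# Prints the delay before retry number ATTEMPT (0-based):
# BASE * 2^ATTEMPT, capped at CAP.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-60}" delay
  if [ "$attempt" -gt 20 ]; then
    delay="$cap"                      # avoid integer overflow for large attempts
  else
    delay=$(( base * (1 << attempt) ))
    if [ "$delay" -gt "$cap" ]; then
      delay="$cap"
    fi
  fi
  echo "$delay"
}
```

With the defaults, attempts 0 through 6 wait 1, 2, 4, 8, 16, 32, and 60 seconds.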

Maintenance Windows

Scheduled Maintenance

Activity	Schedule	Duration	Impact
Database vacuum/analyze	Daily 03:00 UTC	~10 min	None
Database migrations	As needed (announced)	5–30 min	Brief downtime
Certificate renewal	Auto (Let’s Encrypt)	< 1 min	None
OS/security patches	Weekly Sunday 04:00 UTC	~15 min	Rolling update
Major version upgrades	Quarterly (announced)	1–2 hours	Planned outage

Maintenance Mode

# Enable maintenance mode (returns 503 to all non-health requests)
kubectl -n skillquest set env deployment/skillquest-api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl -n skillquest set env deployment/skillquest-api MAINTENANCE_MODE=false
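
To avoid leaving maintenance mode enabled after a failed task, the two commands can be wrapped around the task. The wrapper below is a sketch: the function name `with_maintenance` and the `KUBECTL` override variable are assumptions for illustration. Note that `kubectl set env` triggers a rolling restart of the deployment, and this sketch does not wait for the rollout to finish.

```bash
#!/usr/bin/env bash
# with_maintenance COMMAND [ARGS...]
# Enables maintenance mode, runs COMMAND, then disables maintenance mode
# regardless of whether COMMAND succeeded. Returns COMMAND's exit status.
# KUBECTL is a hypothetical override for the kubectl binary.
with_maintenance() {
  local kubectl="${KUBECTL:-kubectl}" status=0
  "$kubectl" -n skillquest set env deployment/skillquest-api MAINTENANCE_MODE=true || return 1
  "$@" || status=$?
  "$kubectl" -n skillquest set env deployment/skillquest-api MAINTENANCE_MODE=false
  return "$status"
}
```

Usage, with a hypothetical migration script: `with_maintenance ./scripts/run-migrations.sh`.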

On-Call Procedures

Escalation Path

Level	Response Time	Contact	Scope
P1	15 min	On-call engineer	Service down, data loss risk
P2	1 hour	On-call engineer	Degraded performance
P3	4 hours	Team lead	Non-critical issues
P4	Next business day	Any engineer	Minor bugs, improvements

On-Call Checklist

1. Acknowledge the alert within the SLA.
2. Check the relevant Grafana dashboard.
3. Follow the appropriate runbook.
4. If not resolved in 30 minutes, escalate.
5. Post an incident report within 24 hours.
