# Monitoring Setup Guide
This guide explains how to set up comprehensive monitoring for the AI Assistant System.
## Overview
Monitoring is essential for maintaining system health, performance, and reliability. This guide covers setting up monitoring with Prometheus and Grafana.
## Components
- Prometheus: Metrics collection and storage
- Grafana: Visualization and alerting
- Node Exporter: System metrics
- cAdvisor: Container metrics
## Prometheus Setup

### Configuration

Create a `prometheus.yml` configuration file (the Docker Compose setup below mounts it from `./monitoring/`):
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'ai-assistant'
    static_configs:
      - targets: ['app:8000']
    metrics_path: '/metrics'
    scrape_interval: 5s

  # Note: Redis and PostgreSQL do not expose Prometheus metrics natively.
  # These jobs assume exporter-style endpoints (e.g. redis_exporter,
  # postgres_exporter) are reachable at the listed addresses; adjust the
  # targets to point at your exporters.
  - job_name: 'redis'
    static_configs:
      - targets: ['redis:6379']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres:5432']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```
### Docker Compose for Monitoring
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
```
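Once the stack is running (for example via `docker compose up -d`), it is worth confirming that Prometheus can actually reach its scrape targets. A minimal check, assuming the port mappings above and the `requests` package; the script name is illustrative:

```python
# check_targets.py - quick health check against the Prometheus HTTP API.
# Assumes Prometheus is reachable on localhost:9090, as mapped in the Compose file above.
import requests

def list_targets(base_url: str = "http://localhost:9090") -> None:
    resp = requests.get(f"{base_url}/api/v1/targets", timeout=5)
    resp.raise_for_status()
    for target in resp.json()["data"]["activeTargets"]:
        # Print job name, scrape URL, and current health (up / down / unknown)
        print(f"{target['labels'].get('job', '?'):15} "
              f"{target['scrapeUrl']:40} health={target['health']}")

if __name__ == "__main__":
    list_targets()
```

The same information is visible in the Prometheus UI under Status → Targets at http://localhost:9090.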
## Application Metrics

### Prometheus Integration
Add Prometheus metrics to your application:
```python
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from fastapi import Request, Response
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Active connections'
)

TOOL_EXECUTION_COUNT = Counter(
    'tool_executions_total',
    'Total tool executions',
    ['tool_name', 'status']
)

TOOL_EXECUTION_DURATION = Histogram(
    'tool_execution_duration_seconds',
    'Tool execution duration',
    ['tool_name']
)

CACHE_HIT_RATE = Gauge(
    'cache_hit_rate',
    'Cache hit rate',
    ['cache_type']
)

# Metrics middleware ("app" is your existing FastAPI instance)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    # Increment active connections
    ACTIVE_CONNECTIONS.inc()

    try:
        response = await call_next(request)

        # Record request count and duration
        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=request.url.path,
            status_code=response.status_code
        ).inc()

        REQUEST_DURATION.labels(
            method=request.method,
            endpoint=request.url.path
        ).observe(time.time() - start_time)

        return response
    finally:
        # Decrement active connections
        ACTIVE_CONNECTIONS.dec()

# Metrics endpoint scraped by Prometheus
@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
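The request metrics update themselves through the middleware, but `CACHE_HIT_RATE` has to be fed by the caching layer. A minimal sketch, assuming a hypothetical async cache wrapper (`InstrumentedCache` and its backend are illustrative, not part of the system):

```python
class InstrumentedCache:
    """Hypothetical cache wrapper that reports its hit rate to Prometheus."""

    def __init__(self, backend, cache_type: str = "redis"):
        self._backend = backend          # any object with an async get(key) method
        self._cache_type = cache_type
        self._hits = 0
        self._misses = 0

    async def get(self, key):
        value = await self._backend.get(key)
        if value is None:
            self._misses += 1
        else:
            self._hits += 1
        total = self._hits + self._misses
        # Update the CACHE_HIT_RATE gauge defined above with the running hit rate
        CACHE_HIT_RATE.labels(cache_type=self._cache_type).set(self._hits / total)
        return value
```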
### Tool Metrics
Add metrics for tool execution:
```python
from app.core.tools.base import BaseTool, ToolResult
import time

class MetricsToolMixin:
    """Mixin that records the tool metrics defined above around each execution."""

    async def execute_with_metrics(self, parameters) -> ToolResult:
        tool_name = self.name
        start_time = time.time()

        try:
            result = await self.execute(parameters)

            # Record success
            TOOL_EXECUTION_COUNT.labels(
                tool_name=tool_name,
                status='success'
            ).inc()

            return result
        except Exception:
            # Record failure, then re-raise so callers still see the error
            TOOL_EXECUTION_COUNT.labels(
                tool_name=tool_name,
                status='error'
            ).inc()
            raise
        finally:
            # Record execution time regardless of outcome
            TOOL_EXECUTION_DURATION.labels(
                tool_name=tool_name
            ).observe(time.time() - start_time)
```
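How the mixin gets wired in depends on how tools are registered in your codebase. One possible pattern, using a hypothetical `WebSearchTool` for illustration, is to mix it into the concrete tool class and have the orchestrator call `execute_with_metrics()` instead of `execute()`:

```python
# Hypothetical example: combine the mixin with an existing tool class.
class InstrumentedWebSearchTool(MetricsToolMixin, WebSearchTool):
    """WebSearchTool whose executions are counted and timed."""
    pass

# In the orchestrator, call the instrumented entry point instead of execute():
# result = await tool.execute_with_metrics({"query": "prometheus histogram buckets"})
```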
## Grafana Dashboards

### Application Dashboard

Create a comprehensive dashboard for monitoring the application, covering the following panels (example PromQL queries are sketched after the list):
- Request Rate: Total requests per second
- Response Time: P95, P99 response times
- Error Rate: Percentage of failed requests
- Active Connections: Current active connections
- Tool Usage: Most used tools
- Cache Performance: Hit rates and miss rates
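These panels map directly onto the metrics defined earlier. The sketch below collects candidate PromQL expressions (the strings are what you would paste into Grafana panel queries) and shows how to try them ad hoc against the Prometheus HTTP API; the label names assume the metric definitions from this guide:

```python
import requests

# Candidate PromQL for the application dashboard panels.
APP_PANELS = {
    "request_rate":   'sum(rate(http_requests_total[5m]))',
    "p95_latency":    'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    "p99_latency":    'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    "error_rate":     'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
    "active_conns":   'active_connections',
    "tool_usage":     'topk(5, sum(rate(tool_executions_total[1h])) by (tool_name))',
    "cache_hit_rate": 'cache_hit_rate',
}

def run_query(expr: str, base_url: str = "http://localhost:9090"):
    """Run an instant query against the Prometheus HTTP API."""
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for panel, expr in APP_PANELS.items():
        print(panel, run_query(expr))
```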
### System Dashboard

Monitor system resources (example queries are sketched after the list):
- CPU Usage: System and application CPU usage
- Memory Usage: Memory consumption
- Disk Usage: Disk space and I/O
- Network Traffic: Network I/O
- Container Metrics: Container resource usage
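These panels are fed by Node Exporter and cAdvisor rather than the application. Candidate expressions, using the standard metric names those exporters publish; they can be run with the `run_query()` helper from the previous sketch or used directly in Grafana:

```python
# Candidate PromQL for the system dashboard.
SYSTEM_PANELS = {
    "cpu_usage":     '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
    "memory_usage":  '(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes',
    "disk_usage":    '1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)',
    "network_rx":    'rate(node_network_receive_bytes_total[5m])',
    "network_tx":    'rate(node_network_transmit_bytes_total[5m])',
    "container_cpu": 'sum(rate(container_cpu_usage_seconds_total[5m])) by (name)',
    "container_mem": 'sum(container_memory_usage_bytes) by (name)',
}
```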
## Alerting

### Alert Rules

Create `alert_rules.yml`:
```yaml
groups:
  - name: ai-assistant-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is {{ $value }}s"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} of disk space is available"

      - alert: ToolExecutionFailure
        expr: rate(tool_executions_total{status="error"}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High tool failure rate"
          description: "Tool {{ $labels.tool_name }} is failing at {{ $value }} executions per second"
```
### Alertmanager Configuration

Create `alertmanager.yml`:
```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  # Add smtp_auth_username / smtp_auth_password if your SMTP server requires authentication.

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: '[AI Assistant Alert] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'AI Assistant Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
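Before relying on this routing in production, it helps to verify the notification pipeline end to end. One way is to push a synthetic alert straight to Alertmanager's v2 API; a minimal sketch, assuming the `9093` port mapping from the monitoring Compose file:

```python
import datetime
import requests

def send_test_alert(base_url: str = "http://localhost:9093") -> None:
    """Push a synthetic alert so email/Slack routing can be verified."""
    now = datetime.datetime.now(datetime.timezone.utc)
    alert = [{
        "labels": {"alertname": "MonitoringPipelineTest", "severity": "warning"},
        "annotations": {
            "summary": "Test alert from the monitoring setup guide",
            "description": "If you received this, Alertmanager routing works."
        },
        # Auto-resolve after five minutes
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{base_url}/api/v2/alerts", json=alert, timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    send_test_alert()
```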
## Log Monitoring

### Structured Logging
Implement structured logging for better log analysis:
```python
import logging
import structlog

# Route structlog output through the standard library so handlers and levels apply
logging.basicConfig(format="%(message)s", level=logging.INFO)

structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()
```
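With this configuration every log line is emitted as a single JSON object, and request- or tool-scoped context can be bound once and carried through all subsequent messages. A short usage sketch (the field names are illustrative):

```python
# Bind per-request context once; every subsequent line carries it as JSON fields.
request_logger = logger.bind(request_id="req-1234", endpoint="/api/chat")

request_logger.info("request_started", method="POST")
request_logger.info("tool_executed", tool_name="web_search", duration_ms=184)

try:
    raise ValueError("upstream timeout")
except ValueError:
    # format_exc_info renders the traceback into the JSON "exception" field
    request_logger.error("tool_failed", tool_name="web_search", exc_info=True)
```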
### Log Aggregation

Set up log aggregation with the ELK stack (Elasticsearch, Logstash, Kibana) or a similar pipeline; a sketch for shipping the structured logs above into Logstash follows the Compose file:
```yaml
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
    environment:
      - discovery.type=single-node
      # Security is enabled by default in Elasticsearch 8.x; disabling it here
      # is for local development only.
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.5.0
    ports:
      - "5044:5044"
    volumes:
      - ./monitoring/logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.5.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:
```
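The contents of `logstash.conf` are not covered here. As one illustration of how the structured JSON logs could reach the stack: if the pipeline defined a TCP input with a `json_lines` codec on a published port (a hypothetical setup; neither that input nor the port mapping is part of the Compose file above), the application could ship log lines to it directly:

```python
# Hypothetical sender: assumes logstash.conf defines
#   input { tcp { port => 5000 codec => json_lines } }
# and that port 5000 is published in the Compose file (it is not by default).
import json
import socket

def ship_log(event: dict, host: str = "localhost", port: int = 5000) -> None:
    line = json.dumps(event) + "\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))

ship_log({"event": "request_started", "request_id": "req-1234", "level": "info"})
```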
## Best Practices
- Relevant Metrics: Monitor metrics that matter for your business
- Thresholds: Set appropriate alert thresholds
- Testing: Regularly test alerting and monitoring systems
- Documentation: Document monitoring setup and runbooks
- Review: Regularly review and update monitoring configuration
- Retention: Configure appropriate data retention policies
- Security: Secure access to monitoring systems