Monitoring & Observability Implementation Checklist
This checklist outlines the technical requirements for establishing a robust observability stack, covering error tracking, LLM performance monitoring, infrastructure metrics, and alerting workflows to ensure production reliability and cost control.
Error Tracking & Exception Management
Global Exception Handler Integration
Critical: Configure a global error handler (e.g., Sentry, Highlight.io) at the application entry point to capture unhandled exceptions and promise rejections.
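A minimal sketch of what this could look like with the Sentry Node SDK; the environment variable names are placeholders:

```typescript
// instrument.ts — must be imported before any other application module.
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,       // project DSN from the Sentry dashboard
  release: process.env.GIT_SHA,      // ties events to a deploy (see Release Tracking below)
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,
});

// The Node SDK hooks process-level uncaughtException and unhandledRejection
// events by default, so crashes outside request handlers are still reported.
```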
Source Map Upload Pipeline
Critical: Integrate source map uploads into the CI/CD pipeline to ensure stack traces map back to original source code rather than minified bundles.
User Context Enrichment
Recommended: Attach non-PII user identifiers and session IDs to error reports to facilitate reproduction and impact assessment of user-specific bugs.
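One possible shape for this, again assuming Sentry; hashing the internal user ID keeps reports correlatable per user without storing PII in the tracker:

```typescript
import * as Sentry from "@sentry/node";
import { createHash } from "node:crypto";

// Call after authentication succeeds.
export function enrichErrorContext(userId: string, sessionId: string): void {
  Sentry.setUser({
    // Stable, non-reversible identifier derived from the internal user ID.
    id: createHash("sha256").update(userId).digest("hex").slice(0, 16),
  });
  Sentry.setTag("session_id", sessionId);
}
```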
Release Tracking Configuration
Critical: Tag every error with a specific commit SHA or version number to identify regressions introduced in specific deployments.
Breadcrumb Logging Implementation
Recommended: Configure automatic capture of network requests, console logs, and state changes leading up to a crash for debugging context.
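The Sentry SDKs capture console and network breadcrumbs automatically; domain-specific events can be added manually, roughly like this (the category and data fields are illustrative):

```typescript
import * as Sentry from "@sentry/node";

// Record a state transition worth seeing in the trail before a crash.
Sentry.addBreadcrumb({
  category: "checkout",
  message: "Cart state changed to payment_pending",
  level: "info",
  data: { cartId: "hypothetical-id" },
});
```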
LLM Observability & Cost Control
Token Usage Instrumentation
Critical: Implement middleware (e.g., Helicone, LangSmith) to log prompt and completion tokens for every LLM provider call to track unit costs.
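A hedged sketch using the OpenAI Node SDK, which returns exact token counts on every response; `recordTokenUsage` is a hypothetical hook into whatever metrics backend is in use:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical metrics hook (StatsD, Prometheus, a Helicone proxy, etc.).
function recordTokenUsage(m: { model: string; promptTokens: number; completionTokens: number }): void {
  console.log(JSON.stringify({ metric: "llm.tokens", ...m }));
}

export async function trackedCompletion(model: string, prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  // Tagging counts with the model name lets unit cost per feature be
  // computed downstream.
  if (res.usage) {
    recordTokenUsage({
      model,
      promptTokens: res.usage.prompt_tokens,
      completionTokens: res.usage.completion_tokens,
    });
  }
  return res.choices[0]?.message?.content ?? "";
}
```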
Provider Latency Measurement
Critical: Measure Time-To-First-Token (TTFT) and total request duration for all LLM API calls to identify provider-side performance degradation.
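One way to capture both numbers with a streamed OpenAI call; the measurement logic is the point rather than the specific provider:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Returns Time-To-First-Token and total duration (both in ms) for one call.
export async function measureLatency(model: string, prompt: string) {
  const start = performance.now();
  let firstTokenAt: number | null = null;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenAt === null && chunk.choices[0]?.delta?.content) {
      firstTokenAt = performance.now();
    }
  }

  return {
    ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
    totalMs: performance.now() - start,
  };
}
```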
Model Version Tagging
Recommended: Explicitly tag metrics with specific model versions (e.g., gpt-4-0613 vs. gpt-4-turbo) to evaluate performance and cost deltas.
Request/Response Payload Logging
Recommended: Securely log prompt inputs and model outputs for a sampled percentage of requests to monitor for hallucinations or drift.
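A minimal sampling gate; the environment variable name and log sink are assumptions, and the sink should have access controls since prompts and outputs may contain sensitive data:

```typescript
// 1% default sample rate; override via environment during an investigation.
const LOG_SAMPLE_RATE = Number(process.env.LLM_LOG_SAMPLE_RATE ?? "0.01");

export function maybeLogPayload(requestId: string, prompt: string, output: string): void {
  if (Math.random() >= LOG_SAMPLE_RATE) return;
  // Replace console.log with a sink that enforces access controls.
  console.log(JSON.stringify({ event: "llm.payload", requestId, prompt, output }));
}
```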
Semantic Search Latency Tracking
Optional: Monitor the latency of vector database queries and embedding generation steps in RAG pipelines.
Infrastructure & System Metrics
Node/Instance Resource Alerts
Critical: Set up Prometheus or Datadog alerts for CPU utilization >85% and memory utilization >90% sustained for over 5 minutes.
HTTP 5xx Error Rate Thresholds
Critical: Configure alerts for server-side error rates exceeding 1% of total traffic over a 60-second window.
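To alert on this, the application first has to expose the error rate. A sketch using Express and prom-client; the metric name is a suggestion:

```typescript
import express from "express";
import client from "prom-client";

const app = express();

// Counts responses by status class; a Prometheus alert rule can then fire
// when the 5xx rate over a 1-minute window exceeds 1% of total traffic.
const responses = new client.Counter({
  name: "http_responses_total",
  help: "HTTP responses by status class",
  labelNames: ["class"],
});

app.use((_req, res, next) => {
  res.on("finish", () => {
    responses.inc({ class: `${Math.floor(res.statusCode / 100)}xx` });
  });
  next();
});

// Scrape endpoint for Prometheus.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
```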
Database Connection Pool Monitoring
Critical: Monitor active vs. maximum available database connections to prevent application hangs during traffic spikes.
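With node-postgres, the pool exposes these statistics directly; a sketch exporting them as Prometheus gauges:

```typescript
import { Pool } from "pg";
import client from "prom-client";

const pool = new Pool({ max: 20 }); // connection settings come from PG* env vars

// Exporting live pool statistics makes pool exhaustion visible before
// requests start hanging while waiting for a free connection.
new client.Gauge({
  name: "pg_pool_total_connections",
  help: "Connections currently open in the pool",
  collect() {
    this.set(pool.totalCount);
  },
});

new client.Gauge({
  name: "pg_pool_waiting_requests",
  help: "Requests queued waiting for a free connection",
  collect() {
    this.set(pool.waitingCount);
  },
});
```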
Disk I/O and Space Monitoring
Recommended: Establish alerts for disk space usage exceeding 80% and high I/O wait times on database volumes.
Queue Depth and Lag Tracking
Recommended: For worker-based systems, monitor the number of pending messages and the time-to-process for background jobs.
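A rough sketch assuming BullMQ; `reportGauge` is a hypothetical metrics hook and the polling interval is arbitrary:

```typescript
import { Queue } from "bullmq";

const emailQueue = new Queue("email", {
  connection: { host: "localhost", port: 6379 },
});

// Hypothetical hook into your metrics backend.
function reportGauge(name: string, value: number): void {
  console.log(JSON.stringify({ metric: name, value }));
}

// Poll queue depth every 15s; alert when `waiting` grows while `active`
// stays flat, which usually means workers are stuck or underprovisioned.
setInterval(async () => {
  const counts = await emailQueue.getJobCounts("waiting", "active", "delayed");
  reportGauge("queue.email.waiting", counts.waiting ?? 0);
  reportGauge("queue.email.active", counts.active ?? 0);
}, 15_000);
```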
Distributed Tracing & OpenTelemetry
Trace ID Propagation
Critical: Ensure trace headers (e.g., W3C Trace Context) are passed across all service boundaries, including internal microservices and external proxies.
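With OpenTelemetry's JS API, injection into an outgoing request looks roughly like this (the W3C Trace Context propagator is the SDK default):

```typescript
import { context, propagation } from "@opentelemetry/api";

// Injects the active trace context into outgoing headers so the downstream
// service can attach its spans to the same trace.
export async function callInternalService(url: string, body: unknown): Promise<Response> {
  const headers: Record<string, string> = { "content-type": "application/json" };
  propagation.inject(context.active(), headers);
  return fetch(url, { method: "POST", headers, body: JSON.stringify(body) });
}
```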
Database Query Instrumentation
Recommended: Enable auto-instrumentation for database drivers to capture slow queries as spans within a distributed trace.
Sampling Rate Calibration
Recommended: Configure trace sampling rates (e.g., 10% for high-volume traffic, 100% for errors) to balance visibility with storage costs.
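A sketch of head-based sampling with the OpenTelemetry Node SDK; note that keeping 100% of error traces requires tail-based sampling in a collector, which head sampling alone cannot provide:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Keep 10% of root traces; child spans follow their parent's decision so a
// trace is never half-recorded across services.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```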
External API Dependency Mapping
Recommended: Instrument all outgoing HTTP calls to third-party services to identify which external dependency is causing latency.
Async Task Trace Linking
Optional: Ensure that background jobs inherit the trace context from the triggering HTTP request for end-to-end visibility.
Alerting & Incident Workflow
External Uptime Heartbeats
Critical: Configure external probes (e.g., BetterStack) to check the /health endpoint from multiple geographic regions every 60 seconds.
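The endpoint itself can be trivial, but it is worth having it verify a real dependency; a sketch with Express, where `db.ping()` is a hypothetical connectivity check:

```typescript
import express from "express";

const app = express();

// Hypothetical dependency check; replace with a real probe (e.g., SELECT 1).
const db = {
  async ping(): Promise<void> {},
};

// Return 503 when a critical dependency is down, so external probes catch
// "process is up but degraded" states rather than just liveness.
app.get("/health", async (_req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: "ok" });
  } catch {
    res.status(503).json({ status: "degraded" });
  }
});

app.listen(3000);
```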
On-Call Escalation Policy
Critical: Define a clear escalation path in an incident management tool to ensure critical alerts page a human responder within 5 minutes.
Alert Severity Categorization
Recommended: Distinguish between 'Critical' (page a responder) and 'Warning' (Slack notification only) severities to prevent alert fatigue.
Synthetic Transaction Monitoring
Recommended: Script a critical user path (e.g., login or checkout) to run every 5-15 minutes to verify functional correctness.
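A sketch of such a check as a Playwright test, scheduled externally (e.g., via cron or CI); the URL, selectors, and test-account variables are assumptions to adapt:

```typescript
import { test, expect } from "@playwright/test";

// Synthetic login check against production, using a dedicated test account.
test("login path stays functional", async ({ page }) => {
  await page.goto("https://app.example.com/login");
  await page.getByLabel("Email").fill(process.env.SYNTHETIC_USER!);
  await page.getByLabel("Password").fill(process.env.SYNTHETIC_PASS!);
  await page.getByRole("button", { name: "Sign in" }).click();
  await expect(page).toHaveURL(/dashboard/);
});
```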
Post-Mortem Documentation Template
Optional: Establish a standardized template for documenting root causes and action items after every high-severity incident.