Monitoring & Observability Implementation Checklist
This checklist outlines the technical requirements for establishing a robust observability stack, covering error tracking, LLM performance monitoring, infrastructure metrics, and alerting workflows to ensure production reliability and cost control.
Error Tracking & Exception Management
Global Exception Handler Integration
Critical: Configure a global error handler (e.g., Sentry, Highlight.io) at the application entry point to capture unhandled exceptions and promise rejections.
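A minimal sketch of what this could look like with the Sentry Node SDK; the environment variable names are placeholders:

```typescript
// instrument.ts — must be imported before any other application module.
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,       // project DSN from the Sentry dashboard
  release: process.env.GIT_SHA,      // ties events to a deploy (see Release Tracking below)
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,
});

// The Node SDK hooks process-level uncaughtException and unhandledRejection
// events by default, so crashes outside request handlers are still reported.
```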
Source Map Upload Pipeline
Critical: Integrate source map uploads into the CI/CD pipeline to ensure stack traces map back to original source code rather than minified bundles.
User Context Enrichment
Recommended: Attach non-PII user identifiers and session IDs to error reports to facilitate reproduction and impact assessment of user-specific bugs.
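One possible shape for this, again assuming Sentry; hashing the internal user ID keeps reports correlatable per user without storing PII in the tracker:

```typescript
import * as Sentry from "@sentry/node";
import { createHash } from "node:crypto";

// Call after authentication succeeds.
export function enrichErrorContext(userId: string, sessionId: string): void {
  Sentry.setUser({
    // Stable, non-reversible identifier derived from the internal user ID.
    id: createHash("sha256").update(userId).digest("hex").slice(0, 16),
  });
  Sentry.setTag("session_id", sessionId);
}
```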
Release Tracking Configuration
Critical: Tag every error with a specific commit SHA or version number to identify regressions introduced in specific deployments.
Breadcrumb Logging Implementation
Recommended: Configure automatic capture of network requests, console logs, and state changes leading up to a crash for debugging context.
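The Sentry SDKs capture console and network breadcrumbs automatically; domain-specific events can be added manually, roughly like this (the category and data fields are illustrative):

```typescript
import * as Sentry from "@sentry/node";

// Record a state transition worth seeing in the trail before a crash.
Sentry.addBreadcrumb({
  category: "checkout",
  message: "Cart state changed to payment_pending",
  level: "info",
  data: { cartId: "hypothetical-id" },
});
```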
LLM Observability & Cost Control
Token Usage Instrumentation
Critical: Implement middleware (e.g., Helicone, LangSmith) to log prompt and completion tokens for every LLM provider call to track unit costs.
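A hedged sketch using the OpenAI Node SDK, which returns exact token counts on every response; `recordTokenUsage` is a hypothetical hook into whatever metrics backend is in use:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical metrics hook (StatsD, Prometheus, a Helicone proxy, etc.).
function recordTokenUsage(m: { model: string; promptTokens: number; completionTokens: number }): void {
  console.log(JSON.stringify({ metric: "llm.tokens", ...m }));
}

export async function trackedCompletion(model: string, prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  // Tagging counts with the model name lets unit cost per feature be
  // computed downstream.
  if (res.usage) {
    recordTokenUsage({
      model,
      promptTokens: res.usage.prompt_tokens,
      completionTokens: res.usage.completion_tokens,
    });
  }
  return res.choices[0]?.message?.content ?? "";
}
```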
Provider Latency Measurement
Critical: Measure Time-To-First-Token (TTFT) and total request duration for all LLM API calls to identify provider-side performance degradation.
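One way to capture both numbers with a streamed OpenAI call; the measurement logic is the point rather than the specific provider:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Returns Time-To-First-Token and total duration (both in ms) for one call.
export async function measureLatency(model: string, prompt: string) {
  const start = performance.now();
  let firstTokenAt: number | null = null;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenAt === null && chunk.choices[0]?.delta?.content) {
      firstTokenAt = performance.now();
    }
  }

  return {
    ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
    totalMs: performance.now() - start,
  };
}
```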
Model Version Tagging
Recommended: Explicitly tag metrics with specific model versions (e.g., gpt-4-0613 vs. gpt-4-turbo) to evaluate performance and cost deltas.
Request/Response Payload Logging
Recommended: Securely log prompt inputs and model outputs for a sampled percentage of requests to monitor for hallucinations or drift.
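A minimal sampling gate; the environment variable name and log sink are assumptions, and the sink should have access controls since prompts and outputs may contain sensitive data:

```typescript
// 1% default sample rate; override via environment during an investigation.
const LOG_SAMPLE_RATE = Number(process.env.LLM_LOG_SAMPLE_RATE ?? "0.01");

export function maybeLogPayload(requestId: string, prompt: string, output: string): void {
  if (Math.random() >= LOG_SAMPLE_RATE) return;
  // Replace console.log with a sink that enforces access controls.
  console.log(JSON.stringify({ event: "llm.payload", requestId, prompt, output }));
}
```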
Semantic Search Latency Tracking
Optional: Monitor the latency of vector database queries and embedding generation steps in RAG pipelines.
Infrastructure & System Metrics
Node/Instance Resource Alerts
Critical: Set up Prometheus or Datadog alerts for CPU utilization >85% and memory utilization >90% sustained for over 5 minutes.
HTTP 5xx Error Rate Thresholds
Critical: Configure alerts for server-side error rates exceeding 1% of total traffic over a 60-second window.
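To alert on this, the application first has to expose the error rate. A sketch using Express and prom-client; the metric name is a suggestion:

```typescript
import express from "express";
import client from "prom-client";

const app = express();

// Counts responses by status class; a Prometheus alert rule can then fire
// when the 5xx rate over a 1-minute window exceeds 1% of total traffic.
const responses = new client.Counter({
  name: "http_responses_total",
  help: "HTTP responses by status class",
  labelNames: ["class"],
});

app.use((_req, res, next) => {
  res.on("finish", () => {
    responses.inc({ class: `${Math.floor(res.statusCode / 100)}xx` });
  });
  next();
});

// Scrape endpoint for Prometheus.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
```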
Database Connection Pool Monitoring
Critical: Monitor active vs. maximum available database connections to prevent application hangs during traffic spikes.
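With node-postgres, the pool exposes these statistics directly; a sketch exporting them as Prometheus gauges:

```typescript
import { Pool } from "pg";
import client from "prom-client";

const pool = new Pool({ max: 20 }); // connection settings come from PG* env vars

// Exporting live pool statistics makes pool exhaustion visible before
// requests start hanging while waiting for a free connection.
new client.Gauge({
  name: "pg_pool_total_connections",
  help: "Connections currently open in the pool",
  collect() {
    this.set(pool.totalCount);
  },
});

new client.Gauge({
  name: "pg_pool_waiting_requests",
  help: "Requests queued waiting for a free connection",
  collect() {
    this.set(pool.waitingCount);
  },
});
```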
Disk I/O and Space Monitoring
Recommended: Establish alerts for disk space usage exceeding 80% and high I/O wait times on database volumes.
Queue Depth and Lag Tracking
Recommended: For worker-based systems, monitor the number of pending messages and the time-to-process for background jobs.
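A rough sketch assuming BullMQ; `reportGauge` is a hypothetical metrics hook and the polling interval is arbitrary:

```typescript
import { Queue } from "bullmq";

const emailQueue = new Queue("email", {
  connection: { host: "localhost", port: 6379 },
});

// Hypothetical hook into your metrics backend.
function reportGauge(name: string, value: number): void {
  console.log(JSON.stringify({ metric: name, value }));
}

// Poll queue depth every 15s; alert when `waiting` grows while `active`
// stays flat, which usually means workers are stuck or underprovisioned.
setInterval(async () => {
  const counts = await emailQueue.getJobCounts("waiting", "active", "delayed");
  reportGauge("queue.email.waiting", counts.waiting ?? 0);
  reportGauge("queue.email.active", counts.active ?? 0);
}, 15_000);
```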
Distributed Tracing & OpenTelemetry
Trace ID Propagation
Critical: Ensure trace headers (e.g., W3C Trace Context) are passed across all service boundaries, including internal microservices and external proxies.
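With OpenTelemetry's JS API, injection into an outgoing request looks roughly like this (the W3C Trace Context propagator is the SDK default):

```typescript
import { context, propagation } from "@opentelemetry/api";

// Injects the active trace context into outgoing headers so the downstream
// service can attach its spans to the same trace.
export async function callInternalService(url: string, body: unknown): Promise<Response> {
  const headers: Record<string, string> = { "content-type": "application/json" };
  propagation.inject(context.active(), headers);
  return fetch(url, { method: "POST", headers, body: JSON.stringify(body) });
}
```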
Database Query Instrumentation
Recommended: Enable auto-instrumentation for database drivers to capture slow queries as spans within a distributed trace.
Sampling Rate Calibration
Recommended: Configure trace sampling rates (e.g., 10% for high-volume traffic, 100% for errors) to balance visibility with storage costs.
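A sketch of head-based sampling with the OpenTelemetry Node SDK; note that keeping 100% of error traces requires tail-based sampling in a collector, which head sampling alone cannot provide:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Keep 10% of root traces; child spans follow their parent's decision so a
// trace is never half-recorded across services.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```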
External API Dependency Mapping
Recommended: Instrument all outgoing HTTP calls to third-party services to identify which external dependency is causing latency.
Async Task Trace Linking
Optional: Ensure that background jobs inherit the trace context from the triggering HTTP request for end-to-end visibility.
Alerting & Incident Workflow
External Uptime Heartbeats
Critical: Configure external probes (e.g., BetterStack) to check the /health endpoint from multiple geographic regions every 60 seconds.
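The endpoint itself can be trivial, but it is worth having it verify a real dependency; a sketch with Express, where `db.ping()` is a hypothetical connectivity check:

```typescript
import express from "express";

const app = express();

// Hypothetical dependency check; replace with a real probe (e.g., SELECT 1).
const db = {
  async ping(): Promise<void> {},
};

// Return 503 when a critical dependency is down, so external probes catch
// "process is up but degraded" states rather than just liveness.
app.get("/health", async (_req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: "ok" });
  } catch {
    res.status(503).json({ status: "degraded" });
  }
});

app.listen(3000);
```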
On-Call Escalation Policy
Critical: Define a clear escalation path in an incident management tool to ensure critical alerts page a human responder within 5 minutes.
Alert Severity Categorization
Recommended: Distinguish between 'Critical' (page a responder) and 'Warning' (Slack notification only) severities to prevent alert fatigue.
Synthetic Transaction Monitoring
Recommended: Script a critical user path (e.g., login or checkout) to run every 5-15 minutes to verify functional correctness.
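A sketch of such a check as a Playwright test, scheduled externally (e.g., via cron or CI); the URL, selectors, and test-account variables are assumptions to adapt:

```typescript
import { test, expect } from "@playwright/test";

// Synthetic login check against production, using a dedicated test account.
test("login path stays functional", async ({ page }) => {
  await page.goto("https://app.example.com/login");
  await page.getByLabel("Email").fill(process.env.SYNTHETIC_USER!);
  await page.getByLabel("Password").fill(process.env.SYNTHETIC_PASS!);
  await page.getByRole("button", { name: "Sign in" }).click();
  await expect(page).toHaveURL(/dashboard/);
});
```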
Post-Mortem Documentation Template
Optional: Establish a standardized template for documenting root causes and action items after every high-severity incident.