
Monitoring & Observability implementation checklist

This checklist outlines the technical requirements for establishing a robust observability stack, covering error tracking, LLM performance monitoring, infrastructure metrics, distributed tracing, and alerting workflows to ensure production reliability and cost control.


Error Tracking & Exception Management

  • Global Exception Handler Integration

    critical

    Configure a global error handler (e.g., Sentry, Highlight.io) at the application entry point to capture uncaught exceptions and unhandled promise rejections (a minimal initialization sketch follows this list).

  • Source Map Upload Pipeline

    critical

    Integrate source map uploads into the CI/CD pipeline to ensure stack traces map back to original source code rather than minified bundles.

  • User Context Enrichment

    recommended

    Attach non-PII user identifiers and session IDs to error reports to facilitate reproduction and impact assessment of user-specific bugs.

  • Release Tracking Configuration

    critical

    Tag every error with the commit SHA or release version so regressions can be traced to the deployment that introduced them.

  • Breadcrumb Logging Implementation

    recommended

    Configure automatic capture of network requests, console logs, and state changes leading up to a crash for debugging context.
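
As a reference for several of the items above (global handler, user context, release tagging), here is a minimal initialization sketch using the Sentry Node SDK in TypeScript. The environment variable names (SENTRY_DSN, GIT_SHA) and the identifyUser helper are illustrative assumptions, not required names; Highlight.io and similar SDKs expose equivalent options.

```typescript
import * as Sentry from "@sentry/node";

// Initialize as early as possible so the SDK's default integrations can hook
// uncaught exceptions and unhandled promise rejections.
Sentry.init({
  dsn: process.env.SENTRY_DSN,       // project DSN (placeholder)
  release: process.env.GIT_SHA,      // assumes CI exports the deployed commit SHA
  environment: process.env.NODE_ENV,
});

// Attach non-PII identifiers once a session is authenticated so error reports
// can be correlated with a user and session without exposing personal data.
export function identifyUser(userId: string, sessionId: string): void {
  Sentry.setUser({ id: userId });    // opaque ID only; avoid email or name
  Sentry.setTag("session_id", sessionId);
}
```

Setting `release` at init time is what lets the backend group errors by deployment; source maps uploaded in CI typically need to reference the same release string for stack traces to resolve.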

LLM Observability & Cost Control

  • Token Usage Instrumentation

    critical

    Implement middleware (e.g., Helicone, LangSmith) to log prompt and completion tokens for every LLM provider call to track unit costs (see the instrumentation sketch after this list).

  • Provider Latency Measurement

    critical

    Measure Time-To-First-Token (TTFT) and total request duration for all LLM API calls to identify provider-side performance degradation.

  • Model Version Tagging

    recommended

    Explicitly tag metrics with specific model versions (e.g., gpt-4-0613 vs gpt-4-turbo) to evaluate performance and cost deltas.

  • Request/Response Payload Logging

    recommended

    Securely log prompt inputs and model outputs for a sampled percentage of requests to monitor for hallucinations or drift.

  • Semantic Search Latency Tracking

    optional

    Monitor the latency of vector database queries and embedding generation steps in RAG pipelines.
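
A sketch of per-call instrumentation for token usage, latency, and model version, written here directly against the OpenAI Node SDK; the tracedCompletion wrapper name and the structured-log shape are assumptions, and a hosted proxy such as Helicone or LangSmith captures the same fields without hand-written wrappers.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function tracedCompletion(model: string, prompt: string) {
  const start = Date.now();
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  const durationMs = Date.now() - start;

  // Emit one structured log line per call; a metrics pipeline can aggregate
  // these into per-model cost and latency dashboards.
  console.log(JSON.stringify({
    event: "llm_call",
    model,                                          // tag the exact model version used
    prompt_tokens: response.usage?.prompt_tokens,
    completion_tokens: response.usage?.completion_tokens,
    duration_ms: durationMs,
  }));

  return response.choices[0].message.content;
}
```

Measuring TTFT would require issuing the request in streaming mode and recording the timestamp of the first received chunk before consuming the rest of the stream.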

Infrastructure & System Metrics

  • Node/Instance Resource Alerts

    critical

    Set up Prometheus or Datadog alerts for CPU utilization >85% and memory utilization >90% sustained for over 5 minutes.

  • HTTP 5xx Error Rate Thresholds

    critical

    Configure alerts for server-side error rates exceeding 1% of total traffic over a 60-second window.

  • Database Connection Pool Monitoring

    critical

    Monitor active vs. maximum available database connections to prevent application hangs during traffic spikes (a metrics-exposure sketch follows this list).

  • Disk I/O and Space Monitoring

    recommended

    Establish alerts for disk space usage exceeding 80% and high I/O wait times on database volumes.

  • Queue Depth and Lag Tracking

    recommended

    For worker-based systems, monitor the number of pending messages and the time-to-process for background jobs.
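
For the connection-pool and queue-depth items, a minimal sketch of custom gauges exposed with prom-client; the metric names, port, and the getPoolUsage/getQueueDepth helpers are placeholders for whatever your pool and queue clients actually report.

```typescript
import client from "prom-client";
import http from "node:http";

const poolInUse = new client.Gauge({
  name: "db_pool_connections_in_use",
  help: "Connections currently checked out of the pool",
});
const queueDepth = new client.Gauge({
  name: "worker_queue_depth",
  help: "Messages waiting to be processed",
});

// Hypothetical async sources; replace with your pool and queue client calls.
async function getPoolUsage(): Promise<number> { return 0; }
async function getQueueDepth(): Promise<number> { return 0; }

// Sample the sources every 15 seconds.
setInterval(async () => {
  poolInUse.set(await getPoolUsage());
  queueDepth.set(await getQueueDepth());
}, 15_000);

// Expose /metrics for the Prometheus scraper; alert rules can then fire when
// the gauges approach their configured maxima.
http.createServer(async (_req, res) => {
  res.setHeader("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
}).listen(9464);
```

The alert thresholds themselves (CPU, memory, 5xx rate, disk) live in Prometheus or Datadog rule definitions that query series like these.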

Distributed Tracing & OpenTelemetry

  • Trace ID Propagation

    critical

    Ensure trace headers (e.g., W3C Trace Context) are passed across all service boundaries, including internal microservices and external proxies (an OpenTelemetry setup sketch follows this list).

  • Database Query Instrumentation

    recommended

    Enable auto-instrumentation for database drivers to capture slow queries as spans within a distributed trace.

  • Sampling Rate Calibration

    recommended

    Configure trace sampling rates (e.g., 10% for high-volume routes, 100% for traces containing errors) to balance visibility with storage costs.

  • External API Dependency Mapping

    recommended

    Instrument all outgoing HTTP calls to third-party services to identify which external dependency is causing latency.

  • Async Task Trace Linking

    optional

    Ensure that background jobs inherit the trace context from the triggering HTTP request for end-to-end visibility (see the propagation sketch after this list).
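
An OpenTelemetry bootstrap sketch for a Node/TypeScript service covering propagation, auto-instrumented HTTP and database spans, and head-based sampling; the service name and 10% ratio are assumptions, and the package names reflect the current OpenTelemetry JS distribution.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  serviceName: "api-gateway",                      // placeholder service name
  // Keep 10% of root traces; child spans follow their parent's decision so
  // sampled traces stay complete across service boundaries.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  // Auto-instruments HTTP clients/servers and common database drivers, and
  // propagates W3C Trace Context headers on outgoing calls by default.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start(); // call before the application imports its web framework
```

Keeping 100% of error traces generally requires tail-based sampling in a collector, since a head sampler like the one above decides before it knows whether the trace will contain an error.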
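
And a sketch of linking background jobs to the originating request (the last item above): the producer injects the active trace context into the queue message, and the worker extracts it before starting its span. The message shape and function names are hypothetical.

```typescript
import { context, propagation, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("background-jobs");

// Producer side: serialize the active trace context alongside the job data.
export function enqueueJob(payload: Record<string, unknown>) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier); // writes traceparent/tracestate keys
  return { payload, otel: carrier };             // hypothetical queue message shape
}

// Worker side: restore the context so the job's span becomes a child of the
// span that enqueued it, giving end-to-end visibility in a single trace.
export async function handleJob(message: { payload: unknown; otel: Record<string, string> }) {
  const parentCtx = propagation.extract(context.active(), message.otel);
  await context.with(parentCtx, () =>
    tracer.startActiveSpan("process-job", async (span) => {
      try {
        // ... actual job work goes here ...
      } finally {
        span.end();
      }
    })
  );
}
```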

Alerting & Incident Workflow

  • External Uptime Heartbeats

    critical

    Configure external probes (e.g., BetterStack) to check the /health endpoint from multiple geographic regions every 60 seconds.

  • On-Call Escalation Policy

    critical

    Define a clear escalation path in an incident management tool to ensure critical alerts page a human responder within 5 minutes.

  • Alert Severity Categorization

    recommended

    Distinguish between 'Critical' (pages a responder) and 'Warning' (Slack notification only) severities to prevent alert fatigue.

  • Synthetic Transaction Monitoring

    recommended

    Script a critical user path (e.g., login or checkout) to run every 5-15 minutes to verify functional correctness (a sample check script follows this list).

  • Post-Mortem Documentation Template

    optional

    Establish a standardized template for documenting root causes and action items after every high-severity incident.
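
A hypothetical synthetic check for a login path, intended to run on a 5-15 minute schedule from a cron job or the probe provider's scripting feature; the URL, credential variables, and latency threshold are placeholders, and the global fetch assumes Node 18+.

```typescript
const BASE_URL = process.env.CHECK_BASE_URL ?? "https://app.example.com";

async function checkLogin(): Promise<void> {
  const started = Date.now();
  const res = await fetch(`${BASE_URL}/api/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      email: process.env.SYNTHETIC_USER,  // dedicated test account, never a real user
      password: process.env.SYNTHETIC_PASS,
    }),
  });
  const elapsedMs = Date.now() - started;

  if (!res.ok || elapsedMs > 3000) {
    // A non-zero exit marks the scheduled run as failed for the alerting tool.
    console.error(`login check failed: status=${res.status} elapsed=${elapsedMs}ms`);
    process.exit(1);
  }
  console.log(`login check ok in ${elapsedMs}ms`);
}

checkLogin().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

The failed run can then be routed as 'Critical' or 'Warning' according to the severity rules above.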