Resources

100 Monitoring & Observability resources for developers

Effective monitoring in modern infrastructure requires moving beyond simple uptime checks to a unified observability strategy that handles high-cardinality data, distributed tracing for AI microservices, and precise LLM cost tracking. This guide provides a curated selection of tools and patterns to build a production-ready stack using OpenTelemetry, specialized LLM proxies, and open-source metric engines.

Core Infrastructure and Error Tracking

  1. Sentry Error Tracking (beginner · high)

    Implement the Sentry SDK to capture unhandled exceptions and performance bottlenecks. Use the 'breadcrumbs' feature to reconstruct the state leading to a crash.

  2. Prometheus Time-Series Database (intermediate · high)

    Deploy Prometheus for scraping metrics from exporters. Use it to store system-level data like CPU/RAM and custom application business metrics.
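Prometheus pulls metrics by scraping an HTTP endpoint that serves its text exposition format. As a minimal stdlib-only sketch (the metric names and port are illustrative, and a real service would use the official `prometheus_client` library instead):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_metric(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format."""
    if labels:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "\n".join([
            # Hypothetical application metrics.
            format_metric("app_requests_total", 1027, {"method": "GET", "status": "200"}),
            format_metric("app_memory_bytes", 52428800),
        ]) + "\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body.encode())

# HTTPServer(("", 9100), MetricsHandler).serve_forever()  # scrape target on :9100
```

Point a `scrape_configs` job at `:9100/metrics` and Prometheus will ingest these samples on each scrape interval.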

  3. Grafana Visualization (beginner · standard)

    Connect Prometheus and Loki data sources to build real-time dashboards. Use community templates for Node Exporter and PostgreSQL monitoring.

  4. Better Stack Uptime (beginner · standard)

    Configure external heartbeat monitoring and status pages. Integrate with Slack to receive an immediate notification if a public endpoint fails.

  5. VictoriaMetrics (advanced · high)

    A high-performance, cost-effective drop-in replacement for Prometheus. Use it for long-term storage of metrics with better compression ratios.

  6. Vector.dev Log Routing (intermediate · medium)

    Deploy Vector as a sidecar or aggregator to collect, transform, and route logs from various sources to Loki, S3, or Datadog.
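Vector itself is configured in TOML or YAML, but its collect → transform → route pipeline is easy to picture in plain Python. A sketch of the idea (the field names and sink labels are hypothetical, and this is not Vector's API):

```python
import json

def transform(raw_line):
    """Parse a JSON log line and drop a noisy field (Vector's 'remap' idea)."""
    event = json.loads(raw_line)
    event.pop("debug_payload", None)   # hypothetical noisy field
    return event

def route(event):
    """Pick a sink by severity, like Vector's 'route' transform."""
    return "s3_archive" if event.get("level") == "debug" else "loki"

line = '{"level": "error", "msg": "db timeout", "debug_payload": "..."}'
sink = route(transform(line))
```

In a real deployment the same logic lives declaratively in `vector.yaml`, so routing changes never require an application redeploy.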

  7. Grafana Loki (intermediate · standard)

    A horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It indexes metadata rather than the full log content.

  8. Checkly Synthetic Monitoring (intermediate · high)

    Write Playwright scripts to simulate user flows in production. Checkly runs these on a schedule to ensure critical paths (like checkout) work.

  9. Netdata Real-time Monitoring (beginner · medium)

    Install on individual nodes for per-second granularity monitoring. Ideal for troubleshooting immediate performance spikes on specific servers.

  10. Highlight.io Session Replay (beginner · medium)

    Open-source full-stack monitoring that includes session replay, allowing you to see exactly what the user did before an error occurred.

LLM Observability and Cost Management

  1. Helicone Proxy (beginner · high)

    Route OpenAI or Anthropic requests through Helicone to get an instant dashboard of costs, latency, and request/response logs.
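Under the hood, cost dashboards like this multiply token counts by per-model prices. A sketch of that arithmetic (the prices below are illustrative placeholders, not current vendor pricing):

```python
# Per-million-token prices in dollars; illustrative numbers only.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request, given its token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=1200, output_tokens=300)
```

Aggregating this per user or per feature flag is what turns raw request logs into an actionable cost report.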

  2. LangSmith Tracing (intermediate · high)

    Use LangChain's observability tool to debug complex chains. It visualizes the inputs and outputs of every step in your LLM pipeline.

  3. LiteLLM Proxy (intermediate · medium)

    A unified interface to call 100+ LLMs. Use its logging feature to export data to S3, Langfuse, or Mixpanel for centralized analysis.
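The core pattern behind a unified interface is a dispatcher keyed on a `provider/model` string. A stripped-down sketch with stub backends (this is the general pattern, not LiteLLM's actual API):

```python
def _openai_call(model, prompt):      # stub; a real client call would go here
    return f"[openai:{model}] {prompt}"

def _anthropic_call(model, prompt):   # stub
    return f"[anthropic:{model}] {prompt}"

PROVIDERS = {"openai": _openai_call, "anthropic": _anthropic_call}

def completion(model, prompt):
    """Route 'provider/model' strings to the matching backend."""
    provider, _, name = model.partition("/")
    return PROVIDERS[provider](name, prompt)

reply = completion("anthropic/claude-sonnet", "hello")
```

Because every backend is hidden behind one signature, logging, cost tracking, and export hooks only need to be written once.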

  4. Langfuse Open-Source Tracing (intermediate · medium)

    Self-hostable alternative for LLM application tracing. It provides specific metrics for token usage and latency across different model providers.

  5. Arize Phoenix (advanced · high)

    An open-source library for ML observability that helps evaluate LLM responses and detect embedding drift in vector databases.

  6. Portkey AI Gateway (intermediate · medium)

    Control your LLM traffic with features like fallback, retries, and load balancing while collecting detailed observability data.
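The retry-then-fallback behavior a gateway provides can be sketched generically in a few lines (this is the pattern itself, not Portkey's configuration syntax; the provider functions are stand-ins):

```python
import time

def call_with_fallback(providers, prompt, retries=2, backoff=0.1):
    """Try each provider in order; retry transient failures before falling back."""
    last_error = None
    for call in providers:
        for attempt in range(retries):
            try:
                return call(prompt)
            except Exception as exc:       # narrow the exception type in real code
                last_error = exc
                time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error

attempts = []

def flaky_primary(prompt):
    attempts.append("primary")
    raise TimeoutError("provider down")

def stable_backup(prompt):
    attempts.append("backup")
    return f"echo: {prompt}"

result = call_with_fallback([flaky_primary, stable_backup], "hi", backoff=0)
```

A gateway does the same thing out of process, which keeps the retry policy centralized instead of duplicated across every service.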

  7. Promptfoo Evaluation (intermediate · high)

    Run test cases against your prompts to measure output quality and catch regressions before deploying new prompt versions to production.
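Promptfoo is driven by a YAML config, but the underlying loop is simple: run each case through the model and assert on the output. A minimal sketch of that loop (hypothetical case format, not promptfoo's schema):

```python
import re

def run_evals(generate, cases):
    """Run each test case through `generate`; return prompts that failed."""
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        if not re.search(case["expect_regex"], output):
            failures.append(case["prompt"])
    return failures

cases = [
    {"prompt": "Reply with YES or NO: is 2+2=4?", "expect_regex": r"\bYES\b"},
]
```

Wiring a harness like this into CI is what makes prompt changes reviewable the same way code changes are.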

  8. Weights & Biases (W&B) Prompts (advanced · medium)

    Visualize the execution flow of your LLM programs and track model performance over time with their specialized prompt logging.

  9. Lunary LLM Stack (beginner · standard)

    Open-source observability for LLMs including cost tracking, user analytics, and prompt versioning in a single dashboard.

  10. Parea AI (intermediate · medium)

    A developer platform to debug, test, and monitor LLM apps. It allows you to run evaluations on production data to find edge case failures.

Distributed Tracing and OpenTelemetry

  1. OpenTelemetry Collector (advanced · high)

    Set up a central collector to receive OTLP data and export it to multiple backends simultaneously (e.g., Jaeger and Datadog).
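A minimal Collector config for this fan-out looks roughly like the fragment below; the Jaeger endpoint and the `DD_API_KEY` environment variable are placeholders to adapt to your deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317        # placeholder endpoint
    tls:
      insecure: true
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, datadog]
```

Listing several exporters in one pipeline is all it takes to dual-write, which makes backend migrations reversible.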

  2. Jaeger Distributed Tracing (intermediate · medium)

    Deploy Jaeger to visualize the path of a request across microservices. Essential for finding latency bottlenecks in complex AI pipelines.

  3. SigNoz Observability Suite (intermediate · high)

    An open-source alternative to Datadog that combines metrics, traces, and logs in one UI, built on top of OpenTelemetry and ClickHouse.

  4. Honeycomb.io High-Cardinality Analysis (advanced · high)

    Use Honeycomb for debugging complex systems where you need to group by user ID, request ID, or other high-cardinality fields.
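The essence of high-cardinality analysis is a group-by over a field with millions of distinct values. A toy stdlib version of the query Honeycomb runs at scale (the event shape here is hypothetical):

```python
from collections import Counter

def slowest_groups(events, field, top=3):
    """Average latency per value of a high-cardinality field (e.g. user_id)."""
    totals, counts = Counter(), Counter()
    for e in events:
        key = e[field]
        totals[key] += e["duration_ms"]
        counts[key] += 1
    averages = {k: totals[k] / counts[k] for k in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top]

events = [
    {"user_id": "u1", "duration_ms": 40},
    {"user_id": "u2", "duration_ms": 900},
    {"user_id": "u1", "duration_ms": 60},
]
worst = slowest_groups(events, "user_id")
```

Pre-aggregated metrics cannot answer this kind of question, because the label set would explode; that is the gap event-based tools fill.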

  5. Grafana Tempo (advanced · medium)

    A high-scale distributed tracing backend that is deeply integrated with Grafana and Loki, allowing you to jump from logs to traces.

  6. Uptrace (intermediate · standard)

    An open-source distributed tracing tool that uses ClickHouse to store data and provides an easy-to-use UI for OTel data analysis.

  7. HyperDX Unified Platform (beginner · medium)

    An open-source tool that correlates logs and traces automatically, making it easier to find the root cause of an error in a microservice.

  8. ClickHouse for Observability (advanced · high)

    Utilize ClickHouse as the storage engine for logs and traces due to its superior columnar compression and query performance for analytical data.

  9. OpenTelemetry SDK Auto-Instrumentation (beginner · high)

    Use OTel zero-code instrumentation for Java, Python, or Node.js to start collecting traces without modifying your application code.

  10. Pyroscope Continuous Profiling (advanced · medium)

    Integrate Pyroscope to continuously monitor CPU and memory usage at the function level, identifying lines of code causing performance regressions.
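Continuous profilers sample stacks in production around the clock; for a one-shot approximation of the same function-level attribution, the stdlib's `cProfile` is enough (the workload function here is a deliberately CPU-heavy stand-in):

```python
import cProfile
import pstats

def hot_function():
    # Deliberately CPU-heavy so it shows up in the profile.
    return sum(i * i for i in range(200_000))

def profile_functions(fn):
    """Run `fn` under cProfile; map function name -> cumulative seconds."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats keys are (filename, lineno, funcname) tuples;
    # cumulative time sits at index 3 of each value tuple.
    return {key[2]: value[3] for key, value in stats.stats.items()}

timings = profile_functions(hot_function)
```

Pyroscope's advantage over this snapshot approach is low-overhead sampling that runs continuously, so you can diff profiles across deploys instead of reproducing a regression locally.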