100 Monitoring & Observability resources for developers
Effective monitoring in modern infrastructure requires moving beyond simple uptime checks to a unified observability strategy that handles high-cardinality data, distributed tracing for AI microservices, and precise LLM cost tracking. This guide provides a curated selection of tools and patterns to build a production-ready stack using OpenTelemetry, specialized LLM proxies, and open-source metric engines.
Core Infrastructure and Error Tracking
- 1
Sentry Error Tracking
[beginner | high] Integrate the Sentry SDK to capture unhandled exceptions and performance bottlenecks. Use the 'breadcrumbs' feature to reconstruct the sequence of events leading up to a crash.
- 2
Prometheus Time-Series Database
[intermediate | high] Deploy Prometheus to scrape metrics from exporters. Use it to store system-level data such as CPU and RAM usage alongside custom business metrics from your application.
- 3
Grafana Visualization
[beginner | standard] Connect Prometheus and Loki data sources to build real-time dashboards. Use community templates for Node Exporter and PostgreSQL monitoring.
- 4
Better Stack Uptime
[beginner | standard] Configure external heartbeat monitoring and status pages. Integrate with Slack to receive an immediate notification if a public endpoint fails.
- 5
VictoriaMetrics
[advanced | high] A high-performance, cost-effective drop-in replacement for Prometheus. Use it for long-term storage of metrics with better compression ratios.
- 6
Vector.dev Log Routing
[intermediate | medium] Deploy Vector as a sidecar or aggregator to collect, transform, and route logs from various sources to Loki, S3, or Datadog.
- 7
Grafana Loki
[intermediate | standard] A horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It indexes only metadata (labels) rather than the full log content.
- 8
Checkly Synthetic Monitoring
[intermediate | high] Write Playwright scripts to simulate user flows in production. Checkly runs them on a schedule to verify that critical paths (such as checkout) keep working.
- 9
Netdata Real-time Monitoring
[beginner | medium] Install Netdata on individual nodes for monitoring at per-second granularity. It is well suited to troubleshooting sudden performance spikes on specific servers.
- 10
Highlight.io Session Replay
[beginner | medium] Open-source full-stack monitoring that includes session replay, allowing you to see exactly what the user did before an error occurred.
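To make the Prometheus entry above more concrete: Prometheus collects metrics by scraping an HTTP endpoint that returns plain text in its exposition format. The sketch below renders a single counter in that format using only the standard library. The metric name `app_orders_total`, its labels, and the helpers `escape_label_value` and `render_counter` are hypothetical; a real service would normally let the official `prometheus_client` package produce this output.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family as Prometheus exposition text.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    Assumption: `help_text` contains no backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            rendered = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in labels
            )
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The text format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical business metric: orders processed, split by status.
    orders = {
        (("status", "paid"),): 42,
        (("status", "failed"),): 3,
    }
    print(render_counter("app_orders_total", "Orders processed.", orders), end="")
```

Serving this string from a `/metrics` route is enough for a scrape to succeed, which is a handy way to see what an exporter actually emits before adopting the client library.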
LLM Observability and Cost Management
- 1
Helicone Proxy
[beginner | high] Route OpenAI or Anthropic requests through Helicone to get an instant dashboard of costs, latency, and request/response logs.
- 2
LangSmith Tracing
[intermediate | high] Use LangChain's observability tool to debug complex chains. It visualizes the inputs and outputs of every step in your LLM pipeline.
- 3
LiteLLM Proxy
[intermediate | medium] A unified interface to call 100+ LLMs. Use its logging feature to export data to S3, Langfuse, or Mixpanel for centralized analysis.
- 4
Langfuse Open-Source Tracing
[intermediate | medium] A self-hostable, open-source option for LLM application tracing. It reports token usage and latency across different model providers.
- 5
Arize Phoenix
[advanced | high] An open-source library for ML observability that helps evaluate LLM responses and detect embedding drift in vector databases.
- 6
Portkey AI Gateway
[intermediate | medium] Control your LLM traffic with features such as fallbacks, retries, and load balancing while collecting detailed observability data.
- 7
Promptfoo Evaluation
[intermediate | high] Run test cases against your prompts to measure output quality and catch regressions before deploying new prompt versions to production.
- 8
Weights & Biases (W&B) Prompts
[advanced | medium] Visualize the execution flow of your LLM programs and track model performance over time using W&B's prompt logging.
- 9
Lunary LLM Stack
[beginner | standard] Open-source observability for LLMs including cost tracking, user analytics, and prompt versioning in a single dashboard.
- 10
Parea AI
[intermediate | medium] A developer platform to debug, test, and monitor LLM apps. It lets you run evaluations on production data to find edge-case failures.
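Cost tracking in the tools above generally comes down to multiplying the token counts reported with each response by a per-token price. As a rough illustration of that arithmetic, the sketch below totals cost per model from a list of usage records. The model names, the prices in `PRICES`, and the `Usage` record are made-up placeholders, not real provider pricing and not the API of any tool listed here.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens: (input, output). Not real pricing.
PRICES = {
    "model-small": (Decimal("0.50"), Decimal("1.50")),
    "model-large": (Decimal("5.00"), Decimal("15.00")),
}
PER_MILLION = Decimal(1_000_000)


@dataclass(frozen=True)
class Usage:
    """Token counts for one request, as a provider might report them."""
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(usage: Usage) -> Decimal:
    """Cost of a single request in USD, using the placeholder price table."""
    input_price, output_price = PRICES[usage.model]
    total = usage.input_tokens * input_price + usage.output_tokens * output_price
    return total / PER_MILLION


def cost_by_model(records) -> dict:
    """Sum request costs per model name."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at zero
    for usage in records:
        totals[usage.model] += request_cost(usage)
    return dict(totals)


if __name__ == "__main__":
    log = [
        Usage("model-small", 1_000, 500),
        Usage("model-large", 2_000, 1_000),
    ]
    for model, cost in cost_by_model(log).items():
        print(f"{model}: ${cost}")
```

`Decimal` is used instead of floats so that many small per-request amounts add up without rounding drift, which matters once totals feed into billing or budget alerts.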
Distributed Tracing and OpenTelemetry
- 1
OpenTelemetry Collector
[advanced | high] Set up a central collector to receive OTLP data and export it to multiple backends simultaneously (e.g., Jaeger and Datadog).
- 2
Jaeger Distributed Tracing
[intermediate | medium] Deploy Jaeger to visualize the path of a request across microservices. Essential for finding latency bottlenecks in complex AI pipelines.
- 3
SigNoz Observability Suite
[intermediate | high] An open-source alternative to Datadog that combines metrics, traces, and logs in one UI, built on top of OpenTelemetry and ClickHouse.
- 4
Honeycomb.io High-Cardinality Analysis
[advanced | high] Use Honeycomb for debugging complex systems where you need to group by user ID, request ID, or other high-cardinality fields.
- 5
Grafana Tempo
[advanced | medium] A high-scale distributed tracing backend that is deeply integrated with Grafana and Loki, allowing you to jump from logs to traces.
- 6
Uptrace
[intermediate | standard] An open-source distributed tracing tool that uses ClickHouse to store data and provides an easy-to-use UI for OTel data analysis.
- 7
HyperDX Unified Platform
[beginner | medium] An open-source tool that correlates logs and traces automatically, making it easier to find the root cause of an error in a microservice.
- 8
ClickHouse for Observability
[advanced | high] Use ClickHouse as the storage engine for logs and traces. Its columnar compression and query performance suit analytical workloads over observability data.
- 9
OpenTelemetry SDK Auto-Instrumentation
[beginner | high] Use OTel zero-code instrumentation for Java, Python, or Node.js to start collecting traces without modifying your application code.
- 10
Pyroscope Continuous Profiling
[advanced | medium] Integrate Pyroscope to profile CPU and memory usage continuously at the function level and pinpoint the code responsible for performance regressions.
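Most of the tracing entries above rely on trace context being passed from one service to the next. OpenTelemetry's default propagation uses the W3C Trace Context `traceparent` header, whose version `00` layout is `00-<trace-id>-<parent-id>-<flags>`. The sketch below builds and parses that header with the standard library only. The helper names (`new_root`, `child_of`, `to_header`, `parse_header`) are hypothetical; a real service would typically rely on the propagators in the OpenTelemetry SDK rather than hand-rolled code.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00: 32 hex chars of trace id, 16 of parent (span) id, 2 of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_root() -> SpanContext:
    """Start a new trace with random identifiers."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace id and gets a fresh span id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_header(ctx: SpanContext) -> str:
    """Serialize a context as a version-00 traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_header(value: str) -> Optional[SpanContext]:
    """Parse a traceparent value; return None if it is malformed."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero identifiers are invalid under the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    incoming = parse_header(
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    )
    outgoing = child_of(incoming) if incoming else new_root()
    print(to_header(outgoing))  # same trace id, new span id
```

Because every hop reuses the trace id and only swaps the span id, a backend such as Jaeger or Tempo can stitch spans from different services into a single request timeline.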