Resources

100 Monitoring & Observability resources for developers

Effective monitoring in modern infrastructure requires moving beyond simple uptime checks to a unified observability strategy that handles high-cardinality data, distributed tracing for AI microservices, and precise LLM cost tracking. This guide provides a curated selection of tools and patterns to build a production-ready stack using OpenTelemetry, specialized LLM proxies, and open-source metric engines.

Core Infrastructure and Error Tracking

  1. Sentry Error Tracking (beginner · high)

    Implement the Sentry SDK to capture unhandled exceptions and performance bottlenecks. Use the 'breadcrumbs' feature to reconstruct the state leading to a crash.

  2. Prometheus Time-Series Database (intermediate · high)

    Deploy Prometheus for scraping metrics from exporters. Use it to store system-level data like CPU/RAM and custom application business metrics.
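Prometheus pulls metrics by scraping an HTTP endpoint that serves its text exposition format. As a minimal stdlib-only sketch (the metric names and port are illustrative, and a real service would use the official `prometheus_client` library instead):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_metric(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format."""
    if labels:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "\n".join([
            # Hypothetical application metrics.
            format_metric("app_requests_total", 1027, {"method": "GET", "status": "200"}),
            format_metric("app_memory_bytes", 52428800),
        ]) + "\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body.encode())

# HTTPServer(("", 9100), MetricsHandler).serve_forever()  # scrape target on :9100
```

Point a `scrape_configs` job at `:9100/metrics` and Prometheus will ingest these samples on each scrape interval.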

  3. Grafana Visualization (beginner · standard)

    Connect Prometheus and Loki data sources to build real-time dashboards. Use community templates for Node Exporter and PostgreSQL monitoring.

  4. Better Stack Uptime (beginner · standard)

    Configure external heartbeat monitoring and status pages. Integrate with Slack to receive an immediate notification if a public endpoint fails.

  5. VictoriaMetrics (advanced · high)

    A high-performance, cost-effective drop-in replacement for Prometheus. Use it for long-term storage of metrics with better compression ratios.

  6. Vector.dev Log Routing (intermediate · medium)

    Deploy Vector as a sidecar or aggregator to collect, transform, and route logs from various sources to Loki, S3, or Datadog.
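Vector itself is configured in TOML or YAML, but its collect → transform → route pipeline is easy to picture in plain Python. A sketch of the idea (the field names and sink labels are hypothetical, and this is not Vector's API):

```python
import json

def transform(raw_line):
    """Parse a JSON log line and drop a noisy field (Vector's 'remap' idea)."""
    event = json.loads(raw_line)
    event.pop("debug_payload", None)   # hypothetical noisy field
    return event

def route(event):
    """Pick a sink by severity, like Vector's 'route' transform."""
    return "s3_archive" if event.get("level") == "debug" else "loki"

line = '{"level": "error", "msg": "db timeout", "debug_payload": "..."}'
sink = route(transform(line))
```

In a real deployment the same logic lives declaratively in `vector.yaml`, so routing changes never require an application redeploy.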

  7. Grafana Loki (intermediate · standard)

    A horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It indexes metadata rather than the full log content.

  8. Checkly Synthetic Monitoring (intermediate · high)

    Write Playwright scripts to simulate user flows in production. Checkly runs these on a schedule to ensure critical paths (like checkout) work.

  9. Netdata Real-time Monitoring (beginner · medium)

    Install on individual nodes for per-second granularity monitoring. Ideal for troubleshooting immediate performance spikes on specific servers.

  10. Highlight.io Session Replay (beginner · medium)

    Open-source full-stack monitoring that includes session replay, allowing you to see exactly what the user did before an error occurred.

LLM Observability and Cost Management

  1. Helicone Proxy (beginner · high)

    Route OpenAI or Anthropic requests through Helicone to get an instant dashboard of costs, latency, and request/response logs.
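Under the hood, cost dashboards like this multiply token counts by per-model prices. A sketch of that arithmetic (the prices below are illustrative placeholders, not current vendor pricing):

```python
# Per-million-token prices in dollars; illustrative numbers only.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request, given its token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=1200, output_tokens=300)
```

Aggregating this per user or per feature flag is what turns raw request logs into an actionable cost report.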

  2. LangSmith Tracing (intermediate · high)

    Use LangChain's observability tool to debug complex chains. It visualizes the inputs and outputs of every step in your LLM pipeline.

  3. LiteLLM Proxy (intermediate · medium)

    A unified interface to call 100+ LLMs. Use its logging feature to export data to S3, Langfuse, or Mixpanel for centralized analysis.
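The core pattern behind a unified interface is a dispatcher keyed on a `provider/model` string. A stripped-down sketch with stub backends (this is the general pattern, not LiteLLM's actual API):

```python
def _openai_call(model, prompt):      # stub; a real client call would go here
    return f"[openai:{model}] {prompt}"

def _anthropic_call(model, prompt):   # stub
    return f"[anthropic:{model}] {prompt}"

PROVIDERS = {"openai": _openai_call, "anthropic": _anthropic_call}

def completion(model, prompt):
    """Route 'provider/model' strings to the matching backend."""
    provider, _, name = model.partition("/")
    return PROVIDERS[provider](name, prompt)

reply = completion("anthropic/claude-sonnet", "hello")
```

Because every backend is hidden behind one signature, logging, cost tracking, and export hooks only need to be written once.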

  4. Langfuse Open-Source Tracing (intermediate · medium)

    Self-hostable alternative for LLM application tracing. It provides specific metrics for token usage and latency across different model providers.

  5. Arize Phoenix (advanced · high)

    An open-source library for ML observability that helps evaluate LLM responses and detect embedding drift in vector databases.

  6. Portkey AI Gateway (intermediate · medium)

    Control your LLM traffic with features like fallback, retries, and load balancing while collecting detailed observability data.
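The retry-then-fallback behavior a gateway provides can be sketched generically in a few lines (this is the pattern itself, not Portkey's configuration syntax; the provider functions are stand-ins):

```python
import time

def call_with_fallback(providers, prompt, retries=2, backoff=0.1):
    """Try each provider in order; retry transient failures before falling back."""
    last_error = None
    for call in providers:
        for attempt in range(retries):
            try:
                return call(prompt)
            except Exception as exc:       # narrow the exception type in real code
                last_error = exc
                time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error

attempts = []

def flaky_primary(prompt):
    attempts.append("primary")
    raise TimeoutError("provider down")

def stable_backup(prompt):
    attempts.append("backup")
    return f"echo: {prompt}"

result = call_with_fallback([flaky_primary, stable_backup], "hi", backoff=0)
```

A gateway does the same thing out of process, which keeps the retry policy centralized instead of duplicated across every service.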

  7. Promptfoo Evaluation (intermediate · high)

    Run test cases against your prompts to measure output quality and catch regressions before deploying new prompt versions to production.
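Promptfoo is driven by a YAML config, but the underlying loop is simple: run each case through the model and assert on the output. A minimal sketch of that loop (hypothetical case format, not promptfoo's schema):

```python
import re

def run_evals(generate, cases):
    """Run each test case through `generate`; return prompts that failed."""
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        if not re.search(case["expect_regex"], output):
            failures.append(case["prompt"])
    return failures

cases = [
    {"prompt": "Reply with YES or NO: is 2+2=4?", "expect_regex": r"\bYES\b"},
]
```

Wiring a harness like this into CI is what makes prompt changes reviewable the same way code changes are.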

  8. Weights & Biases (W&B) Prompts (advanced · medium)

    Visualize the execution flow of your LLM programs and track model performance over time with their specialized prompt logging.

  9. Lunary LLM Stack (beginner · standard)

    Open-source observability for LLMs including cost tracking, user analytics, and prompt versioning in a single dashboard.

  10. Parea AI (intermediate · medium)

    A developer platform to debug, test, and monitor LLM apps. It allows you to run evaluations on production data to find edge case failures.

Distributed Tracing and OpenTelemetry

  1. OpenTelemetry Collector (advanced · high)

    Set up a central collector to receive OTLP data and export it to multiple backends simultaneously (e.g., Jaeger and Datadog).
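A minimal Collector config for this fan-out looks roughly like the fragment below; the Jaeger endpoint and the `DD_API_KEY` environment variable are placeholders to adapt to your deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317        # placeholder endpoint
    tls:
      insecure: true
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, datadog]
```

Listing several exporters in one pipeline is all it takes to dual-write, which makes backend migrations reversible.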

  2. Jaeger Distributed Tracing (intermediate · medium)

    Deploy Jaeger to visualize the path of a request across microservices. Essential for finding latency bottlenecks in complex AI pipelines.

  3. SigNoz Observability Suite (intermediate · high)

    An open-source alternative to Datadog that combines metrics, traces, and logs in one UI, built on top of OpenTelemetry and ClickHouse.

  4. Honeycomb.io High-Cardinality Analysis (advanced · high)

    Use Honeycomb for debugging complex systems where you need to group by user ID, request ID, or other high-cardinality fields.
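The essence of high-cardinality analysis is a group-by over a field with millions of distinct values. A toy stdlib version of the query Honeycomb runs at scale (the event shape here is hypothetical):

```python
from collections import Counter

def slowest_groups(events, field, top=3):
    """Average latency per value of a high-cardinality field (e.g. user_id)."""
    totals, counts = Counter(), Counter()
    for e in events:
        key = e[field]
        totals[key] += e["duration_ms"]
        counts[key] += 1
    averages = {k: totals[k] / counts[k] for k in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top]

events = [
    {"user_id": "u1", "duration_ms": 40},
    {"user_id": "u2", "duration_ms": 900},
    {"user_id": "u1", "duration_ms": 60},
]
worst = slowest_groups(events, "user_id")
```

Pre-aggregated metrics cannot answer this kind of question, because the label set would explode; that is the gap event-based tools fill.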

  5. Grafana Tempo (advanced · medium)

    A high-scale distributed tracing backend that is deeply integrated with Grafana and Loki, allowing you to jump from logs to traces.

  6. Uptrace (intermediate · standard)

    An open-source distributed tracing tool that uses ClickHouse to store data and provides an easy-to-use UI for OTel data analysis.

  7. HyperDX Unified Platform (beginner · medium)

    An open-source tool that correlates logs and traces automatically, making it easier to find the root cause of an error in a microservice.

  8. ClickHouse for Observability (advanced · high)

    Utilize ClickHouse as the storage engine for logs and traces due to its superior columnar compression and query performance for analytical data.

  9. OpenTelemetry SDK Auto-Instrumentation (beginner · high)

    Use OTel zero-code instrumentation for Java, Python, or Node.js to start collecting traces without modifying your application code.

  10. Pyroscope Continuous Profiling (advanced · medium)

    Integrate Pyroscope to continuously monitor CPU and memory usage at the function level, identifying lines of code causing performance regressions.
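Continuous profilers sample stacks in production around the clock; for a one-shot approximation of the same function-level attribution, the stdlib's `cProfile` is enough (the workload function here is a deliberately CPU-heavy stand-in):

```python
import cProfile
import pstats

def hot_function():
    # Deliberately CPU-heavy so it shows up in the profile.
    return sum(i * i for i in range(200_000))

def profile_functions(fn):
    """Run `fn` under cProfile; map function name -> cumulative seconds."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats keys are (filename, lineno, funcname) tuples;
    # cumulative time sits at index 3 of each value tuple.
    return {key[2]: value[3] for key, value in stats.stats.items()}

timings = profile_functions(hot_function)
```

Pyroscope's advantage over this snapshot approach is low-overhead sampling that runs continuously, so you can diff profiles across deploys instead of reproducing a regression locally.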