Building Error tracking and alerting setup with Sentry an...
This guide provides a structured approach to implementing a monitoring and observability stack for backend systems, focusing on production reliability, LLM API tracking, and distributed tracing. It walks through actionable steps for setting up metrics, alerting, and tracing with open-source and commercial tools.
Define monitoring scope and critical metrics
Identify key performance indicators (KPIs) for your system. Focus on error rates (e.g., 5xx responses), latency thresholds (e.g., P99 < 200ms), and LLM API costs. Use tools like Prometheus to define scrape targets for services and APIs.
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['localhost:9090']

⚠ Common Pitfalls
- Over-monitoring without clear alerting criteria
- Ignoring business-critical metrics in favor of technical ones
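To make these KPIs measurable, the sketch below uses the prometheus_client library to expose a request counter labelled by status code and a latency histogram, which back the 5xx error-rate and P99 latency targets above. The metric names, the simulated handler, and the port (matching the scrape target above) are illustrative assumptions.

from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Illustrative KPI metrics: the 5xx rate comes from the counter, P99 from the histogram
REQUESTS = Counter("http_requests_total", "HTTP requests by status code", ["status"])
LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")

def handle_request():
    start = time.time()
    status = "500" if random.random() < 0.01 else "200"  # stand-in for real handler logic
    time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(status=status).inc()
    LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(9090)  # serves /metrics on the target from the scrape config above
    while True:
        handle_request()

From these two metrics, PromQL expressions such as the ratio of 5xx requests to total requests and a histogram quantile drive the alerting rules in the next step.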
Set up Prometheus for service metrics
Deploy Prometheus to collect metrics from your backend services. Configure service discovery for dynamic environments or static targets for monolithic applications. Use the Prometheus Query Language (PromQL) to create initial alerting rules.
# Alerting rules live in a rule file referenced by rule_files: in prometheus.yml
groups:
  - name: latency-alerts
    rules:
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:99percentile{job="my-service"} > 0.2

⚠ Common Pitfalls
- Incorrect scrape intervals causing data gaps
- Not securing Prometheus endpoints in production
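Before wiring an expression into an alerting rule, it helps to evaluate it against Prometheus's HTTP query API and confirm it returns the series you expect. The sketch below assumes Prometheus is reachable at localhost:9090 and reuses the expression from the rule above.

import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address
QUERY = 'job:request_latency_seconds:99percentile{job="my-service"}'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    _, value = series["value"]
    # Values consistently above 0.2 would fire HighRequestLatency once the rule is loaded
    print(series["metric"], value)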
Integrate Grafana for visualization
Add Prometheus as a data source in Grafana and create dashboards for error rates, latency, and LLM API usage. Use pre-built templates for common metrics (e.g., CPU, memory, request counts) and customize them for your service.
⚠ Common Pitfalls
- Overloading dashboards with non-actionable metrics
- Not using Grafana's built-in alerting for critical thresholds
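Data sources can also be provisioned programmatically through Grafana's HTTP API, which keeps visualization setup in version control. The sketch below registers Prometheus as the default data source; the Grafana URL, API token, and in-cluster Prometheus address are placeholders.

import requests

GRAFANA = "http://localhost:3000"  # placeholder Grafana URL
HEADERS = {"Authorization": "Bearer YOUR_GRAFANA_API_TOKEN"}  # placeholder service-account token

datasource = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",  # placeholder Prometheus address
    "access": "proxy",
    "isDefault": True,
}
resp = requests.post(f"{GRAFANA}/api/datasources", json=datasource, headers=HEADERS, timeout=10)
resp.raise_for_status()
print(resp.json())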
Implement OpenTelemetry for distributed tracing
Instrument your services with OpenTelemetry to collect traces. Configure the OpenTelemetry Collector to export traces to a backend like Jaeger or Zipkin. Add trace context propagation between microservices.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to the OpenTelemetry Collector over OTLP/gRPC
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

⚠ Common Pitfalls
- Incorrect service name configuration in traces
- Not sampling enough traces for meaningful analysis
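Once the exporter is wired up, spans are created with a tracer, and the active trace context must be injected into outgoing requests so downstream services can continue the same trace. The sketch below shows one hop; the service name, span name, and downstream URL are illustrative.

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("my-service")  # illustrative instrumentation name

with tracer.start_as_current_span("call-inference-service"):
    headers = {}
    inject(headers)  # adds the W3C traceparent header carrying the current trace context
    requests.post("http://inference:8080/generate",  # illustrative downstream endpoint
                  json={"prompt": "Hello"}, headers=headers, timeout=30)

In a real deployment, the opentelemetry-instrumentation packages for requests, Flask, or FastAPI can handle this injection and extraction automatically.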
Configure alerting with Alertmanager
Set up Alertmanager to handle alert routing and deduplication. Define notification channels (e.g., Slack, email) and configure silences for maintenance periods. Test alerting rules with synthetic traffic.
route:
  receiver: 'slack-notifications'
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'

⚠ Common Pitfalls
- Alert fatigue from overly broad rules
- Not testing alert silences during deployments
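A quick way to test routing end to end is to push a synthetic alert directly to Alertmanager's v2 API and confirm it reaches the Slack channel. The Alertmanager address and label values below are assumptions.

import requests

ALERTMANAGER = "http://localhost:9093"  # assumed Alertmanager address

synthetic_alert = [{
    "labels": {"alertname": "HighRequestLatency", "job": "my-service", "severity": "test"},
    "annotations": {"summary": "Synthetic alert to exercise the slack-notifications route"},
}]
resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=synthetic_alert, timeout=10)
resp.raise_for_status()  # a test message should arrive in the configured Slack channel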
Monitor LLM API costs and latency
Integrate LLM API metrics (e.g., token usage, request rate) using tools like Helicone or LangSmith. Create dashboards to track costs per model and detect anomalies in latency or error rates.
curl -X POST https://api.helicone.ai/v1/chat/completions \
-H 'Authorization: Bearer YOUR_API_KEY' \
-H 'Content-Type: application/json' \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'

⚠ Common Pitfalls
- Missing custom metrics for vendor-specific API limits
- Not correlating LLM errors with upstream service failures
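If you also want cost data inside your own Prometheus stack alongside a hosted proxy, token counts reported in LLM responses can be converted into counters. The sketch below is a minimal version; the per-1K-token prices are placeholders, not real vendor pricing.

from prometheus_client import Counter

TOKENS = Counter("llm_tokens_total", "LLM tokens consumed", ["model", "kind"])
COST = Counter("llm_cost_usd_total", "Estimated LLM spend in USD", ["model"])

# Placeholder prices per 1K tokens; substitute your vendor's current rates
PRICE_PER_1K = {"gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015}}

def record_usage(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    TOKENS.labels(model=model, kind="prompt").inc(prompt_tokens)
    TOKENS.labels(model=model, kind="completion").inc(completion_tokens)
    rates = PRICE_PER_1K[model]
    COST.labels(model=model).inc(
        prompt_tokens / 1000 * rates["prompt"]
        + completion_tokens / 1000 * rates["completion"]
    )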
Validate distributed tracing across AI pipelines
Verify that traces span multiple services in your AI pipeline (e.g., API gateway → model inference → database). Use trace ID correlation to debug latency bottlenecks and error propagation.
⚠ Common Pitfalls
- Traces not being exported due to network policies
- Missing context propagation between asynchronous components
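On the receiving side of each hop, the service should extract the incoming trace context, start its span as a child, and log the trace ID so a request can be correlated across the gateway, inference, and database spans. The sketch below assumes a dict of incoming HTTP headers and a hypothetical run_model function.

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("inference-service")  # illustrative instrumentation name

def handle(request_headers: dict, prompt: str) -> str:
    ctx = extract(request_headers)  # picks up the traceparent injected by the caller
    with tracer.start_as_current_span("model-inference", context=ctx) as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        print(f"trace_id={trace_id}")  # search this ID in Jaeger to see the full pipeline
        return run_model(prompt)  # hypothetical model call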
What you built
This guide establishes a foundation for monitoring production systems with actionable metrics, alerting, and distributed tracing. Continuously refine alert thresholds, expand instrumentation to new services, and validate observability pipelines under load.