Building Monitoring & Observability with Open-Source Tools
This guide outlines the implementation of a production-grade observability stack focused on distributed tracing, LLM cost management, and error tracking. It moves beyond basic logging to provide a unified view of system health and AI pipeline performance using OpenTelemetry and specialized monitoring proxies.
Instrument Application with OpenTelemetry SDK
Initialize the OpenTelemetry SDK so the application emits traces and metrics. This provides the foundation for distributed tracing across services. Use the OTLP exporter to send data to a vendor-neutral collector.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Read the collector endpoint from the environment rather than hardcoding it
endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_request"):
    # Application logic here
    pass

⚠ Common Pitfalls
- High overhead from excessive span creation in tight loops
- Missing context propagation across asynchronous boundaries or message queues (see the sketch after this list)
- Hardcoding exporter endpoints instead of reading them from environment variables
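Trace context does not cross a queue or background task on its own; it has to be carried inside the message. A minimal sketch using the SDK's inject/extract helpers, assuming the tracer setup above (the queue object and message shape are hypothetical):

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def publish(queue, payload: dict) -> None:
    headers: dict = {}
    inject(headers)  # serialize the current trace context into the carrier
    queue.put({"headers": headers, "payload": payload})  # hypothetical queue API

def consume(queue) -> None:
    message = queue.get()
    ctx = extract(message["headers"])  # rebuild the producer's context
    # The consumer span joins the producer's trace instead of starting a new one
    with tracer.start_as_current_span("handle_message", context=ctx):
        ...  # application logic here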
Configure Error Tracking and Breadcrumbs
Integrate Sentry or a similar error tracking tool to capture unhandled exceptions and context-rich breadcrumbs. Ensure that OpenTelemetry trace IDs are attached to error reports to link logs with specific traces.
import sentry_sdk
from sentry_sdk.integrations.opentelemetry import SentryPropagator, SentrySpanProcessor
from opentelemetry.propagate import set_global_textmap

sentry_sdk.init(
    dsn="YOUR_SENTRY_DSN",
    traces_sample_rate=1.0,
    profiles_sample_rate=1.0,
    instrumenter="otel",  # hand span creation over to the OpenTelemetry SDK
)

# Link Sentry to OpenTelemetry so error reports carry the active trace ID
provider.add_span_processor(SentrySpanProcessor())
set_global_textmap(SentryPropagator())

⚠ Common Pitfalls
- Leaking PII (Personally Identifiable Information) in breadcrumbs or error messages (a scrubbing sketch follows this list)
- Alert fatigue caused by not grouping similar errors correctly
- Exceeding quota limits by not setting reasonable sample rates for high-traffic endpoints
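For the PII pitfall, Sentry's before_send hook can scrub events client-side before they are transmitted. A minimal sketch; the set of sensitive keys is an illustrative assumption:

import sentry_sdk

SENSITIVE_KEYS = {"authorization", "cookie", "email", "password"}  # illustrative

def scrub_event(event, hint):
    # Redact sensitive request headers and drop user context before sending
    headers = event.get("request", {}).get("headers", {})
    for key in list(headers):
        if key.lower() in SENSITIVE_KEYS:
            headers[key] = "[redacted]"
    event.pop("user", None)
    return event  # returning None would drop the event entirely

sentry_sdk.init(
    dsn="YOUR_SENTRY_DSN",
    send_default_pii=False,  # never attach user context automatically
    before_send=scrub_event,
)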
Implement LLM Observability via Proxy
Route LLM API calls through an observability proxy like Helicone or LangSmith. This allows tracking of token usage, latency, and cost with only a base-URL change, keeping instrumentation out of the application code itself.
import os

from openai import OpenAI

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.hconeai.com/v1",  # Helicone proxy in front of the OpenAI API
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_API_KEY}",
        "Helicone-Cache-Enabled": "true",  # serve repeated prompts from the proxy cache
    },
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Analyze this trace data."}],
)

⚠ Common Pitfalls
- Adding a single point of failure if the proxy service goes down (implement fallbacks; see the sketch after this list)
- Increased latency if the proxy region is geographically distant from your compute
- Inaccurate cost reporting if the proxy does not support the specific model versions used
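To address the single-point-of-failure pitfall, one option is to fall back to the upstream API directly when the proxy is unreachable. A minimal sketch, assuming the client setup above; a production version would add retries and alert on fallback use:

import os

from openai import APIConnectionError, OpenAI

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Primary client routes through the Helicone proxy
proxied = OpenAI(
    api_key=OPENAI_API_KEY,
    base_url="https://oai.hconeai.com/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
# Fallback client talks to the OpenAI API directly (no request logging, but available)
direct = OpenAI(api_key=OPENAI_API_KEY)

def chat(messages: list[dict]):
    try:
        return proxied.chat.completions.create(model="gpt-4", messages=messages)
    except APIConnectionError:
        # Proxy unreachable: trade observability for availability
        return direct.chat.completions.create(model="gpt-4", messages=messages)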
Define SLOs and Alerting Rules
Set up Prometheus alerting rules for Service Level Objectives (SLOs). Focus on the 'Golden Signals': Latency, Traffic, Errors, and Saturation. Use Alertmanager to route critical alerts to PagerDuty or Slack while silencing non-actionable noise.
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"

⚠ Common Pitfalls
- Setting alerts based on averages instead of percentiles (P95/P99); an example rule follows this list
- Creating alerts for transient network blips that self-resolve
- Failing to include runbook links in alert annotations
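Alerting on percentiles requires a histogram metric rather than a plain counter. A sketch of an additional rule for the api_alerts group above, assuming the application exports an http_request_duration_seconds histogram; adjust the metric name, threshold, and runbook URL (a placeholder here) to your SLO:

      - alert: HighP99Latency
        expr: histogram_quantile(0.99, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms on {{ $labels.instance }}"
          runbook: "https://example.com/runbooks/high-latency"  # placeholder link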
Consolidate Dashboards in Grafana
Create a centralized dashboard that correlates system metrics (CPU/RAM), application metrics (request rate/latency), and business metrics (LLM cost per user). Use variables to filter by environment and service name.
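As a sketch of the variable pattern, panel queries can reference dashboard variables such as $environment and $service; the metric names below, including the per-user LLM cost counter, are illustrative assumptions:

# Request-rate panel, scoped by the dashboard's variables
sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))

# LLM cost panel: hourly cost per user, assuming an llm_cost_usd_total counter is exported
sum by (user_id) (rate(llm_cost_usd_total{environment="$environment"}[1h]) * 3600)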
⚠ Common Pitfalls
- Overcrowding dashboards with too many graphs, making them unreadable during incidents
- Using inconsistent time zones across different data sources
- Querying too much data at once, leading to slow dashboard load times and high load on Prometheus/Datadog
What you built
By following this sequence, you establish a monitoring stack that covers both traditional infrastructure and modern AI-driven workloads. The key is to maintain the link between traces and errors while ensuring that LLM costs are tracked at the request level. Regularly review your alerting thresholds to minimize fatigue as your application scales.