Building Monitoring & Observability with Open-Source Tools
This guide outlines the implementation of a production-grade observability stack focused on distributed tracing, LLM cost management, and error tracking. It moves beyond basic logging to provide a unified view of system health and AI pipeline performance using OpenTelemetry and specialized monitoring proxies.
Instrument Application with OpenTelemetry SDK
Initialize the OpenTelemetry SDK so the application emits traces and metrics. This provides the foundation for distributed tracing across services. Use the OTLP exporter to send data to a vendor-neutral collector.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Read the collector endpoint from the environment rather than hardcoding it
endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_request"):
    # Application logic here
    pass

⚠ Common Pitfalls
- High overhead from excessive span creation in tight loops
- Missing context propagation across asynchronous boundaries or message queues (see the sketch after this list)
- Hardcoding exporter endpoints instead of reading them from environment variables
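Trace context does not cross a queue or background task on its own; it has to be carried inside the message. A minimal sketch using the SDK's inject/extract helpers, assuming the tracer setup above (the queue object and message shape are hypothetical):

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def publish(queue, payload: dict) -> None:
    headers: dict = {}
    inject(headers)  # serialize the current trace context into the carrier
    queue.put({"headers": headers, "payload": payload})  # hypothetical queue API

def consume(queue) -> None:
    message = queue.get()
    ctx = extract(message["headers"])  # rebuild the producer's context
    # The consumer span joins the producer's trace instead of starting a new one
    with tracer.start_as_current_span("handle_message", context=ctx):
        ...  # application logic here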
Configure Error Tracking and Breadcrumbs
Integrate Sentry or a similar error tracking tool to capture unhandled exceptions and context-rich breadcrumbs. Ensure that OpenTelemetry trace IDs are attached to error reports to link logs with specific traces.
import sentry_sdk
from sentry_sdk.integrations.opentelemetry import SentryPropagator, SentrySpanProcessor
from opentelemetry.propagate import set_global_textmap

sentry_sdk.init(
    dsn="YOUR_SENTRY_DSN",
    traces_sample_rate=1.0,
    profiles_sample_rate=1.0,
    instrumenter="otel",  # hand span creation over to the OpenTelemetry SDK
)

# Link Sentry to OpenTelemetry so error reports carry the active trace ID
provider.add_span_processor(SentrySpanProcessor())
set_global_textmap(SentryPropagator())

⚠ Common Pitfalls
- Leaking PII (Personally Identifiable Information) in breadcrumbs or error messages (a scrubbing sketch follows this list)
- Alert fatigue caused by not grouping similar errors correctly
- Exceeding quota limits by not setting reasonable sample rates for high-traffic endpoints
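For the PII pitfall, Sentry's before_send hook can scrub events client-side before they are transmitted. A minimal sketch; the set of sensitive keys is an illustrative assumption:

import sentry_sdk

SENSITIVE_KEYS = {"authorization", "cookie", "email", "password"}  # illustrative

def scrub_event(event, hint):
    # Redact sensitive request headers and drop user context before sending
    headers = event.get("request", {}).get("headers", {})
    for key in list(headers):
        if key.lower() in SENSITIVE_KEYS:
            headers[key] = "[redacted]"
    event.pop("user", None)
    return event  # returning None would drop the event entirely

sentry_sdk.init(
    dsn="YOUR_SENTRY_DSN",
    send_default_pii=False,  # never attach user context automatically
    before_send=scrub_event,
)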
Implement LLM Observability via Proxy
Route LLM API calls through an observability proxy like Helicone or LangSmith. This allows tracking of token usage, latency, and cost with only a base-URL change, keeping instrumentation out of the application code itself.
import os

from openai import OpenAI

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.hconeai.com/v1",  # Helicone proxy in front of the OpenAI API
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_API_KEY}",
        "Helicone-Cache-Enabled": "true",  # serve repeated prompts from the proxy cache
    },
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Analyze this trace data."}],
)

⚠ Common Pitfalls
- Adding a single point of failure if the proxy service goes down (implement fallbacks; see the sketch after this list)
- Increased latency if the proxy region is geographically distant from your compute
- Inaccurate cost reporting if the proxy does not support the specific model versions used
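To address the single-point-of-failure pitfall, one option is to fall back to the upstream API directly when the proxy is unreachable. A minimal sketch, assuming the client setup above; a production version would add retries and alert on fallback use:

import os

from openai import APIConnectionError, OpenAI

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Primary client routes through the Helicone proxy
proxied = OpenAI(
    api_key=OPENAI_API_KEY,
    base_url="https://oai.hconeai.com/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
# Fallback client talks to the OpenAI API directly (no request logging, but available)
direct = OpenAI(api_key=OPENAI_API_KEY)

def chat(messages: list[dict]):
    try:
        return proxied.chat.completions.create(model="gpt-4", messages=messages)
    except APIConnectionError:
        # Proxy unreachable: trade observability for availability
        return direct.chat.completions.create(model="gpt-4", messages=messages)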
Define SLOs and Alerting Rules
Set up Prometheus alerting rules for Service Level Objectives (SLOs). Focus on the 'Golden Signals': Latency, Traffic, Errors, and Saturation. Use Alertmanager to route critical alerts to PagerDuty or Slack while silencing non-actionable noise.
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"

⚠ Common Pitfalls
- Setting alerts based on averages instead of percentiles (P95/P99); an example rule follows this list
- Creating alerts for transient network blips that self-resolve
- Failing to include runbook links in alert annotations
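Alerting on percentiles requires a histogram metric rather than a plain counter. A sketch of an additional rule for the api_alerts group above, assuming the application exports an http_request_duration_seconds histogram; adjust the metric name, threshold, and runbook URL (a placeholder here) to your SLO:

      - alert: HighP99Latency
        expr: histogram_quantile(0.99, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms on {{ $labels.instance }}"
          runbook: "https://example.com/runbooks/high-latency"  # placeholder link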
Consolidate Dashboards in Grafana
Create a centralized dashboard that correlates system metrics (CPU/RAM), application metrics (request rate/latency), and business metrics (LLM cost per user). Use variables to filter by environment and service name.
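As a sketch of the variable pattern, panel queries can reference dashboard variables such as $environment and $service; the metric names below, including the per-user LLM cost counter, are illustrative assumptions:

# Request-rate panel, scoped by the dashboard's variables
sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))

# LLM cost panel: hourly cost per user, assuming an llm_cost_usd_total counter is exported
sum by (user_id) (rate(llm_cost_usd_total{environment="$environment"}[1h]) * 3600)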
⚠ Common Pitfalls
- Overcrowding dashboards with too many graphs, making them unreadable during incidents
- Using inconsistent time zones across different data sources
- Querying too much data at once, leading to slow dashboard load times and high load on Prometheus/Datadog
What you built
By following this sequence, you establish a monitoring stack that covers both traditional infrastructure and modern AI-driven workloads. The key is to maintain the link between traces and errors while ensuring that LLM costs are tracked at the request level. Regularly review your alerting thresholds to minimize fatigue as your application scales.