Closing the Token Traceability Gap: Architecting a Self-Hosted LLM Observability Pipeline
Learn to build a privacy-first LLM observability pipeline using OpenTelemetry and ClickHouse. Monitor tokens, costs, and latency without third-party SaaS lock-in.
The deployment of Large Language Models (LLMs) into production environments has moved at breakneck speed. Yet, for many CTOs and engineering leads, a distinct feeling of unease settles in shortly after the "Hello World" phase. While the AI is generating value, it is also operating as a black box. We call this the Token Traceability Gap.
Unlike traditional microservices, where CPU and memory spikes tell the story, LLMs introduce new, opaque metrics: prompt token usage, completion token costs, hallucination rates, and non-deterministic latency. Relying on third-party observability SaaS platforms often leads to ballooning costs or, worse, sending sensitive prompt data to yet another external vendor.
At Nohatek, we believe in owning your data infrastructure. In this deep dive, we outline how to architect a robust, self-hosted observability pipeline using OpenTelemetry (OTel) for standardization and ClickHouse for high-performance analytical storage. This approach offers granular visibility, data sovereignty, and significant cost savings.
The New Metrics of AI: Why Traditional APM Fails
If you attempt to monitor a Retrieval-Augmented Generation (RAG) pipeline using standard Application Performance Monitoring (APM) tools, you will see healthy-looking HTTP spans while missing the signals that actually matter. Traditional APM is built for deterministic systems; LLMs are probabilistic.
To truly understand your AI infrastructure, you need to capture specific telemetry that standard HTTP tracing ignores:
- Token Economics: Tracking the exact count of input vs. output tokens per request to calculate real-time cost attribution.
- Chain-of-Thought Latency: Identifying which step in a LangChain or LlamaIndex sequence is causing the bottleneck: the vector database retrieval or the LLM generation?
- Model Drift & Quality: Capturing the actual inputs and outputs to score them later for relevance or toxicity.
"You cannot optimize what you cannot measure. In the world of LLMs, a query that takes 5 seconds and costs $0.01 looks identical to a query that takes 5 seconds and costs $0.10 in standard APM logs. That difference is your margin."
The solution requires a system that treats logs, traces, and metrics as first-class citizens but allows for the high-cardinality analysis required to debug complex AI interactions.
The Architecture: OpenTelemetry meets ClickHouse
Why this specific combination? The answer lies in standardization and compression.
OpenTelemetry (OTel) has emerged as the de facto industry standard for observability. OTel has also published semantic conventions specifically for GenAI, allowing developers to instrument LLM calls uniformly regardless of whether they are using OpenAI, Anthropic, or a self-hosted Llama 3 model. By using the OTel SDK, we avoid vendor lock-in at the code level.
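To make those conventions concrete, here is a minimal sketch of manually attaching gen_ai.* attributes to a span with the OTel Python SDK. The span name, model, and token counts are placeholders, and the ConsoleSpanExporter merely stands in for the OTLP exporter you would point at your collector; the auto-instrumentation shown later populates the same attributes automatically.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a basic tracer provider; swap ConsoleSpanExporter for an OTLP
# exporter pointing at your collector in a real deployment.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.manual-instrumentation")

def call_llm(prompt: str) -> str:
    # Span name follows the "{operation} {model}" pattern from the GenAI conventions
    with tracer.start_as_current_span("chat gpt-4o") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        # ... the actual provider call would go here ...
        completion_text = "..."                 # placeholder response
        input_tokens, output_tokens = 812, 153  # placeholder usage numbers
        # These usage attributes are what the cost queries later in this post aggregate
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        return completion_text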
ClickHouse is the muscle behind the operation. LLM traces can be incredibly verbose, especially if you are logging full prompts and completions for debugging. Storing this data in a traditional row-based database (like PostgreSQL) or an expensive log aggregator (like Splunk or Datadog) becomes cost-prohibitive at scale.
ClickHouse is a columnar OLAP database designed for immutable data streams. It offers:
- Insane Compression: often achieving 10x compression ratios on log data, drastically reducing storage costs.
- Blazing Fast Aggregations: allowing you to query "Average token cost per user for the last month" across millions of rows in milliseconds.
The Pipeline Flow:
- Application Layer: Your Python/Node.js service uses the OTel SDK to capture traces.
- OTel Collector: A lightweight binary that receives traces, batches them, and processes them, for example redacting PII before it is stored (a minimal configuration is sketched after this list).
- ClickHouse Exporter: The collector pushes the processed batches into ClickHouse tables.
- Visualization: Grafana connects to ClickHouse to visualize the data.
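To make the middle of that flow concrete, here is a minimal sketch of a collector configuration using the contrib distribution's ClickHouse exporter. The endpoint, database name, retention value, and the redacted attribute key are illustrative, and option names can shift between collector releases, so verify them against the version you deploy.

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s
  attributes/scrub_pii:
    actions:
      # Illustrative: drop raw prompt payloads before they are stored
      - key: gen_ai.prompt
        action: delete

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel
    ttl: 720h   # retention window; the exact key name may differ by exporter version

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub_pii, batch]
      exporters: [clickhouse]

On the visualization side, Grafana's ClickHouse data source plugin can query the resulting tables directly, so no additional middleware is required.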
Implementation Strategy: From Code to Dashboard
Let’s look at how this materializes in a real-world scenario. We will focus on a Python application using the OpenAI library.
Step 1: Instrumentation
Instead of writing custom logging wrappers, we utilize the OpenTelemetry auto-instrumentation libraries. This captures the HTTP calls to the LLM provider automatically, populating attributes like gen_ai.usage.input_tokens.
from opentelemetry import trace
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# Initialize an OTel tracer from the globally configured provider
tracer = trace.get_tracer(__name__)

# Auto-instrument the OpenAI client so every completion call emits a span
# carrying gen_ai.* attributes (model, token usage, latency)
OpenAIInstrumentor().instrument()

Step 2: The ClickHouse Schema
To store these traces effectively, we utilize ClickHouse's MergeTree engine. A simplified schema for storing spans might look like this:
CREATE TABLE otel_traces (
    Timestamp DateTime64(9),                 -- span start time, nanosecond precision
    TraceId String,
    SpanId String,
    ParentSpanId String,
    ServiceName String,
    SpanName String,
    Duration Int64,                          -- span duration (OTel exporters typically record nanoseconds)
    StatusCode String,
    ResourceAttributes Map(String, String),  -- service-level metadata (deployment, host, etc.)
    SpanAttributes Map(String, String)       -- per-span attributes, including the gen_ai.* token counts
)
ENGINE = MergeTree()
ORDER BY (ServiceName, Timestamp);

Step 3: The Visualization
With data flowing into ClickHouse, you can use Grafana to build queries that answer critical business questions. For example, to calculate the total cost of tokens (assuming a hypothetical cost per 1k tokens), you can write SQL directly against your traces:
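The sketch below assumes the simplified otel_traces schema above and the gen_ai.* usage attributes; the per-1k-token prices are placeholders, so substitute your provider's real rates and whichever attribute keys your instrumentation actually emits.

-- Estimated daily spend per model; prices per 1k tokens are illustrative
SELECT
    toDate(Timestamp) AS day,
    SpanAttributes['gen_ai.request.model'] AS model,
    sum(toUInt64OrZero(SpanAttributes['gen_ai.usage.input_tokens']))  AS input_tokens,
    sum(toUInt64OrZero(SpanAttributes['gen_ai.usage.output_tokens'])) AS output_tokens,
    round((input_tokens / 1000) * 0.01 + (output_tokens / 1000) * 0.03, 4) AS estimated_cost_usd
FROM otel_traces
WHERE mapContains(SpanAttributes, 'gen_ai.usage.input_tokens')
GROUP BY day, model
ORDER BY day DESC, estimated_cost_usd DESC;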
This setup provides a Self-Hosted LLM Control Plane. You can see exactly how a change in your system prompt affects latency, or how a switch from GPT-4 to GPT-3.5 impacts the length of user conversations.
Most importantly, the data stays with you. For industries like healthcare or finance, where sending full prompt logs to a SaaS observability tool is a compliance violation, this architecture is not just a preference—it is a requirement.
The "Token Traceability Gap" is a temporary problem solved by mature engineering. By architecting a pipeline with OpenTelemetry and ClickHouse, you move from guessing about your AI's performance to engineering it with precision.
This architecture delivers the trifecta of modern infrastructure: Visibility, Sovereignty, and Cost Efficiency. You aren't just monitoring an API; you are building an asset that understands your business logic.
Need help architecting your AI infrastructure? At Nohatek, we specialize in building high-performance, secure cloud environments for the next generation of intelligent applications. Whether you need a custom observability pipeline or a full-scale cloud migration, our team is ready to assist.