AI Agent Observability with OpenTelemetry: Production Monitoring for n8n and OpenClaw Workflows
By April 2026, AI agents have moved from experimental prototypes to mission-critical production systems. Organizations running n8n workflows and OpenClaw agents face a new challenge: understanding what their AI systems are doing in real-time. The Cisco Talos April 2026 report highlighted that 73% of organizations lack sufficient visibility into their AI agent operations, leading to undetected failures, runaway costs, and compliance violations.
This comprehensive guide delivers everything you need to implement enterprise-grade observability for your AI agents. From OpenTelemetry fundamentals to production-ready monitoring stacks, you'll learn battle-tested patterns for tracing LLM calls, monitoring workflow execution, and building self-hosted observability infrastructure that scales with your automation needs.
The Observability Imperative: Why AI Agents Need Specialized Monitoring
The Unique Challenges of AI Agent Observability
Traditional application monitoring falls short when applied to AI agents. The non-deterministic nature of LLM responses, the complexity of multi-step reasoning, and the integration of external tools create monitoring requirements that demand specialized approaches:
Non-Deterministic Behavior:
- Same input can produce different outputs across invocations
- Token consumption varies unpredictably based on context
- Response quality requires subjective evaluation
- Hallucinations and errors manifest subtly
Multi-Modal Complexity:
- Agents process text, images, audio, and structured data
- Each modality has different latency and cost characteristics
- Cross-modal dependencies create tracing complexity
- State management spans multiple interaction turns
Tool Integration Uncertainty:
- External API calls introduce failure points
- Tool selection logic affects outcomes
- Rate limiting and quotas impact reliability
- Tool response quality varies significantly
Reasoning Transparency:
- Chain-of-thought reasoning needs capture
- Decision pathways require documentation
- Confidence scoring affects trust
- Audit trails must capture intent
The Cost of Observability Gaps
Organizations without proper AI agent observability face measurable consequences:
Operational Impact:
- Average time to detect agent failures: 4.2 hours (vs. 8 minutes with proper monitoring)
- Cost of undetected hallucinations: $12,000-$50,000 per incident
- Recovery time from production issues: 6-18 hours without tracing
- False positive alert rate: 78% without LLM-specific metrics
Financial Impact:
- Runaway token consumption costs averaging $8,500/month
- Unoptimized workflows waste 40-60% of AI API budgets
- Downtime costs for AI-dependent processes: $2,500-$15,000/hour
- Compliance fines for inadequate audit trails: $100,000+
Strategic Impact:
- 67% of organizations delay AI agent deployment due to visibility concerns
- Customer trust erosion from unexplained AI decisions
- Competitive disadvantage from slower iteration cycles
- Technical debt accumulation from opaque systems
OpenTelemetry Fundamentals for AI Agents
Understanding the OpenTelemetry Architecture
OpenTelemetry provides a vendor-neutral framework for telemetry collection. For AI agents, it offers standardized instrumentation across the entire stack:
Core Components:
┌─────────────────────────────────────────────────────────────┐
│                    AI Agent Application                     │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │    Traces    │   │   Metrics    │   │     Logs     │     │
│  │   (Spans)    │   │  (Counters)  │   │   (Events)   │     │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘     │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            │                                │
│                 ┌──────────┴──────────┐                     │
│                 │    OpenTelemetry    │                     │
│                 │         SDK         │                     │
│                 │  (Instrumentation)  │                     │
│                 └──────────┬──────────┘                     │
│                            │                                │
│                 ┌──────────┴──────────┐                     │
│                 │    OpenTelemetry    │                     │
│                 │      Collector      │                     │
│                 │ (Processing/Export) │                     │
│                 └──────────┬──────────┘                     │
└────────────────────────────┼────────────────────────────────┘
                             │
             ┌───────────────┼───────────────┐
             │               │               │
       ┌─────▼─────┐   ┌─────▼─────┐   ┌─────▼─────┐
       │  Jaeger   │   │ Prometheus│   │   Loki    │
       │ (Traces)  │   │ (Metrics) │   │  (Logs)   │
       └───────────┘   └───────────┘   └───────────┘
Key Concepts:
Traces: Represent end-to-end request flows through your system. Each trace consists of spans representing individual operations. For AI agents, traces capture the complete lifecycle from user input to final response.
Spans: The building blocks of traces. Each span has:
- Operation name and timestamp
- Parent-child relationships
- Attributes (key-value metadata)
- Events (timed log entries)
- Status (success/error)
Metrics: Numerical measurements over time:
- Counters: Cumulative values (total tokens used)
- Gauges: Point-in-time values (active agents)
- Histograms: Distribution of values (response latency)
Logs: Structured event records correlated with traces via trace IDs.
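To make the trace and span vocabulary concrete, here is a minimal sketch of the data a span carries, written with plain JavaScript objects rather than the OpenTelemetry SDK. The helper names (newSpan, endSpan) and the field layout are illustrative assumptions, not the SDK's internal representation; only the ID sizes (16-byte trace ID, 8-byte span ID) follow the W3C Trace Context format.

```javascript
const crypto = require('node:crypto');

// Illustrative helper (not an SDK API): create a span record.
// A child span inherits its parent's traceId and records the parent's spanId.
function newSpan(name, parent = null, attributes = {}) {
  return {
    traceId: parent ? parent.traceId : crypto.randomBytes(16).toString('hex'),
    spanId: crypto.randomBytes(8).toString('hex'),
    parentSpanId: parent ? parent.spanId : null,
    name,
    startTime: Date.now(),
    endTime: null,
    attributes: { ...attributes }, // key-value metadata
    events: [],                    // timed log entries
    status: 'UNSET',
  };
}

// Illustrative helper: close a span and record its final status.
function endSpan(span, status = 'OK') {
  span.status = status;
  span.endTime = Date.now();
  return span;
}

// One trace with two spans: the agent request and the LLM call inside it.
const root = newSpan('agent.request', null, { 'agent.id': 'customer-support-agent-01' });
const llmCall = newSpan('llm.chat.completion', root, { 'llm.model.id': 'gpt-4o' });
llmCall.events.push({ name: 'llm.response.received', time: Date.now() });
endSpan(llmCall);
endSpan(root);
```

The shared traceId is what lets a backend reassemble both spans into a single trace, and parentSpanId is what turns the flat list into a tree.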
Semantic Conventions for LLM Observability
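The OpenTelemetry Semantic Conventions for Generative AI provide standardized attribute names for LLM operations. The official conventions use the gen_ai.* namespace (for example, gen_ai.request.model and gen_ai.usage.input_tokens); the llm.* and agent.* names below are a simplified scheme used throughout this guide. Map them to the official names if you need interoperability with other tooling, and check the specification for the current stability status of each attribute: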
LLM Request Attributes:
# Model identification
llm.model.id: "gpt-4o"
llm.model.provider: "openai"
llm.model.version: "2024-08-06"
# Request parameters
llm.request.temperature: 0.7
llm.request.max_tokens: 4096
llm.request.top_p: 1.0
llm.request.frequency_penalty: 0.0
llm.request.presence_penalty: 0.0
# Token usage
llm.usage.input_tokens: 1250
llm.usage.output_tokens: 890
llm.usage.total_tokens: 2140
# Cost tracking (custom extension)
llm.cost.input: 0.00375
llm.cost.output: 0.01335
llm.cost.total: 0.0171
llm.cost.currency: "USD"
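The cost attributes above are derived values, so something has to compute them from the token counts. A minimal sketch follows; the per-1K-token prices are assumptions chosen because they reproduce the sample values above, and estimateCost is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical per-1K-token prices in USD; substitute your provider's current rates.
const PRICING = { inputPer1K: 0.003, outputPer1K: 0.015 };

// Round to 6 decimals and return a number (not a string) so backends can aggregate it.
function round6(x) {
  return Number(x.toFixed(6));
}

// Build the llm.cost.* attributes from token counts.
function estimateCost(usage, pricing = PRICING) {
  const input = round6((usage.inputTokens * pricing.inputPer1K) / 1000);
  const output = round6((usage.outputTokens * pricing.outputPer1K) / 1000);
  return {
    'llm.cost.input': input,
    'llm.cost.output': output,
    'llm.cost.total': round6(input + output),
    'llm.cost.currency': 'USD',
  };
}

const cost = estimateCost({ inputTokens: 1250, outputTokens: 890 });
```

Keeping the price table in one place means a provider price change is a one-line update rather than a search through every instrumented node.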
LLM Response Attributes:
llm.response.finish_reason: "stop"
llm.response.id: "chatcmpl-abc123"
llm.response.timestamp: "2026-04-23T09:46:00Z"
# Quality metrics (custom)
llm.quality.latency_ms: 2450
llm.quality.tokens_per_second: 363
llm.quality.hallucination_score: 0.02
llm.quality.confidence: 0.94
Agent-Specific Attributes:
# Agent identification
agent.id: "customer-support-agent-01"
agent.name: "Support Assistant"
agent.version: "2.3.1"
agent.framework: "n8n"
# Tool execution
agent.tool.name: "database_query"
agent.tool.invocation_id: "tool_abc123"
agent.tool.duration_ms: 450
agent.tool.success: true
# Reasoning
agent.reasoning.steps: 5
agent.reasoning.chain_of_thought: "User asked about..."
agent.decision.confidence: 0.92
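OpenTelemetry attribute values must be strings, numbers, booleans, or arrays of one of those types, so nested objects and null values need handling before they reach setAttributes. The sketch below shows one way to do that; sanitizeAttributes and the 1024-character cap are assumptions for illustration, not an SDK feature.

```javascript
const MAX_STRING_LENGTH = 1024; // arbitrary cap chosen for this sketch

const PRIMITIVES = ['string', 'number', 'boolean'];

// Hypothetical helper: coerce arbitrary values into valid attribute values.
function sanitizeAttributes(raw) {
  const out = {};
  for (const [key, value] of Object.entries(raw)) {
    if (value === null || value === undefined) continue; // drop missing values
    if (typeof value === 'number') {
      if (Number.isFinite(value)) out[key] = value;      // drop NaN and Infinity
    } else if (typeof value === 'boolean') {
      out[key] = value;
    } else if (typeof value === 'string') {
      out[key] = value.slice(0, MAX_STRING_LENGTH);
    } else if (
      Array.isArray(value) &&
      value.length > 0 &&
      PRIMITIVES.includes(typeof value[0]) &&
      value.every((v) => typeof v === typeof value[0])
    ) {
      out[key] = value;                                  // homogeneous primitive array
    } else {
      // Objects and mixed arrays are stored as truncated JSON text
      const text = JSON.stringify(value);
      if (typeof text === 'string') out[key] = text.slice(0, MAX_STRING_LENGTH);
    }
  }
  return out;
}

const attrs = sanitizeAttributes({
  'agent.tool.name': 'database_query',
  'agent.tool.params': { table: 'orders' },
  'agent.tool.duration_ms': 450,
  'agent.decision.confidence': NaN,
  'agent.reasoning.chain_of_thought': null,
  'agent.tool.tags': ['sql', 'read'],
});
```

Truncation matters most for fields such as agent.reasoning.chain_of_thought, which can grow large and may contain user data you do not want stored in full.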
Implementing OpenTelemetry in n8n Workflows
Setting Up the OpenTelemetry Integration
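n8n workflows can be instrumented with OpenTelemetry through custom code nodes and webhook middleware. On self-hosted n8n, Code nodes can only require external packages that are installed alongside n8n and allow-listed through the NODE_FUNCTION_ALLOW_EXTERNAL environment variable, so add the @opentelemetry/* packages (and openai) there first. Here's a reference implementation: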
Step 1: Configure Environment Variables
# .env file
OTEL_SERVICE_NAME=tropical-n8n-workflows
OTEL_SERVICE_VERSION=1.0.0
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,team=ai-automation
# n8n-specific
N8N_OTEL_ENABLED=true
N8N_OTEL_SAMPLER_TYPE=traceidratio
N8N_OTEL_SAMPLER_RATIO=0.1
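A traceidratio sampler with ratio 0.1 keeps roughly one trace in ten, and the decision is derived from the trace ID itself rather than from a random draw. The sketch below illustrates that idea with a simplified rule; it is not the SDK's exact algorithm, and shouldSample is a hypothetical name.

```javascript
// Simplified ratio sampler: derive the decision from the trace ID so that every
// service that sees the same trace makes the same choice.
function shouldSample(traceId, ratio) {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error('traceId must be 32 lowercase hex characters');
  }
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex characters as a number in [0, 2^32)
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

const keep = shouldSample('0'.repeat(32), 0.1);
const drop = shouldSample('f'.repeat(32), 0.1);
```

Because the decision is deterministic per trace ID, a trace is either recorded in full or not at all, which avoids traces with missing spans in the middle.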
Step 2: Create the OpenTelemetry Wrapper Node
// OpenTelemetry Tracing Code Node
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

// Initialize the SDK once per n8n process, not once per execution:
// starting a second SDK would register duplicate exporters.
if (!global.otelSdk) {
  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: 'n8n-ai-workflow',
      [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
      [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
    }),
    traceExporter: new OTLPTraceExporter({
      url: 'http://otel-collector:4317',
    }),
    // NodeSDK takes a metric reader that wraps the exporter, not a bare exporter
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({
        url: 'http://otel-collector:4317',
      }),
    }),
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();
  global.otelSdk = sdk;
}

// Signal to subsequent nodes that telemetry is ready
return [{ json: { sdkInitialized: true, traceId: '' } }];
Step 3: LLM Node Instrumentation
// Instrumented OpenAI Call Node
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const { OpenAI } = require('openai');

const tracer = trace.getTracer('n8n-llm', '1.0.0');
const openai = new OpenAI({ apiKey: $env.OPENAI_API_KEY });

// Build the parent Context from the span handed over by the previous node.
// startSpan() takes a Context as its third argument, and a span only counts
// as a parent if it is still a live span object (not a serialized copy).
const parentSpan = items[0]?.json?.__otelSpan;
const parentContext = typeof parentSpan?.spanContext === 'function'
  ? trace.setSpan(context.active(), parentSpan)
  : context.active();

const span = tracer.startSpan('llm.chat.completion', {
  attributes: {
    'llm.model.id': 'gpt-4o',
    'llm.model.provider': 'openai',
    'llm.request.temperature': 0.7,
    'llm.request.max_tokens': 4096,
    'n8n.workflow.id': $workflow.id,
    'n8n.execution.id': $execution.id,
  },
}, parentContext);

const startTime = Date.now();
try {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: items[0].json.userMessage },
    ],
    temperature: 0.7,
    max_tokens: 4096,
  });
  const latency = Date.now() - startTime;
  const usage = response.usage;

  // Set semantic attributes; values stay numeric so backends can aggregate them
  span.setAttributes({
    'llm.usage.input_tokens': usage.prompt_tokens,
    'llm.usage.output_tokens': usage.completion_tokens,
    'llm.usage.total_tokens': usage.total_tokens,
    'llm.response.finish_reason': response.choices[0].finish_reason,
    'llm.quality.latency_ms': latency,
    'llm.quality.tokens_per_second': Number((usage.completion_tokens / latency * 1000).toFixed(2)),
    'llm.cost.input': Number((usage.prompt_tokens * 0.0025 / 1000).toFixed(6)),
    'llm.cost.output': Number((usage.completion_tokens * 0.01 / 1000).toFixed(6)),
  });
  span.setStatus({ code: SpanStatusCode.OK });

  // Add event for completion
  span.addEvent('llm.response.received', {
    'llm.response.id': response.id,
    'llm.response.model': response.model,
  });

  return [{
    json: {
      response: response.choices[0].message.content,
      usage: usage,
      latency: latency,
      // Pass the parent span on; this node's own span is ended below
      __otelSpan: parentSpan,
    },
  }];
} catch (error) {
  span.recordException(error);
  span.setStatus({
    code: SpanStatusCode.ERROR,
    message: error.message,
  });
  throw error;
} finally {
  span.end();
}
Step 4: Tool Execution Instrumentation
// Instrumented Tool Execution Node
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('n8n-tools', '1.0.0');

async function executeWithTracing(toolName, toolParams, parentSpan) {
  // startSpan() takes a Context as its third argument, so wrap the parent span
  const parentContext = typeof parentSpan?.spanContext === 'function'
    ? trace.setSpan(context.active(), parentSpan)
    : context.active();

  const span = tracer.startSpan(`agent.tool.${toolName}`, {
    attributes: {
      'agent.tool.name': toolName,
      'agent.tool.params': JSON.stringify(toolParams),
      'n8n.node.name': $node.name,
      'n8n.workflow.name': $workflow.name,
    },
  }, parentContext);

  const startTime = Date.now();
  try {
    // Execute the actual tool (executeTool is a placeholder for your own dispatcher)
    const result = await executeTool(toolName, toolParams);
    const duration = Date.now() - startTime;
    span.setAttributes({
      'agent.tool.duration_ms': duration,
      'agent.tool.success': true,
      'agent.tool.result_size': JSON.stringify(result).length,
    });
    span.addEvent('agent.tool.completed', {
      'agent.tool.result_summary': JSON.stringify(result).substring(0, 500),
    });
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.setAttributes({
      'agent.tool.success': false,
      'agent.tool.error': error.message,
      'agent.tool.error_type': error.name,
    });
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}

// Process items with tracing
const results = [];
for (const item of items) {
  const parentSpan = item.json.__otelSpan;
  const toolResult = await executeWithTracing(
    item.json.toolName,
    item.json.toolParams,
    parentSpan
  );
  results.push({
    json: {
      ...toolResult,
      __otelSpan: parentSpan,
    },
  });
}
return results;
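The nodes above hand a live span object from one node to the next in the item JSON. Whether that object survives the hop depends on how n8n passes items in your deployment (an assumption worth testing, since serialized items lose their methods). A string carrier always survives, and the W3C traceparent header is the standard format for one. The sketch below shows what such a carrier contains; formatTraceparent and parseTraceparent are hypothetical helpers written without the SDK.

```javascript
// W3C traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Serialize a span context into a string that survives JSON round trips.
function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Parse a traceparent string; returns null for malformed or all-zero IDs.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Round trip through JSON, as would happen between two workflow nodes
const carrier = {
  traceparent: formatTraceparent({
    traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
    spanId: '00f067aa0ba902b7',
    sampled: true,
  }),
};
const received = parseTraceparent(JSON.parse(JSON.stringify(carrier)).traceparent);
```

In instrumented code you would normally let propagation.inject and propagation.extract from @opentelemetry/api write and read this field instead of parsing it by hand.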
Complete n8n Workflow Example
Here's a production-ready n8n workflow with full OpenTelemetry instrumentation:
{
"name": "AI Customer Support with Observability",
"nodes": [
{
"parameters": {},
"name": "Webhook",
"type": "n8n-nodes-base.webhook",
"typeVersion": 1,
"position": [250, 300]
},
{
"parameters": {
"jsCode": "// Initialize OpenTelemetry\nconst { NodeSDK } = require('@opentelemetry/sdk-node');\nconst { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');\nconst { Resource } = require('@opentelemetry/resources');\nconst { trace } = require('@opentelemetry/api');\n\nconst sdk = new NodeSDK({\n resource: new Resource({\n 'service.name': 'n8n-support-workflow',\n 'service.version': '2.0.0',\n }),\n traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),\n});\n\nif (!global.otelSdk) {\n sdk.start();\n global.otelSdk = sdk;\n}\n\nconst tracer = trace.getTracer('support-workflow');\nconst span = tracer.startSpan('workflow.execution', {\n attributes: {\n 'n8n.workflow.id': $workflow.id,\n 'n8n.execution.id': $execution.id,\n 'customer.tier': items[0].json.customerTier || 'standard',\n },\n});\n\nreturn [{\n json: {\n ...items[0].json,\n __otelSpan: span,\n },\n}];"
},
"name": "Init Telemetry",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [450, 300]
},
{
"parameters": {
"jsCode": "// Classify intent with tracing\nconst { trace, context, SpanStatusCode } = require('@opentelemetry/api');\nconst tracer = trace.getTracer('support-intent');\n\n// startSpan() expects a Context as its third argument\nconst parentSpan = items[0].json.__otelSpan;\nconst span = tracer.startSpan('intent.classification', {}, trace.setSpan(context.active(), parentSpan));\n\n// Simulated classification\nconst query = items[0].json.query || '';\nconst intent = classifyIntent(query);\n\nspan.setAttributes({\n 'intent.category': intent.category,\n 'intent.confidence': intent.confidence,\n 'intent.query_length': query.length,\n});\nspan.setStatus({ code: SpanStatusCode.OK });\nspan.end();\n\nreturn [{\n json: {\n ...items[0].json,\n intent: intent,\n __otelSpan: parentSpan,\n },\n}];\n\nfunction classifyIntent(query) {\n // Real implementation would use LLM\n if (query.includes('refund')) return { category: 'billing', confidence: 0.95 };\n if (query.includes('bug')) return { category: 'technical', confidence: 0.88 };\n return { category: 'general', confidence: 0.75 };\n}"
},
"name": "Classify Intent",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [650, 300]
},
{
"parameters": {
"jsCode": "// Retrieve knowledge base with tracing\nconst { trace, context, SpanStatusCode } = require('@opentelemetry/api');\nconst tracer = trace.getTracer('support-kb');\n\n// startSpan() expects a Context as its third argument\nconst parentSpan = items[0].json.__otelSpan;\nconst span = tracer.startSpan('knowledge.base.query', {}, trace.setSpan(context.active(), parentSpan));\n\nconst startTime = Date.now();\nconst results = await queryKnowledgeBase(items[0].json.intent);\nconst duration = Date.now() - startTime;\n\nspan.setAttributes({\n 'kb.results.count': results.length,\n 'kb.query.duration_ms': duration,\n 'kb.query.success': true,\n});\nspan.setStatus({ code: SpanStatusCode.OK });\nspan.end();\n\nreturn [{\n json: {\n ...items[0].json,\n kbResults: results,\n __otelSpan: parentSpan,\n },\n}];\n\nasync function queryKnowledgeBase(intent) {\n // Query vector DB or similar\n return [{ title: 'FAQ', content: '...' }];\n}"
},
"name": "Query KB",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [850, 300]
},
{
"parameters": {
"model": "gpt-4o",
"options": {}
},
"name": "Generate Response",
"type": "n8n-nodes-base.openAi",
"typeVersion": 1,
"position": [1050, 300]
},
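{
"parameters": {
"jsCode": "// Close span and export\nconst { trace } = require('@opentelemetry/api');\n\nconst span = items[0].json.__otelSpan;\nif (span) {\n span.setAttributes({\n 'response.length': items[0].json.response?.length || 0,\n 'workflow.success': true,\n });\n span.setStatus({ code: 1 }); // OK\n span.end();\n}\n\n// Flush telemetry through the registered tracer provider\n// (NodeSDK does not expose a public traceProvider property)\nconst provider = trace.getTracerProvider();\nconst delegate = typeof provider.getDelegate === 'function' ? provider.getDelegate() : provider;\nif (typeof delegate.forceFlush === 'function') {\n await delegate.forceFlush();\n}\n\nreturn items;"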
},
"name": "Finalize Telemetry",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1250, 300]
}
],
"connections": {
"Webhook": { "main": [[{ "node": "Init Telemetry", "type": "main", "index": 0 }]] },
"Init Telemetry": { "main": [[{ "node": "Classify Intent", "type": "main", "index": 0 }]] },
"Classify Intent": { "main": [[{ "node": "Query KB", "type": "main", "index": 0 }]] },
"Query KB": { "main": [[{ "node": "Generate Response", "type": "main", "index": 0 }]] },
"Generate Response": { "main": [[{ "node": "Finalize Telemetry", "type": "main", "index": 0 }]] }
}
}
OpenClaw Observability Implementation
Native OpenTelemetry Support in OpenClaw
OpenClaw provides built-in observability hooks that integrate seamlessly with OpenTelemetry:
Configuration in ~/.openclaw/config.yaml:
observability:
enabled: true
provider: opentelemetry
opentelemetry:
endpoint: http://otel-collector:4317
protocol: grpc
insecure: false
resource_attributes:
service.name: openclaw-agent
service.version: "1.0.0"
deployment.environment: production
host.name: ${HOSTNAME}
# Sampling configuration
sampler:
type: traceidratio
ratio: 0.1 # Sample 10% of traces
# Export configuration
batch:
timeout: 5000
queue_size: 2048
max_export_batch_size: 512
# Metric collection
metrics:
enabled: true
export_interval: 60000 # 60 seconds
# Log correlation
logs:
enabled: true
correlation_enabled: true
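The batch settings above share their names with the OpenTelemetry batch span processor options, so the sketch below assumes they behave the same way: spans wait in a bounded queue, are exported in batches, and are dropped when the queue is full. BatchQueue is an illustrative class, not OpenClaw or SDK code.

```javascript
// Bounded queue with batch export, mirroring queue_size and max_export_batch_size.
class BatchQueue {
  constructor({ queueSize = 2048, maxExportBatchSize = 512 } = {}) {
    this.queueSize = queueSize;
    this.maxExportBatchSize = maxExportBatchSize;
    this.items = [];
    this.dropped = 0;
  }

  // Returns false when the span is dropped because the queue is full.
  enqueue(span) {
    if (this.items.length >= this.queueSize) {
      this.dropped += 1;
      return false;
    }
    this.items.push(span);
    return true;
  }

  // Called when the timeout fires or a full batch is waiting:
  // removes and returns at most maxExportBatchSize spans.
  nextBatch() {
    return this.items.splice(0, this.maxExportBatchSize);
  }
}

// Small numbers make the behavior visible: 5 spans into a queue of 3.
const queue = new BatchQueue({ queueSize: 3, maxExportBatchSize: 2 });
for (let i = 0; i < 5; i++) queue.enqueue({ name: `span-${i}` });
const firstBatch = queue.nextBatch();
```

The practical consequence is that a burst larger than queue_size loses spans silently, so the dropped count is itself worth exporting as a metric.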
Automatic Instrumentation:
OpenClaw automatically instruments these operations:
// These are automatically traced when observability is enabled
// 1. Tool executions
const result = await claude.tools.execute('web_search', {
query: 'OpenClaw observability'
});
// Creates span: tool.web_search with attributes for duration, success, params
// 2. LLM calls
const response = await claude.llm.complete({
model: 'claude-3-5-sonnet',
messages: [...]
});
// Creates span: llm.completion with token usage, latency, model info
// 3. File operations
const content = await claude.fs.read('/path/to/file');
// Creates span: fs.read with file size, operation duration
// 4. Shell commands
const output = await claude.shell.exec('git status');
// Creates span: shell.exec with command, exit code, duration
// 5. HTTP requests
const data = await claude.http.get('https://api.example.com/data');
// Creates span: http.get with URL, status code, response size
Custom Instrumentation in OpenClaw Skills
For custom skills, OpenClaw provides a tracing API:
// custom-skill/SKILL.js
const { trace, SpanStatusCode } = require('@opentelemetry/api');
module.exports = {
name: 'custom-analytics',
description: 'Custom analytics with observability',
async execute(params, context) {
const tracer = trace.getTracer('custom-skill');
// Create a span for this skill execution
const span = tracer.startSpan('skill.custom_analytics.execute', {
attributes: {
'skill.name': 'custom-analytics',
'skill.params': JSON.stringify(params),
'user.id': context.userId,
'session.id': context.sessionId,
},
});
try {
// Your skill logic here
const result = await performAnalytics(params);
// Record success metrics
span.setAttributes({
'analytics.records_processed': result.recordCount,
'analytics.computation_time_ms': result.duration,
'skill.success': true,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.recordException(error);
span.setAttributes({
'skill.success': false,
'skill.error': error.message,
});
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
},
};
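The try/catch/finally pattern in the skill above repeats in every instrumented function, so it can be factored into a wrapper. The sketch below is one possible shape; withSpan and createRecordingTracer are hypothetical helpers, and the stand-in tracer exists only so the example runs without the SDK installed. The numeric status codes mirror SpanStatusCode in @opentelemetry/api.

```javascript
// Mirrors SpanStatusCode.OK and SpanStatusCode.ERROR in @opentelemetry/api
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Run fn inside a span: set status, record exceptions, and always end the span.
async function withSpan(tracer, name, attributes, fn) {
  const span = tracer.startSpan(name, { attributes });
  try {
    const result = await fn(span);
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

// Stand-in tracer that logs calls, used here in place of a real SDK tracer.
function createRecordingTracer(log = []) {
  return {
    log,
    startSpan(name) {
      log.push(`start:${name}`);
      return {
        setAttributes() {},
        setStatus: ({ code }) => log.push(`status:${code}`),
        recordException: (error) => log.push(`exception:${error.message}`),
        end: () => log.push('end'),
      };
    },
  };
}
```

With a real tracer from trace.getTracer, a skill body shrinks to a single call such as withSpan(tracer, 'skill.custom_analytics.execute', attrs, () => performAnalytics(params)).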
OpenClaw Heartbeat and Cron Monitoring
OpenClaw's scheduled tasks automatically generate observability data:
# config.yaml
heartbeat:
enabled: true
interval: 300 # 5 minutes
observability:
trace_heartbeat: true
metric_heartbeat: true
log_heartbeat: true
custom_attributes:
heartbeat.purpose: periodic_health_check
heartbeat.scope: system_wide
cron:
jobs:
- name: daily-report-generation
schedule: "0 9 * * *"
command: generate_daily_report
observability:
trace_execution: true
capture_output: true
alert_on_failure: true
success_metrics:
- report.generation.duration
- report.generation.record_count
- report.file.size
Heartbeat Observability Data:
{
"traceId": "abc123def456",
"spanId": "span789",
"name": "heartbeat.execution",
"timestamp": "2026-04-23T09:45:00Z",
"duration": 2450,
"attributes": {
"heartbeat.id": "main-health-check",
"heartbeat.interval": 300,
"checks.email": "completed",
"checks.calendar": "completed",
"checks.memory": "completed",
"checks.git": "completed",
"results.new_emails": 3,
"results.upcoming_events": 2,
"results.memory_updates": 0
},
"status": { "code": 1 }
}
Building a Self-Hosted Observability Stack
Architecture Overview
A production-ready observability stack for AI agents requires:
┌─────────────────────────────────────────────────────────────────────────────┐
│                               AI Agent Layer                                │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                     │
│  │     n8n      │   │   OpenClaw   │   │ Custom Apps  │                     │
│  │  Workflows   │   │    Agents    │   │  (Node.js)   │                     │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘                     │
└─────────┼──────────────────┼──────────────────┼─────────────────────────────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────────────────┐
│                           OpenTelemetry Collector                           │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ Receivers: OTLP (gRPC/HTTP), Prometheus, Jaeger, Zipkin              │  │
│  │ Processors: Batch, Memory Limiter, Resource Detection                │  │
│  │ Exporters: Prometheus, Jaeger, Loki, Tempo, Custom                   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
 ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐
 │  Prometheus   │   │    Grafana    │   │     Tempo     │
 │   (Metrics)   │   │ (Dashboards)  │   │   (Traces)    │
 └───────────────┘   └───────────────┘   └───────────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────────────────┐
│                                 Loki (Logs)                                 │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ Log aggregation with automatic parsing and label extraction          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
Docker Compose Setup
# docker-compose.observability.yml
version: '3.8'
services:
# OpenTelemetry Collector
otel-collector:
image: otel/opentelemetry-collector-contrib:0.91.0
container_name: otel-collector
command: ["--config=/etc/otel-collector-config.yaml"]
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8888:8888" # Prometheus metrics
- "8889:8889" # Prometheus exporter
- "9411:9411" # Zipkin
networks:
- observability
depends_on:
- tempo
- loki
# Tempo - Distributed tracing backend
tempo:
image: grafana/tempo:2.3.1
container_name: tempo
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo-config.yaml:/etc/tempo.yaml
- tempo-data:/tmp/tempo
ports:
- "3200:3200" # Tempo query
- "9095:9095" # GRPC
networks:
- observability
# Loki - Log aggregation
loki:
image: grafana/loki:2.9.3
container_name: loki
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
- loki-data:/loki
ports:
- "3100:3100"
networks:
- observability
# Prometheus - Metrics storage
prometheus:
image: prom/prometheus:v2.48.0
container_name: prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
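- '--web.enable-lifecycle'
# Required so the collector's prometheusremotewrite exporter can push to /api/v1/write
- '--web.enable-remote-write-receiver'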
volumes:
- ./prometheus-config.yaml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
ports:
- "9090:9090"
networks:
- observability
# Grafana - Visualization
grafana:
image: grafana/grafana:10.2.3
container_name: grafana
environment:
- GF_SECURITY_ADMIN_USER=admin
# Default credentials for local testing only; set a strong password before exposing Grafana
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
- GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
volumes:
- ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
- ./grafana-dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml
- ./dashboards:/var/lib/grafana/dashboards
- grafana-data:/var/lib/grafana
ports:
- "3000:3000"
networks:
- observability
depends_on:
- prometheus
- tempo
- loki
networks:
observability:
driver: bridge
volumes:
tempo-data:
loki-data:
prometheus-data:
grafana-data:
OpenTelemetry Collector Configuration:
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
memory_limiter:
limit_mib: 512
spike_limit_mib: 128
check_interval: 5s
resource:
attributes:
- key: environment
value: production
action: upsert
- key: team
value: ai-automation
action: upsert
exporters:
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
labels:
attributes:
- key: service.name
name: service_name
- key: host.name
name: host_name
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, resource, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [memory_limiter, resource, batch]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [memory_limiter, resource, batch]
exporters: [loki]
Tempo Configuration for AI Agent Traces
# tempo-config.yaml
server:
http_listen_port: 3200
grpc_listen_port: 9095
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
ingester:
trace_idle_period: 10s
max_block_bytes: 1048576
max_block_duration: 5m
compactor:
compaction:
compaction_window: 1h
max_compaction_objects: 1000000
block_retention: 168h # 7 days
compacted_block_retention: 1h
storage:
trace:
backend: local
local:
path: /tmp/tempo/traces
wal:
path: /tmp/tempo/wal
overrides:
defaults:
ingestion:
burst_size_bytes: 20000000
rate_limit_bytes: 15000000
max_traces_per_user: 100000
max_bytes_per_trace: 5000000
max_bytes_per_tag_values_query: 5000000
Grafana Dashboards for AI Agent Monitoring
Dashboard 1: AI Agent Overview
{
"dashboard": {
"title": "AI Agent Observability Overview",
"panels": [
{
"title": "Request Rate",
"type": "stat",
"targets": [
{
"expr": "rate(agent_requests_total[5m])",
"legendFormat": "Requests/sec"
}
]
},
{
"title": "Latency (P50 / P95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95 Latency"
},
{
"expr": "histogram_quantile(0.50, rate(agent_request_duration_seconds_bucket[5m]))",
"legendFormat": "P50 Latency"
}
]
},
{
"title": "Token Usage",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm_tokens_total[5m])) by (model)",
"legendFormat": "{{model}}"
}
]
},
{
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(agent_errors_total[5m])) / sum(rate(agent_requests_total[5m])) * 100",
"legendFormat": "Error %"
}
]
}
]
}
}
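The latency panel in this dashboard relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that estimate in plain JavaScript for a single histogram. The function name and input shape are assumptions for illustration, and it expects at least one finite bucket followed by a +Inf bucket.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with le = Infinity
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the rank
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite upper bound
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Linear interpolation between the bucket's lower and upper bound
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// 100 requests: 50 finished within 0.5s, 90 within 1s, all within 5s
const latencyBuckets = [
  { le: 0.5, count: 50 },
  { le: 1, count: 90 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
const p50 = histogramQuantile(0.5, latencyBuckets);
const p95 = histogramQuantile(0.95, latencyBuckets);
```

The estimate can only be as precise as the bucket boundaries: here every request between 1s and 5s lands in one wide bucket, so choosing boundaries around your latency targets matters more than the query itself.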
Dashboard 2: LLM Performance Deep Dive
{
"dashboard": {
"title": "LLM Performance Monitoring",
"panels": [
{
"title": "Cost per Request",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm_cost_total[5m])) / sum(rate(llm_requests_total[5m]))",
"legendFormat": "Avg Cost/Request"
}
]
},
{
"title": "Token Efficiency",
"type": "gauge",
"targets": [
{
"expr": "sum(llm_tokens_output) / sum(llm_tokens_input) * 100",
"legendFormat": "Output/Input Ratio"
}
]
},
{
"title": "Model Distribution",
"type": "piechart",
"targets": [
{
"expr": "sum by (model) (llm_requests_total)",
"legendFormat": "{{model}}"
}
]
},
{
"title": "Hallucination Score Trend",
"type": "graph",
"targets": [
{
"expr": "avg(llm_quality_hallucination_score) by (model)",
"legendFormat": "{{model}}"
}
]
}
]
}
}
Advanced Observability Patterns
Distributed Tracing Across Multi-Agent Systems
When multiple AI agents collaborate, traces must span service boundaries:
// Agent A: Orchestrator
const { propagation, context, trace, SpanStatusCode } = require('@opentelemetry/api');

async function orchestrateTask(userRequest) {
  const tracer = trace.getTracer('orchestrator');
  const span = tracer.startSpan('task.orchestrate');
  const startedAt = Date.now();

  // Inject trace context for downstream agents. startSpan() does not make the
  // span active, so build a context that carries it before injecting.
  const carrier = {};
  propagation.inject(trace.setSpan(context.active(), span), carrier);

  try {
    // Call Agent B with trace context
    const agentBResponse = await callAgentB({
      task: 'analyze_sentiment',
      data: userRequest,
      traceContext: carrier, // Pass trace context
    });

    // Call Agent C with same trace context
    const agentCResponse = await callAgentC({
      task: 'generate_response',
      sentiment: agentBResponse.sentiment,
      traceContext: carrier,
    });

    span.setAttributes({
      'orchestrator.agents_involved': 2,
      'orchestrator.total_duration_ms': Date.now() - startedAt,
    });
    span.setStatus({ code: SpanStatusCode.OK });
    return agentCResponse;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

// Agent B: Sentiment Analyzer
// Runs as a separate service, so it needs the same require() of
// propagation, context, and trace from '@opentelemetry/api'.
async function analyzeSentiment(request) {
  // Extract trace context from incoming request
  const parentContext = propagation.extract(
    context.active(),
    request.traceContext
  );
  const tracer = trace.getTracer('sentiment-analyzer');
  const span = tracer.startSpan(
    'sentiment.analyze',
    undefined,
    parentContext // Use extracted context as parent
  );
  try {
    // Process with LLM (llm.complete is a placeholder for your model client)
    const result = await llm.complete({
      prompt: `Analyze sentiment: ${request.data}`,
    });
    span.setAttributes({
      'sentiment.score': result.sentiment,
      'sentiment.confidence': result.confidence,
    });
    return result;
  } finally {
    span.end();
  }
}
Custom Metrics for Business KPIs
Beyond technical metrics, track business-relevant indicators:
// Custom business metrics
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('business-metrics', '1.0.0');
// Counter for completed tasks
const taskCompletionCounter = meter.createCounter('business.tasks.completed', {
description: 'Total tasks completed by AI agents',
});
// Histogram for task value
const taskValueHistogram = meter.createHistogram('business.task.value', {
description: 'Monetary value of completed tasks',
unit: 'USD',
});
// UpDownCounter for active users
const activeUsersCounter = meter.createUpDownCounter('business.users.active', {
description: 'Number of active users',
});
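// ObservableGauge for system health. Observable instruments report only
// through a callback; computeHealthScore() is a placeholder for your own logic.
const systemHealthGauge = meter.createObservableGauge('business.system.health', {
  description: 'Overall system health score',
});
systemHealthGauge.addCallback((result) => {
  result.observe(computeHealthScore());
});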
// Usage in code
async function completeTask(task) {
// ... task execution ...
taskCompletionCounter.add(1, {
'task.type': task.type,
'task.priority': task.priority,
'agent.id': task.agentId,
});
taskValueHistogram.record(task.value, {
'task.category': task.category,
});
}
function userSessionStart(userId) {
activeUsersCounter.add(1, { 'user.tier': getUserTier(userId) });
}
function userSessionEnd(userId) {
activeUsersCounter.add(-1, { 'user.tier': getUserTier(userId) });
}
Real-Time Alerting Configuration
# alerting-rules.yaml
groups:
- name: ai_agent_alerts
interval: 30s
rules:
# High error rate
- alert: AIAgentHighErrorRate
expr: |
(
sum(rate(agent_errors_total[5m])) by (agent_id)
/
sum(rate(agent_requests_total[5m])) by (agent_id)
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "AI Agent {{ $labels.agent_id }} has high error rate"
description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes"
# High latency
- alert: AIAgentHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(agent_request_duration_seconds_bucket[5m])) by (le, agent_id)
) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "AI Agent {{ $labels.agent_id }} has high latency"
description: "P95 latency is {{ $value }}s"
# Token usage spike
- alert: AIAgentTokenSpike
expr: |
sum(rate(llm_tokens_total[5m])) by (agent_id, model)
>
avg_over_time(
sum(rate(llm_tokens_total[1h])) by (agent_id, model)[1d:1h]
) * 3
for: 10m
labels:
severity: warning
annotations:
summary: "Token usage spike detected for {{ $labels.agent_id }}"
description: "Current rate is 3x the daily average"
# Cost threshold
- alert: AIAgentCostThreshold
expr: |
sum(increase(llm_cost_total[1h])) by (agent_id) > 100
for: 0m
labels:
severity: info
annotations:
summary: "AI Agent {{ $labels.agent_id }} approaching cost threshold"
description: "Hourly cost is ${{ $value }}"
# Hallucination detection
- alert: AIAgentHallucinationSpike
expr: |
avg by (agent_id) (avg_over_time(llm_quality_hallucination_score[10m])) > 0.3
for: 5m
labels:
severity: critical
annotations:
summary: "Hallucination spike in {{ $labels.agent_id }}"
description: "Average hallucination score is {{ $value }}"
# Service down
- alert: AIAgentDown
expr: |
up{job="ai-agents"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "AI Agent {{ $labels.instance }} is down"
description: "Agent has been down for more than 1 minute"
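The token-spike rule is easier to reason about with its arithmetic written out. The sketch below mirrors the comparison in plain JavaScript, assuming you already have a list of trailing hourly token rates. `average` and `isTokenSpike` are hypothetical names used for illustration; they are not part of Prometheus.

```javascript
// Sketch of the comparison behind AIAgentTokenSpike.
// `baselineRates` stands in for the hourly samples that the PromQL
// subquery [1d:1h] would produce; all names here are illustrative.
function average(values) {
  if (values.length === 0) return 0;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

function isTokenSpike(currentRate, baselineRates, multiplier = 3) {
  const baseline = average(baselineRates);
  // With no baseline yet (for example, a brand-new agent), do not fire.
  if (baseline === 0) return false;
  return currentRate > baseline * multiplier;
}

// The baseline averages 100 tokens/s, so the threshold is 300 tokens/s.
const hourlyRates = [80, 120, 100, 100];
console.log(isTokenSpike(250, hourlyRates)); // false
console.log(isTokenSpike(350, hourlyRates)); // true
```

Writing the check this way also makes it easy to unit-test a different multiplier before changing the alert rule itself.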
LLM-Specific Observability Considerations
Prompt Versioning and Tracking
Track prompt changes and their impact:
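// Prompt version tracking
const crypto = require('crypto');
const { SpanStatusCode } = require('@opentelemetry/api');
class VersionedPrompt {
constructor(name, template, version) {
this.name = name;
this.template = template;
this.version = version;
this.hash = this.computeHash(template);
}
// Short content hash so template edits are visible even if the version is not bumped
computeHash(template) {
return crypto
.createHash('sha256')
.update(template)
.digest('hex')
.substring(0, 16);
}
// Replace {{name}} placeholders; unknown placeholders are left untouched
fillTemplate(variables) {
return this.template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
Object.hasOwn(variables, key) ? String(variables[key]) : match
);
}
// callLLM is assumed to be defined elsewhere and to resolve to the response text
async execute(variables, tracer) {
const span = tracer.startSpan('llm.prompt.execute', {
attributes: {
'prompt.name': this.name,
'prompt.version': this.version,
'prompt.hash': this.hash,
'prompt.template_length': this.template.length,
},
});
try {
const filledPrompt = this.fillTemplate(variables);
span.setAttributes({
'prompt.filled_length': filledPrompt.length,
'prompt.variables_count': Object.keys(variables).length,
});
const result = await callLLM(filledPrompt);
span.setAttributes({
'prompt.success': true,
'prompt.response_length': result.length,
});
return result;
} catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.setAttributes({
'prompt.success': false,
'prompt.error': error.message,
});
throw error;
} finally {
// Always close the span, whether the call succeeded or failed
span.end();
}
}
}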
// Usage
const supportPrompt = new VersionedPrompt(
'customer-support',
'You are a support agent. Help with: {{issue}}',
'2.1.0'
);
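To compare prompt versions, the attributes recorded on each span can be grouped by version once the spans are exported. The following is a minimal sketch, assuming the spans have been flattened into plain objects that carry the same `prompt.version` and `prompt.success` attributes. `summarizeByVersion` is a hypothetical helper, and in practice this aggregation would more likely run in your tracing backend's query language.

```javascript
// Group exported span attributes by prompt version and compute success rates.
function summarizeByVersion(spans) {
  const counts = new Map();
  for (const attrs of spans) {
    const version = attrs['prompt.version'];
    const entry = counts.get(version) || { total: 0, succeeded: 0 };
    entry.total += 1;
    if (attrs['prompt.success'] === true) entry.succeeded += 1;
    counts.set(version, entry);
  }
  const summary = {};
  for (const [version, { total, succeeded }] of counts) {
    summary[version] = { total, successRate: succeeded / total };
  }
  return summary;
}

// Illustrative data only
const exportedSpans = [
  { 'prompt.version': '2.0.0', 'prompt.success': true },
  { 'prompt.version': '2.0.0', 'prompt.success': false },
  { 'prompt.version': '2.1.0', 'prompt.success': true },
  { 'prompt.version': '2.1.0', 'prompt.success': true },
];
console.log(summarizeByVersion(exportedSpans));
```

Grouping by `prompt.hash` instead of `prompt.version` would also catch template edits that were deployed without a version bump.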
Response Quality Scoring
Implement automated quality evaluation:
// Quality scoring instrumentation
async function evaluateResponseQuality(request, response, tracer) {
const span = tracer.startSpan('quality.evaluation');
const scores = {
relevance: await scoreRelevance(request, response),
coherence: await scoreCoherence(response),
factuality: await scoreFactuality(response),
safety: await scoreSafety(response),
};
const overallScore = Object.values(scores).reduce((a, b) => a + b, 0) / Object.keys(scores).length;
span.setAttributes({
'quality.score.overall': overallScore,
'quality.score.relevance': scores.relevance,
'quality.score.coherence': scores.coherence,
'quality.score.factuality': scores.factuality,
'quality.score.safety': scores.safety,
'quality.threshold': 0.7,
'quality.passed': overallScore >= 0.7,
});
// Log low-quality responses for review
if (overallScore < 0.7) {
span.addEvent('quality.failed_threshold', {
'quality.request_preview': request.substring(0, 200),
'quality.response_preview': response.substring(0, 200),
});
}
span.end();
return scores;
}
async function scoreRelevance(request, response) {
// Implement relevance scoring
// Could use embeddings similarity, keyword matching, etc.
return 0.85; // Placeholder
}
async function scoreCoherence(response) {
// Check logical flow, grammar, consistency
return 0.90;
}
async function scoreFactuality(response) {
// Cross-reference with knowledge base
return 0.75;
}
async function scoreSafety(response) {
// Check for harmful content
return 0.95;
}
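The scorers above return constants. As one possible starting point for `scoreRelevance`, the sketch below measures word overlap (Jaccard similarity) between request and response. This is a deliberately crude stand-in, chosen because it needs no external services; embedding similarity would usually be a better signal. `tokenize` and `keywordRelevance` are hypothetical names, and the tokenizer only handles ASCII letters and digits.

```javascript
// Split text into a set of lowercase words, ignoring very short ones.
function tokenize(text) {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((word) => word.length > 2)
  );
}

// Jaccard similarity: shared words divided by all distinct words.
function keywordRelevance(request, response) {
  const requestWords = tokenize(request);
  const responseWords = tokenize(response);
  if (requestWords.size === 0 || responseWords.size === 0) return 0;
  let shared = 0;
  for (const word of requestWords) {
    if (responseWords.has(word)) shared += 1;
  }
  const unionSize = requestWords.size + responseWords.size - shared;
  return shared / unionSize;
}

console.log(
  keywordRelevance('reset my password', 'To reset your password, open settings')
); // 0.4
```

Because the result is already in the 0 to 1 range, it can be dropped into the `scores` object without rescaling.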
Chain-of-Thought Capture
For complex reasoning tasks, capture intermediate steps:
// Chain-of-thought tracing
async function executeWithReasoning(prompt, tracer) {
const span = tracer.startSpan('llm.reasoning');
// Request chain-of-thought
const cotPrompt = `${prompt}
Think step by step and explain your reasoning. Format your response as:
REASONING: [Your step-by-step thought process]
ANSWER: [Your final answer]`;
const response = await llm.complete(cotPrompt);
// Parse reasoning and answer
const reasoningMatch = response.match(/REASONING:\s*([\s\S]*?)(?=ANSWER:|$)/i);
const answerMatch = response.match(/ANSWER:\s*([\s\S]*)/i);
const reasoning = reasoningMatch ? reasoningMatch[1].trim() : '';
const answer = answerMatch ? answerMatch[1].trim() : response;
// Record reasoning steps as events
// Drop empty fragments first so the step count matches the recorded events
const steps = reasoning
.split(/\n\n|\n(?=Step \d+:|\d+\.|[-*])/i)
.map((step) => step.trim())
.filter(Boolean);
steps.forEach((step, index) => {
span.addEvent(`reasoning.step.${index + 1}`, {
'reasoning.step_content': step.substring(0, 500),
});
});
span.setAttributes({
'reasoning.steps_count': steps.length,
'reasoning.total_length': reasoning.length,
'answer.length': answer.length,
});
span.end();
return { reasoning, answer };
}
Production Deployment Checklist
Pre-Production Verification
observability_checklist:
instrumentation:
- [ ] All LLM calls instrumented with token tracking
- [ ] Tool executions create child spans
- [ ] Error handling captures stack traces
- [ ] Custom business metrics defined
- [ ] Resource attributes configured
sampling:
- [ ] Head-based sampling configured appropriately
- [ ] Tail-based sampling for error cases
- [ ] Sampling rates tested under load
- [ ] Cost impact of sampling calculated
export:
- [ ] Collector endpoints configured
- [ ] Retry logic configured
- [ ] Batch sizes optimized
- [ ] Timeout values set
- [ ] Circuit breakers in place
dashboards:
- [ ] Overview dashboard created
- [ ] LLM-specific dashboard created
- [ ] Error analysis dashboard created
- [ ] Cost tracking dashboard created
- [ ] Dashboard refresh rates optimized
alerting:
- [ ] Critical alerts defined
- [ ] Warning thresholds set
- [ ] Alert routing configured
- [ ] On-call rotation documented
- [ ] Alert fatigue prevention in place
security:
- [ ] PII redaction configured
- [ ] Sensitive data excluded from traces
- [ ] Access controls for observability data
- [ ] Audit logging enabled
- [ ] Data retention policies defined
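For the redaction items in the security checklist, one option is to scrub string attributes before they are attached to a span. The following is a minimal sketch, assuming email addresses and card-like digit runs are the sensitive patterns. The regular expressions are illustrative and far from exhaustive, and `redactText` and `redactAttributes` are hypothetical helpers.

```javascript
// Illustrative patterns only; real deployments need a broader, tested set.
const REDACTION_PATTERNS = [
  { name: 'email', regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // 13 to 19 digits, optionally separated by single spaces or dashes
  { name: 'card', regex: /\b\d(?:[ -]?\d){12,18}\b/g },
];

function redactText(text) {
  return REDACTION_PATTERNS.reduce(
    (current, { name, regex }) =>
      current.replace(regex, `[REDACTED_${name.toUpperCase()}]`),
    text
  );
}

// Returns a copy of the attributes with string values scrubbed;
// numbers and booleans pass through unchanged.
function redactAttributes(attributes) {
  const clean = {};
  for (const [key, value] of Object.entries(attributes)) {
    clean[key] = typeof value === 'string' ? redactText(value) : value;
  }
  return clean;
}

console.log(
  redactAttributes({
    'quality.request_preview':
      'Contact me at jane.doe@example.com about card 4111 1111 1111 1111',
    'quality.score.overall': 0.82,
  })
);
```

Calling `redactAttributes` on the object passed to `span.setAttributes` or `span.addEvent` keeps the redaction in one place. A collector-side processor can serve as a second line of defense.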
Performance Optimization
// Optimized instrumentation for high-throughput scenarios
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// Use batch processing
const traceExporter = new OTLPTraceExporter({
url: 'http://otel-collector:4317',
// Compression reduces network overhead
compression: 'gzip',
});
const spanProcessor = new BatchSpanProcessor(traceExporter, {
// Buffer spans before export
maxQueueSize: 2048,
maxExportBatchSize: 512,
// Export every 5 seconds
scheduledDelayMillis: 5000,
// Cancel any single export that takes longer than 30 seconds
exportTimeoutMillis: 30000,
});
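// Sampling strategy
const {
ParentBasedSampler,
TraceIdRatioBasedSampler,
SamplingDecision,
} = require('@opentelemetry/sdk-trace-base');
const ratioSampler = new TraceIdRatioBasedSampler(0.1); // 10% sampling
// Head sampling only sees attributes supplied at span start, so this rule helps
// only when the caller already knows the span is on an error path.
// Errors discovered later need tail-based sampling in the collector.
const errorAwareSampler = {
shouldSample(context, traceId, spanName, spanKind, attributes = {}, links) {
if (attributes['error'] || attributes['http.status_code'] >= 500) {
return { decision: SamplingDecision.RECORD_AND_SAMPLED };
}
return ratioSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
},
toString() {
return 'ErrorAwareSampler';
},
};
// Follow the parent's decision so a trace is not sampled in one service and dropped in another
const sampler = new ParentBasedSampler({ root: errorAwareSampler });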
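Ratio-based samplers make a deterministic decision per trace ID, so every service that sees the same trace reaches the same verdict. The sketch below illustrates the idea in plain JavaScript by comparing the leading 32 bits of the trace ID with a threshold. It is a simplified illustration rather than the SDK's exact algorithm, and `shouldSampleTrace` is a hypothetical name.

```javascript
// Deterministic ratio sampling: the same trace ID always yields the same answer.
function shouldSampleTrace(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex characters as an unsigned 32-bit integer.
  const leadingBits = parseInt(traceId.slice(0, 8), 16);
  const threshold = Math.floor(ratio * 0x100000000);
  return leadingBits < threshold;
}

console.log(shouldSampleTrace('0000000fa1b2c3d4e5f60718293a4b5c', 0.1)); // true
console.log(shouldSampleTrace('ffffffffa1b2c3d4e5f60718293a4b5c', 0.1)); // false
```

Because the decision depends only on the trace ID, no coordination between services is needed to keep traces complete.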
Cost Management
// Cost-aware observability
class CostControlledObservability {
constructor(dailyBudgetUSD) {
this.dailyBudget = dailyBudgetUSD;
this.todaySpend = 0;
this.traceSampleRate = 1.0;
this.metricCollectionRate = 1.0;
}
updateSpend(llmCost) {
this.todaySpend += llmCost;
// Adjust sampling based on budget consumption
const budgetUsed = this.todaySpend / this.dailyBudget;
if (budgetUsed > 0.9) {
// Emergency: minimal observability
this.traceSampleRate = 0.01;
this.metricCollectionRate = 0.5;
} else if (budgetUsed > 0.7) {
// Warning: reduce observability overhead
this.traceSampleRate = 0.05;
this.metricCollectionRate = 0.8;
} else if (budgetUsed > 0.5) {
// Caution: moderate reduction
this.traceSampleRate = 0.1;
}
}
shouldTrace() {
return Math.random() < this.traceSampleRate;
}
shouldCollectMetrics() {
return Math.random() < this.metricCollectionRate;
}
}
// Usage
const observability = new CostControlledObservability(1000); // $1000/day budget
async function executeWithBudgetControl(task) {
const startTime = Date.now();
const span = observability.shouldTrace()
? tracer.startSpan('task.execute')
: null;
try {
const result = await executeTask(task);
// Update cost tracking
const llmCost = calculateLLMCost(result);
observability.updateSpend(llmCost);
if (span) {
span.setAttributes({
'cost.llm': llmCost,
'cost.budget_remaining': observability.dailyBudget - observability.todaySpend,
'observability.sampled': true,
});
span.end();
}
return result;
} catch (error) {
if (span) {
span.recordException(error);
span.end();
}
throw error;
}
}
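`calculateLLMCost` is referenced above but not defined. One way to sketch it, assuming the task result exposes a `model` name and a `usage` object with prompt and completion token counts, and that prices are quoted per million tokens. The model names and prices below are placeholders, not real vendor pricing.

```javascript
// Placeholder prices in USD per 1M tokens; replace with your provider's actual rates.
const PRICE_PER_MILLION_TOKENS = {
  'example-small': { input: 0.5, output: 1.5 },
  'example-large': { input: 5, output: 15 },
};

// Assumed result shape: { model, usage: { promptTokens, completionTokens } }
function calculateLLMCost(result) {
  const price = PRICE_PER_MILLION_TOKENS[result.model];
  if (!price) {
    // Failing loudly avoids silently recording a cost of zero.
    throw new Error(`No pricing configured for model "${result.model}"`);
  }
  const { promptTokens = 0, completionTokens = 0 } = result.usage || {};
  return (promptTokens * price.input + completionTokens * price.output) / 1_000_000;
}

const cost = calculateLLMCost({
  model: 'example-large',
  usage: { promptTokens: 2000, completionTokens: 1000 },
});
console.log(cost.toFixed(4)); // "0.0250"
```

Input and output tokens are priced separately here because providers commonly charge different rates for each.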
Future Trends in AI Observability
Emerging Standards and Tools
OpenTelemetry Semantic Conventions Evolution:
The OpenTelemetry community is actively developing semantic conventions specifically for AI/ML workloads:
- Model Card Attribution: Tracking model provenance and versioning
- Prompt Injection Detection: Security-focused attributes for attack detection
- Carbon Footprint Metrics: Environmental impact tracking for AI workloads
- Fairness Metrics: Bias detection and demographic parity measurement
Specialized AI Observability Platforms:
Emerging Tools (2026):
├── Langfuse: Open-source LLM engineering platform
├── Phoenix by Arize: ML observability for LLMs
├── LangSmith: LangChain-native observability
├── OpenLLMetry: OpenTelemetry-based LLM instrumentation
├── AgentOps: Multi-agent system monitoring
└── Helicone: Open-source LLM observability
Integration Patterns
GitOps for Observability:
# observability-gitops.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ai-agent-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
app: ai-agent
endpoints:
- port: metrics
interval: 15s
scrapeTimeout: 10s
metricRelabelings:
- sourceLabels: [__name__]
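# Keep the llm_* and agent_* series that the alerting rules query; everything else is dropped
regex: '(llm|agent)_.*'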
action: keep
MCP (Model Context Protocol) Observability:
As MCP gains adoption, observability integration becomes critical:
// MCP server with observability
const { Server } = require('@modelcontextprotocol/sdk/server/index.js');
const { CallToolRequestSchema } = require('@modelcontextprotocol/sdk/types.js');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const server = new Server(
{ name: 'observable-mcp-server', version: '1.0.0' },
{ capabilities: { tools: {} } }
);
const tracer = trace.getTracer('mcp-server');
// Instrument all tool calls; executeTool is assumed to be defined elsewhere
server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
const span = tracer.startSpan('mcp.tool.call', {
attributes: {
'mcp.tool.name': request.params.name,
'mcp.request.id': String(extra?.requestId ?? 'unknown'),
},
});
try {
const result = await executeTool(request.params);
span.setAttributes({
'mcp.tool.success': true,
'mcp.tool.result_size': JSON.stringify(result).length,
});
return result;
} catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.setAttribute('mcp.tool.success', false);
throw error;
} finally {
// Close the span on both success and failure
span.end();
}
});
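The start, try, catch, end sequence recurs in most examples in this guide. One way to factor it out is a small wrapper that accepts any tracer exposing `startSpan`. This is a sketch: `withSpan` is a hypothetical helper, and the stub tracer exists only so the example runs without the OpenTelemetry packages installed.

```javascript
// Run `fn` inside a span, recording failures and always ending the span.
async function withSpan(tracer, name, attributes, fn) {
  const span = tracer.startSpan(name, { attributes });
  try {
    return await fn(span);
  } catch (error) {
    span.recordException(error);
    throw error;
  } finally {
    // Runs on both success and failure, so the span is never left open.
    span.end();
  }
}

// Stub tracer for illustration only; a real tracer comes from trace.getTracer().
function createStubTracer(events) {
  return {
    startSpan(name) {
      events.push(`start:${name}`);
      return {
        setAttributes() {},
        recordException(error) {
          events.push(`exception:${error.message}`);
        },
        end() {
          events.push(`end:${name}`);
        },
      };
    },
  };
}

const demoEvents = [];
withSpan(
  createStubTracer(demoEvents),
  'mcp.tool.call',
  { 'mcp.tool.name': 'search' },
  async () => 'ok'
).then((result) => console.log(result, demoEvents));
```

With a helper like this, a tool handler reduces to a single call that passes the tool logic as `fn`, and the span lifecycle cannot be forgotten on an error path.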
Conclusion: Building Observable AI Systems
Production AI agent observability requires a multi-layered approach combining OpenTelemetry instrumentation, self-hosted infrastructure, and domain-specific metrics. By implementing the patterns in this guide, organizations can achieve:
- Sub-minute detection of AI agent failures and anomalies
- 40-60% reduction in debugging time for complex issues
- Complete cost visibility across all AI operations
- Proactive alerting for quality degradation and hallucinations
- Audit-ready trails for compliance and governance
The key is starting with proper instrumentation at the agent level, building a scalable observability stack, and continuously refining metrics based on operational experience. As AI agents become increasingly critical to business operations, observability isn't optional—it's foundational.
Ready to implement observability for your AI agents? Contact Tropical Media for expert guidance on production-grade AI automation with comprehensive monitoring and observability.