@cognitive-swarm/otel
OpenTelemetry distributed tracing for cognitive-swarm. Zero overhead when no provider is configured.
Install
npm install @cognitive-swarm/otel @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc
instrumentSwarm()
The main entry point. Wraps an orchestrator with full OTel instrumentation:
import { instrumentSwarm } from '@cognitive-swarm/otel'
import { SwarmOrchestrator } from '@cognitive-swarm/orchestrator'
const swarm = new SwarmOrchestrator(config)
const instrumented = instrumentSwarm(swarm, {
agentCount: config.agents.length,
maxRounds: config.maxRounds,
})
const result = await instrumented.solve('task')
// All event types are now traced as spans
Returns an InstrumentedOrchestrator:
interface InstrumentedOrchestrator {
solve(task: string): Promise<SwarmResult>
solveWithStream(task: string): AsyncIterable<SwarmEvent>
destroy(): void
/** Remove OTel subscriptions without destroying the orchestrator. */
dispose(): void
}
InstrumentSwarmOptions
interface InstrumentSwarmOptions {
/** Number of agents in the swarm (recorded on the root span). */
readonly agentCount?: number
/** Max rounds configured (recorded on the root span). */
readonly maxRounds?: number
}
SpanManager
Lower-level access if you need to manage spans directly. The SpanManager maintains the active span hierarchy and maps swarm events to OTel spans. Every public method is wrapped in try-catch so tracing failures never crash the swarm.
import { SpanManager } from '@cognitive-swarm/otel'
const manager = new SpanManager()
manager.startSolve('my task', 3, 5)
// ... events flow in ...
manager.endSolve(result)
manager.cleanup() // end any orphaned spans
Internal span tree:
solve -> round:N -> agent:X / debate / advisor
                 -> tool:Y (child of round)
solve -> synthesize
Complete Setup: Jaeger + Grafana with Docker Compose
docker-compose.yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.58
    ports:
      - '16686:16686' # Jaeger UI
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: 'true'
  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - '3100:3000'
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: 'true'
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin
    volumes:
      - grafana-data:/var/lib/grafana
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - '4327:4317' # OTLP gRPC (external)
      - '4328:4318' # OTLP HTTP (external)
volumes:
  grafana-data:
otel-config.yaml (for the collector)
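receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  # The dedicated jaeger exporter is no longer shipped in recent collector
  # releases; Jaeger accepts OTLP directly, so use an otlp exporter instead.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # debug replaces the deprecated logging exporter
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, debug]
Application setup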
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { Resource } from '@opentelemetry/resources'
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions'
import { instrumentSwarm } from '@cognitive-swarm/otel'
import { SwarmOrchestrator } from '@cognitive-swarm/orchestrator'
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'cognitive-swarm',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4317',
}),
})
sdk.start()
// Instrument the swarm
const swarm = new SwarmOrchestrator(config)
const instrumented = instrumentSwarm(swarm, {
agentCount: config.agents.length,
maxRounds: config.maxRounds,
})
const result = await instrumented.solve('Analyze this architecture')
// Graceful shutdown - flush remaining spans
await sdk.shutdown()
Setup with Zipkin
import { ZipkinExporter } from '@opentelemetry/exporter-zipkin'
const sdk = new NodeSDK({
traceExporter: new ZipkinExporter({
url: 'http://localhost:9411/api/v2/spans',
}),
})
Integration with Cloud Providers
Datadog
npm install @opentelemetry/exporter-trace-otlp-http
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'cognitive-swarm',
'deployment.environment': 'production',
}),
traceExporter: new OTLPTraceExporter({
url: 'https://trace.agent.datadoghq.com/api/v0.2/traces',
headers: { 'DD-API-KEY': process.env.DD_API_KEY! },
}),
})
AWS X-Ray
npm install @opentelemetry/id-generator-aws-xray @opentelemetry/propagator-aws-xray
import { AWSXRayIdGenerator } from '@opentelemetry/id-generator-aws-xray'
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'cognitive-swarm',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4317', // AWS OTel Collector sidecar
}),
idGenerator: new AWSXRayIdGenerator(),
textMapPropagator: new AWSXRayPropagator(),
})
New Relic
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'https://otlp.nr-data.net:4317',
headers: { 'api-key': process.env.NEW_RELIC_LICENSE_KEY! },
}),
})
Custom Span Attributes
All span attribute keys are exported as ATTR constants:
import { ATTR } from '@cognitive-swarm/otel'
Full Attribute Reference
| Category | Key | Type | Description |
|---|---|---|---|
| Solve | swarm.task | string | Task text (truncated to 256 chars) |
| | swarm.agent_count | number | Number of agents |
| | swarm.max_rounds | number | Max rounds configured |
| | swarm.rounds_used | number | Actual rounds used |
| | swarm.total_signals | number | Total signals in the log |
| | swarm.consensus_reached | boolean | Whether consensus was reached |
| | swarm.confidence | number | Final confidence (0-1) |
| | swarm.tokens | number | Total tokens used |
| | swarm.cost_usd | number | Estimated cost in USD |
| Round | swarm.round.number | number | Round ordinal |
| | swarm.round.signal_count | number | Signals emitted in this round |
| Agent | swarm.agent.id | string | Agent identifier |
| | swarm.agent.name | string | Agent display name |
| | swarm.agent.strategy | string | Strategy used for reaction |
| | swarm.agent.processing_time_ms | number | Processing time in ms |
| Signal | swarm.signal.type | string | Signal type (proposal, vote, etc.) |
| | swarm.signal.id | string | Signal unique ID |
| Tool | swarm.tool.name | string | Tool name |
| | swarm.tool.is_error | boolean | Whether tool call errored |
| | swarm.tool.duration_ms | number | Tool execution time in ms |
| Debate | swarm.debate.resolved | boolean | Whether debate reached resolution |
| | swarm.debate.rounds | number | Number of debate rounds |
| Advisor | swarm.advisor.action_type | string | Advisor action type |
| Topology | swarm.topology.reason | string | Why topology was updated |
| | swarm.topology.neighbor_count | number | Number of nodes in topology |
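As a rough illustration of how the solve-level keys in this table fit together, the sketch below builds the attribute map for a root span. The helper names `truncateTask` and `buildSolveAttributes` and the constant `MAX_TASK_LENGTH` are hypothetical and exist only for this example; the package exposes the real keys through `ATTR`, and its internals may differ.

```typescript
// Hypothetical helpers for illustration; not part of @cognitive-swarm/otel.
type AttributeValue = string | number | boolean

const MAX_TASK_LENGTH = 256 // mirrors the truncation noted for swarm.task

function truncateTask(task: string): string {
  return task.length > MAX_TASK_LENGTH ? task.slice(0, MAX_TASK_LENGTH) : task
}

function buildSolveAttributes(
  task: string,
  agentCount?: number,
  maxRounds?: number,
): Record<string, AttributeValue> {
  const attrs: Record<string, AttributeValue> = {
    'swarm.task': truncateTask(task),
  }
  // Optional values are recorded only when the caller supplies them.
  if (agentCount !== undefined) attrs['swarm.agent_count'] = agentCount
  if (maxRounds !== undefined) attrs['swarm.max_rounds'] = maxRounds
  return attrs
}

console.log(buildSolveAttributes('Analyze this architecture', 3, 5))
```

Truncating long task text before it is attached keeps span payloads small and avoids exporting large prompts to the tracing backend.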
Adding Custom Attributes to Spans
You can extend the instrumentation by subscribing to the orchestrator events alongside the built-in instrumentation:
import { trace } from '@opentelemetry/api'
import { instrumentSwarm } from '@cognitive-swarm/otel'
const instrumented = instrumentSwarm(swarm)
const requestId = 'req-123' // placeholder: use your own request identifier
// Add custom business-level attributes
swarm.on('solve:complete', (event) => {
  // getActiveSpan() returns undefined when no span is active in the current context
  const activeSpan = trace.getActiveSpan()
  if (activeSpan) {
    activeSpan.setAttribute('app.department', 'research')
    activeSpan.setAttribute('app.request_id', requestId)
    activeSpan.setAttribute('app.user_tier', 'premium')
  }
})
Span Hierarchy
A visual representation of the full span tree for a typical solve:
solve [task="Analyze...", agents=3, maxRounds=5]
  round [1] [round.number=1]
    agent:on-signal [analyst] [agent.id, strategy, processing_time_ms]
    agent:on-signal [critic] [agent.id, strategy, processing_time_ms]
    tool:execute [web-search] [tool.name, duration_ms, is_error=false]
    (event) signal:emitted [signal.id, signal.type=discovery]
    (event) signal:emitted [signal.id, signal.type=proposal]
    (event) signal:delivered [signal.id, agent.id]
    (event) consensus:failed [failure_reason=no_majority]
  round [2] [round.number=2]
    debate [resolved=true, rounds=2, confidence=0.81]
      (event) debate:round [round=1]
      (event) debate:round [round=2]
    (event) advisor:action [action_type=inject-signal]
    (event) topology:updated [reason=pruned-edge]
    (event) consensus:reached [decided=true, confidence=0.79]
  synthesize []
  (attributes on solve at end) [rounds_used=2, tokens=3200, cost_usd=0.0048]
Understanding Spans vs Events
- Spans have duration (start and end time): solve, round, agent:on-signal, tool:execute, debate, synthesize
- Events are point-in-time markers attached to a parent span: signal:emitted, signal:delivered, consensus:reached, advisor:action, etc.
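The distinction can be pictured with a small data model. The `TimedSpan` and `TimedEvent` types below are simplified stand-ins invented for this example, not the OpenTelemetry types: a span carries a start and an end, while an event carries a single timestamp and lives inside its parent span.

```typescript
// Simplified model for illustration; these are not the OpenTelemetry types.
interface TimedEvent {
  name: string
  timeMs: number // a single point in time
}

interface TimedSpan {
  name: string
  startMs: number
  endMs: number
  events: TimedEvent[] // events are attached to their parent span
}

function durationMs(span: TimedSpan): number {
  return span.endMs - span.startMs
}

const round: TimedSpan = {
  name: 'round',
  startMs: 1000,
  endMs: 1450,
  events: [
    { name: 'signal:emitted', timeMs: 1100 },
    { name: 'consensus:reached', timeMs: 1400 },
  ],
}

console.log(durationMs(round)) // 450
console.log(round.events.map((e) => e.name))
```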
Span Types
| Span | Key Attributes |
|---|---|
| solve | task, solveId |
| round | round.number |
| signal:emitted | signal.type, signal.source, signal.confidence |
| agent:reacted | agent.id, strategy.used, processing.ms |
| consensus:check | consensus.strategy, consensus.decided, consensus.confidence |
| math:round-analysis | entropy, normalized.entropy, information.gain |
| math:stopping | stopping.reason |
| advisor:action | advice.type, advice.reason |
| debate:start | proposal.a, proposal.b |
| debate:round | debate.round |
| debate:end | debate.resolved, debate.winner, debate.confidence |
| topology:updated | topology.reason |
| evolution:spawned | agent.id, agent.domain, spawn.reason |
| evolution:dissolved | agent.id, dissolve.reason |
| synthesis:start | - |
| synthesis:complete | answer.length |
| solve:complete | tokens, estimated.usd, rounds.used, total.ms |
| round:start | round.number |
| round:end | signal.count |
| checkpoint:saved | checkpoint.id |
Span Interpretation Guide
What Each Span Tells You
cognitive-swarm.solve -- the root span for the entire deliberation. Look here for:
- Total duration (was the swarm fast enough?)
- swarm.tokens and swarm.cost_usd (cost monitoring)
- swarm.consensus_reached (did the swarm converge?)
- swarm.rounds_used vs swarm.max_rounds (did it hit the limit?)
cognitive-swarm.round -- one deliberation cycle. Compare round durations to find:
- Which round took longest (slow agent? complex debate?)
- signal_count per round (decreasing = convergence, increasing = divergence)
cognitive-swarm.agent.on-signal -- individual agent processing. Key diagnostics:
- processing_time_ms identifies slow agents (LLM latency, complex tools)
- strategy shows which reasoning pattern was used
- Compare across agents to find bottlenecks
cognitive-swarm.tool.execute -- external tool calls (web search, code execution). Check:
- duration_ms for network latency issues
- is_error to track tool reliability
- Frequency: too many tool calls may indicate poorly scoped agents
cognitive-swarm.debate -- structured conflict resolution. Indicates:
- rounds used vs maxDebateRounds (did debate converge or get cut off?)
- confidence of the resolution (low = fragile consensus)
cognitive-swarm.synthesize -- final answer generation. Long synthesis spans may indicate:
- Complex answer aggregation
- Large context window being processed
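Two of the checks described in this guide, the per-round signal trend and the slowest agent, are easy to automate once span attributes have been collected. The following is a minimal sketch under the assumption that you have already extracted swarm.round.signal_count and swarm.agent.processing_time_ms values from a trace; `signalTrend`, `slowestAgent`, and `AgentTiming` are hypothetical names, not part of the package.

```typescript
// Hypothetical diagnostics helpers; inputs are values you extracted from spans.
type Trend = 'converging' | 'diverging' | 'mixed' | 'flat'

// Classify a series of per-round signal counts.
function signalTrend(counts: number[]): Trend {
  let up = 0
  let down = 0
  let prev: number | undefined
  for (const count of counts) {
    if (prev !== undefined) {
      if (count > prev) up++
      else if (count < prev) down++
    }
    prev = count
  }
  if (up === 0 && down === 0) return 'flat'
  if (up === 0) return 'converging' // counts only decreased
  if (down === 0) return 'diverging' // counts only increased
  return 'mixed'
}

interface AgentTiming {
  agentId: string
  processingTimeMs: number
}

// Return the agent with the largest processing time, if any.
function slowestAgent(timings: AgentTiming[]): AgentTiming | undefined {
  let slowest: AgentTiming | undefined
  for (const timing of timings) {
    if (slowest === undefined || timing.processingTimeMs > slowest.processingTimeMs) {
      slowest = timing
    }
  }
  return slowest
}

console.log(signalTrend([12, 8, 5])) // converging
console.log(
  slowestAgent([
    { agentId: 'analyst', processingTimeMs: 840 },
    { agentId: 'critic', processingTimeMs: 1920 },
  ]),
)
```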
Dashboard Query Examples (Jaeger UI)
Find all slow solves:
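service=cognitive-swarm operation=cognitive-swarm.solve minDuration=10s
Find failed consensus:
service=cognitive-swarm operation=cognitive-swarm.solve tags={"swarm.consensus_reached":"false"}
Find expensive solves:
service=cognitive-swarm operation=cognitive-swarm.solve tags={"swarm.cost_usd":">0.10"}
Note that Jaeger tag filters match exact values, so a threshold such as >0.10 may not be evaluated as a comparison. If this query returns nothing, filter on swarm.cost_usd in a custom exporter or a metrics backend instead.
Find tool errors:
service=cognitive-swarm operation=cognitive-swarm.tool.execute tags={"swarm.tool.is_error":"true"}
Performance Impact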
The instrumentation is designed for minimal overhead:
| Scenario | Overhead |
|---|---|
| No NodeSDK started (no provider) | ~0 (no-op tracer) |
| Provider active, Jaeger exporter | 1-3% of solve time |
| Provider active, console exporter | 5-8% (I/O bound) |
| Provider active, batch exporter (recommended for prod) | <1% |
Why Zero Overhead Without a Provider
The getTracer() function calls trace.getTracer() from @opentelemetry/api. When no TracerProvider is registered, the API returns a built-in no-op tracer. All startSpan() calls return no-op spans where setAttribute(), addEvent(), and end() are empty functions. This is the OTel API's design -- zero allocation, zero overhead.
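The no-op fallback can be pictured with a small model. This is a simplified sketch of the pattern, not the @opentelemetry/api source: `SpanLike`, `TracerLike`, `registerTracer`, and this local `getTracer` are stand-ins invented for the example.

```typescript
// Simplified model of the no-op fallback; not the @opentelemetry/api source.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): SpanLike
  addEvent(name: string): SpanLike
  end(): void
}

interface TracerLike {
  startSpan(name: string): SpanLike
}

// One shared span object whose methods do nothing.
const NOOP_SPAN: SpanLike = {
  setAttribute: () => NOOP_SPAN,
  addEvent: () => NOOP_SPAN,
  end: () => {},
}

const NOOP_TRACER: TracerLike = { startSpan: () => NOOP_SPAN }

let registered: TracerLike | undefined // set only when a provider is installed

function registerTracer(tracer: TracerLike): void {
  registered = tracer
}

function getTracer(): TracerLike {
  return registered ?? NOOP_TRACER
}

// No provider registered: every call is safe and records nothing.
const span = getTracer().startSpan('cognitive-swarm.solve')
span.setAttribute('swarm.agent_count', 3).addEvent('signal:emitted')
span.end()
console.log(span === NOOP_SPAN) // true
```

Because the same no-op span is reused for every call, instrumented code can stay unconditional: there is no need to check whether tracing is enabled before calling startSpan().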
Additionally, every method in SpanManager is wrapped in try-catch, so even if something unexpected happens in the tracing layer, it never crashes the swarm.
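One way to picture this guard is a small wrapper that swallows anything thrown by a tracing call. This is a sketch of the general pattern only; the `safely` helper is hypothetical, and SpanManager's actual implementation may look different.

```typescript
// Hypothetical sketch of the guard pattern; SpanManager's real code may differ.
function safely<Args extends unknown[]>(
  fn: (...args: Args) => void,
  onError: (err: unknown) => void = () => {},
): (...args: Args) => void {
  return (...args: Args): void => {
    try {
      fn(...args)
    } catch (err) {
      // Swallow tracing failures so the caller (the swarm) keeps running.
      onError(err)
    }
  }
}

const errors: unknown[] = []
const recordEvent = safely(
  (name: string) => {
    if (name === '') throw new Error('event name required')
  },
  (err) => errors.push(err),
)

recordEvent('signal:emitted') // succeeds
recordEvent('') // throws internally, but nothing propagates to the caller
console.log(errors.length) // 1
```

Routing swallowed errors to an optional callback keeps failures observable (for example in debug logs) without letting them interrupt a solve.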
Custom Exporters
import { NodeSDK } from '@opentelemetry/sdk-node'
import { SpanExporter, ReadableSpan } from '@opentelemetry/sdk-trace-base'
import { ExportResult, ExportResultCode, hrTimeToMilliseconds } from '@opentelemetry/core'
// `metrics` stands in for your own metrics client; it is not provided by this package.
class SwarmMetricsExporter implements SpanExporter {
  export(spans: ReadableSpan[], resultCallback: (result: ExportResult) => void): void {
    for (const span of spans) {
      if (span.name === 'cognitive-swarm.solve') {
        const tokens = span.attributes['swarm.tokens'] as number
        const cost = span.attributes['swarm.cost_usd'] as number
        const rounds = span.attributes['swarm.rounds_used'] as number
        // span.duration is an HrTime tuple [seconds, nanoseconds]; convert the whole tuple
        const durationMs = hrTimeToMilliseconds(span.duration)
        // Push to your metrics system
        metrics.recordSolve({ tokens, cost, rounds, durationMs })
      }
    }
    resultCallback({ code: ExportResultCode.SUCCESS })
  }
  shutdown(): Promise<void> {
    return Promise.resolve()
  }
}
// Use with NodeSDK
const sdk = new NodeSDK({
  traceExporter: new SwarmMetricsExporter(),
})
InstrumentableOrchestrator
The interface required for instrumentation (structural typing -- no import needed):
interface InstrumentableOrchestrator {
solve(task: string): Promise<SwarmResult>
solveWithStream(task: string): AsyncIterable<SwarmEvent>
on<K extends keyof SwarmEventMap & string>(
event: K,
handler: (data: SwarmEventMap[K]) => void,
): () => void
destroy(): void
}
Troubleshooting
No Spans Appearing in Jaeger
Check the SDK is started before instrumentation:
// WRONG - SDK not started yet
const instrumented = instrumentSwarm(swarm)
sdk.start()
// CORRECT
sdk.start()
const instrumented = instrumentSwarm(swarm)
Verify the exporter URL: The OTLP gRPC default is http://localhost:4317. If Jaeger runs on a different host or in Docker, adjust accordingly. From inside Docker, use the service name (e.g., http://jaeger:4317).
Flush before exit: Spans are batched. If the process exits immediately after solve(), spans may be lost:
const result = await instrumented.solve('task')
await sdk.shutdown() // flushes remaining spans
Check Jaeger is accepting OTLP: Jaeger all-in-one must have its OTLP receiver enabled. On releases where it is not enabled by default, set COLLECTOR_OTLP_ENABLED=true; without it, port 4317 is not opened.
Missing Events on Spans
Events (like signal:emitted, consensus:reached) appear in Jaeger under the "Logs" section of a span. If you see spans but no events:
- Verify the orchestrator emits events. The instrumentation subscribes to SwarmEventMap events via orchestrator.on(). If your custom orchestrator does not emit these events, no events will be recorded.
- Check the Jaeger UI: expand a round span and look in the "Logs" tab, not "Tags".
High Cardinality Warning
If you have many agents (50+) or long-running swarms (100+ rounds), you may see a large number of spans per trace. Mitigations:
- Use a BatchSpanProcessor with maxQueueSize and maxExportBatchSize limits.
- Consider sampling:
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'
const sdk = new NodeSDK({
  sampler: new TraceIdRatioBasedSampler(0.1), // sample 10% of traces
  traceExporter: exporter,
})
- For development, the ParentBasedSampler (default) works well. For production with high throughput, always use ratio-based or custom sampling.
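Ratio-based sampling decides per trace, not per span, so a sampled solve keeps its whole span tree. The sketch below is a simplified model of how a decision can be derived deterministically from the trace ID; `accumulate` and `shouldSample` are illustrative names, and the SDK's own sampler may compute its decision differently.

```typescript
// Simplified model of ratio-based sampling; the SDK's implementation may differ.

// Fold a 32-character hex trace ID into one unsigned 32-bit number.
function accumulate(traceId: string): number {
  let acc = 0
  for (let pos = 0; pos < traceId.length; pos += 8) {
    const part = parseInt(traceId.slice(pos, pos + 8), 16)
    acc = (acc ^ part) >>> 0 // keep the value an unsigned 32-bit integer
  }
  return acc
}

function shouldSample(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceId)) return false // expect 32 lowercase hex chars
  const clamped = Math.min(1, Math.max(0, ratio))
  const upperBound = Math.floor(clamped * 0xffffffff)
  return accumulate(traceId) < upperBound
}

const traceId = '0af7651916cd43dd8448eb211c80319c'
// The same trace ID always yields the same decision.
console.log(shouldSample(traceId, 0.1) === shouldSample(traceId, 0.1)) // true
console.log(shouldSample(traceId, 0)) // false: a ratio of 0 samples nothing
```

Because the decision depends only on the trace ID and the ratio, every span in a trace gets the same answer, which avoids partially recorded traces.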
Orphaned Spans
If a solve is interrupted (timeout, crash), the SpanManager.cleanup() method ends all open spans. This is called automatically by instrumentedOrchestrator.destroy(). If you use dispose() instead, it removes event subscriptions and calls cleanup() but leaves the orchestrator alive.
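The cleanup behavior can be modeled as a registry of open spans that are all ended in one pass. This is a hypothetical sketch of the idea; `OpenSpanRegistry` and `Endable` are invented for the example and do not describe SpanManager's internals.

```typescript
// Hypothetical model of orphan cleanup; not the package's SpanManager.
interface Endable {
  end(): void
}

class OpenSpanRegistry {
  private open: Record<string, Endable> = {}

  start(key: string, span: Endable): void {
    this.open[key] = span
  }

  end(key: string): void {
    const span = this.open[key]
    if (span) {
      span.end()
      delete this.open[key]
    }
  }

  /** End every span still open (e.g. after a timeout). Safe to call twice. */
  cleanup(): number {
    const keys = Object.keys(this.open)
    for (const key of keys) this.end(key)
    return keys.length
  }
}

let ended = 0
const registry = new OpenSpanRegistry()
registry.start('solve', { end: () => { ended++ } })
registry.start('round:1', { end: () => { ended++ } })
registry.end('round:1') // normal completion

console.log(registry.cleanup()) // 1: only the solve span was still open
console.log(ended) // 2
```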
Zero Overhead
When no OTel provider is configured (no NodeSDK started), every span operation is a no-op. The instrumentation layer checks for active providers before creating spans. No spans are allocated, no events are buffered, no timers are set.