Behavioral AI - @takk/behavioralai v1.0.0 - Apache-2.0

Know your agents' normal. Catch the drift first.

Your IM agent passes every test and still degrades in production: tool calls quietly start failing, retrieval thins out, latency and cost creep. Behavioral AI learns each agent's behavioral fingerprint, flags what is abnormal now, names the feature that moved, and forecasts what crosses the line next, before the visible failure.

201 tests passing
94% line coverage
0 runtime deps
SLSA provenance

What it is

One engine, two answers.

In a few words

You call observe() once per completed agent turn with numbers you already have: latency, tokens, cost, tool calls, finish reason. Behavioral AI learns what normal looks like for that agent (50 observations by default), then flags deviations, names the features that moved, projects when a trend crosses critical, and posts alerts to Slack, PagerDuty, or 11 other destinations. observe() is synchronous and never does I/O, so it cannot slow your agent down.

Technically

A behavioral observability engine keeping a per-unit statistical fingerprint: Welford and EWMA baselines plus P-square quantiles per numeric feature. Detection layers robust z-scores against the EWMA baseline, exact one-sided binomial tail tests for rate features, two-sided Page-Hinkley with exponential forgetting for sustained shifts, and bias-corrected Jensen-Shannon divergence for categorical mixes. Findings need 2-evaluation confirmation, any open finding freezes the baseline, attribution ranks the contributing features, and least-squares forecasting reports time-to-critical. Alert governance adds cooldown, a severity floor, and canary mode.
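To make the baseline idea concrete, here is a minimal sketch of one numeric feature tracked with a Welford estimator (long-run mean and variance) next to an EWMA estimator (recency-weighted), and a z-score whose standard deviation is floored by the long-run estimate. Every name here (FeatureBaseline, updateBaseline, flooredZ) is hypothetical and the code is not taken from the package; the P-square quantiles and whatever makes the library's z-score "robust" are left out.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
interface FeatureBaseline {
  n: number;        // observations folded in so far
  mean: number;     // Welford running mean
  m2: number;       // Welford sum of squared deviations from the mean
  ewmaMean: number; // recency-weighted mean
  ewmaVar: number;  // recency-weighted variance
}

function emptyBaseline(): FeatureBaseline {
  return { n: 0, mean: 0, m2: 0, ewmaMean: 0, ewmaVar: 0 };
}

// Fold one value into both estimators. alpha is the EWMA weight of the newest value.
function updateBaseline(b: FeatureBaseline, x: number, alpha: number): void {
  b.n += 1;
  const delta = x - b.mean;
  b.mean += delta / b.n;
  b.m2 += delta * (x - b.mean);
  if (b.n === 1) {
    b.ewmaMean = x;
    b.ewmaVar = 0;
  } else {
    const diff = x - b.ewmaMean;
    b.ewmaMean += alpha * diff;
    b.ewmaVar = (1 - alpha) * (b.ewmaVar + alpha * diff * diff);
  }
}

// Sample standard deviation from the Welford accumulator.
function longRunStd(b: FeatureBaseline): number {
  return b.n > 1 ? Math.sqrt(b.m2 / (b.n - 1)) : 0;
}

// Deviation from the EWMA mean, in units of the larger of the two deviations,
// so a briefly quiet recent window cannot shrink the denominator toward zero.
function flooredZ(b: FeatureBaseline, x: number): number {
  const std = Math.max(Math.sqrt(b.ewmaVar), longRunStd(b));
  return std > 0 ? (x - b.ewmaMean) / std : 0;
}
```

Both estimators update in constant time and constant memory per feature, which is what makes a synchronous, I/O-free observe() plausible.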

Before and after

The same incident, two timelines.

Take a real production scene: a dependency update subtly changes a tool's response format, and your support agent starts silently retrying. No exception, no error spike, just a slow ramp.

Without Behavioral AI

Quiet degradation, loud discovery

  1. Day 1: A provider-side change lands. The agent's tool failure rate starts creeping above its usual level.
  2. Day 2: Retries push latency and cost up a few percent. Every individual trace still looks plausible.
  3. Day 3: The tracer dashboard shows nothing red: there is no incident, only a shifting distribution.
  4. Day 5: A customer escalation arrives. On-call starts reading traces by hand.
  5. Day 6: Root cause found. The drift started five days earlier; nobody was watching the distribution.

With Behavioral AI

Caught while it is still a trend

  1. turn 31 baseline.ready: the engine has learned this agent's normal across every observed feature.
  2. turn 96 Drift is injected: latency and tool failures begin a slow ramp, far below any obvious threshold.
  3. turn 113 forecast.detected feature=latencyMs: projected to cross its critical threshold in about 0.78 h if the trend continues, while every current value is still in range.
  4. turn 139 drift.detected feature=latencyMs severity=warning, behavior score 94.
  5. turn 140 drift.detected feature=toolFailureRate severity=warning, behavior score 90.
  6. The alert lands in Slack with the attribution and the behavior score. The forecast arrived 17 turns after onset and 26 turns before the first hard detection: days before a human would have noticed.

The "With Behavioral AI" timeline is the literal output of behavioralai simulate --turns 160 --warmup 30 --seed 7. The simulation is fully deterministic: identical output on every run.

Install

Five minutes from install to first fingerprint.

1. Add the package

pnpm add @takk/behavioralai
# or: npm install @takk/behavioralai
# or: yarn add @takk/behavioralai
# or: bun add @takk/behavioralai

Zero runtime dependencies, on every entry. Nothing else to install.

2. Observe your first turns

import { createBehavioralAI } from '@takk/behavioralai';

const radar = createBehavioralAI({
  sensitivity: 'balanced',
  warmup: { minObservations: 50 },
});

// One call per completed agent turn. Synchronous, no I/O.
const report = radar.observe({
  agentId: 'support-agent',
  latencyMs: 842,
  costUsd: 0.0021,
  inputTokens: 1200,
  outputTokens: 310,
  contextTokens: 6100,
  toolCalls: [{ name: 'web_search', ok: true, latencyMs: 130 }],
  finishReason: 'stop',
});

console.log(report.status);        // 'learning' until warmup completes
console.log(report.behaviorScore); // 100 at baseline, drops under drift

3. Wire alert channels

import { createBehavioralAI } from '@takk/behavioralai';
import { pagerdutyChannel, slackChannel } from '@takk/behavioralai/channels';

const radar = createBehavioralAI({
  channels: [
    slackChannel({ webhookUrl: process.env.SLACK_WEBHOOK_URL ?? '' }),
    pagerdutyChannel({ routingKey: process.env.PAGERDUTY_ROUTING_KEY ?? '' }),
  ],
  alerts: { cooldownMs: 300_000, minSeverity: 'warning', canary: false },
});

Delivery is asynchronous; a channel failure becomes alert.failed telemetry and can never crash the agent.

4. Optional: observe the @takk siblings

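import { keymeshBridge, modelchainBridge } from '@takk/behavioralai/integrations';

// Assumes `radar` is the engine created in step 2, `router` is your
// @takk/modelchain router, and `pool` is your @takk/keymesh pool.

// Fingerprint every routed model: modelchain:<modelId> profiles.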
const stopRouter = modelchainBridge(router, radar, { perModel: true });

// Fingerprint the credential pool: keymesh:<keyId> profiles.
const stopPool = keymeshBridge(pool, radar, { perKey: true });

@takk/keymesh >= 1.0.0 and @takk/modelchain >= 1.0.0 are optional peers. The bridges use structural typing, importing nothing from the peers, and their type compatibility is proven in CI against the real published 1.0.0 declarations. modelchainAlertSummarizer can additionally append a model-written two-sentence incident summary to every alert.

Features

Nine capabilities, one job: know abnormal before it is visible.

Per-agent behavioral fingerprinting

Every observed unit earns its own learned normal: agents, skills, gateways, MCP servers, tools, models, or credential pools, named by convention (skill:summarize, gateway:openrouter, mcp:filesystem).

No global thresholds. An agent that is normally slow never pages you for being itself.

Zero initial configuration

No metrics to define, no thresholds to tune. The engine learns from 50 observations by default (configurable) and starts evaluating. Three presets: strict, balanced, relaxed.

You switch it on, it learns, it alerts. Tuning is a preset name, not a spreadsheet.

Multi-dimensional features

12 numeric features plus 2 categorical ones: tool selection patterns, turn distributions, retrieval profile, context signal-to-noise, error and tool-failure rates, not just latency and cost.

Drift in how an agent works is caught even while latency and cost still look fine.

Predictive alerts

A least-squares trend over the recent window projects time-to-critical in observations and hours within a 24 h horizon; forecast alerts fire while every current value is still in range. A forecast requires a significant trend (|slope| at least 4 standard errors) and is clamped to the feature domain, so stationary traffic stays quiet: at most 2 forecast events in 2000 turns in the benchmark.
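The forecasting step described above can be sketched as an ordinary least-squares fit over the recent window, gated by the slope's standard error. This is a rough illustration, not the library's code: forecastCrossing and its parameters are hypothetical names, the time per observation is passed in rather than learned, and the clamp to the feature domain is omitted.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
interface TrendForecast {
  slope: number;                  // fitted change per observation
  observationsToCritical: number; // whole observations until the line crosses
  hoursToCritical: number;
}

function forecastCrossing(
  recent: number[],          // recent window, oldest first
  critical: number,          // threshold the trend is projected against
  msPerObservation: number,  // assumed average spacing between observations
  minT = 4,                  // require |slope| >= minT standard errors
  horizonHours = 24,
): TrendForecast | null {
  const n = recent.length;
  if (n < 3) return null;
  const xMean = (n - 1) / 2;
  const yMean = recent.reduce((sum, y) => sum + y, 0) / n;
  let sxx = 0;
  let sxy = 0;
  recent.forEach((y, i) => {
    sxx += (i - xMean) ** 2;
    sxy += (i - xMean) * (y - yMean);
  });
  const slope = sxy / sxx;
  const intercept = yMean - slope * xMean;
  let sse = 0;
  recent.forEach((y, i) => {
    sse += (y - (intercept + slope * i)) ** 2;
  });
  const stdErr = Math.sqrt(sse / (n - 2) / sxx);
  // Significance gate: noisy, trendless traffic stops here.
  if (slope === 0 || Math.abs(slope) < minT * stdErr) return null;
  const fitted = intercept + slope * (n - 1); // fitted value at the newest point
  const steps = (critical - fitted) / slope;
  if (!(steps > 0)) return null; // already past the threshold, or moving away from it
  const hours = (steps * msPerObservation) / 3_600_000;
  if (hours > horizonHours) return null;
  return { slope, observationsToCritical: Math.ceil(steps), hoursToCritical: hours };
}
```

Because the projection starts from the fitted value rather than the latest raw sample, a single noisy reading does not move the estimate much.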

"This will cross critical in N hours if the trend continues" arrives before the incident, not after it.

OpenTelemetry as input

turnFromSpan and observeSpan map GenAI semantic-convention spans (gen_ai.* attributes) straight into observations; tool spans become tool:<name> profiles.

Complements tracers instead of replacing them: the same spans answer a different question.

Attribution and behavior score

Every finding ranks the contributing features with direction, observed vs expected values, and a one-line summary each; every turn returns an EWMA-smoothed 0 to 100 behavior score that counts warning-level deviations only.

Alerts name the cause. On-call starts at the feature that moved, not at a wall of traces.

13 alert destinations

Slack, Discord, Teams, Google Chat, Telegram, PagerDuty, Notion, Reddit, X, Google Sheets, Google Docs, generic webhook, and SMTP email. All fetch-based and dependency-free.

Alerts land where the team already lives, without installing a single vendor SDK.

Universal and tiny

Dual ESM + CJS across 7 entries plus the CLI. Node >= 20, browsers, and edge runtimes; the core is 8.88 kB brotli with zero runtime dependencies.

The same engine runs in a Cloudflare Worker and in your Node fleet.

Content-free by contract

The engine never sees prompt or completion text; ingestion is numbers, category labels, and identifiers. Snapshots hold aggregate statistics only: never credentials, never content.

Behavioral monitoring you can defend in a privacy review: there is nothing sensitive to leak.

Detection

Four detectors, one confirmation discipline.

Each detector answers a different failure shape; the discipline around them keeps the alert channel quiet until something is real.

Detector | Watches | What it catches
Robust z-score | numeric features | Point anomalies against the recency-weighted EWMA baseline, with the standard deviation floored by the long-run Welford estimate.
Exact binomial tail test | errorRate, toolFailureRate | Rate shifts, scoring the window count with an exact one-sided binomial tail test, no normal approximation, alerting on the harmful direction only.
Page-Hinkley (two-sided) | numeric features | Sustained sub-threshold shifts, around 1.5 to 3 sigma, that never trip a point threshold, confirmed over time with exponential forgetting; a confirmed shift opens a finding immediately and re-arms.
Jensen-Shannon divergence (bias-corrected) | toolSelection, finishReason | Changed tool or finish-reason mixes, as a bounded 0 to 1 distance between the recent window and the learned baseline, with the finite-sample bias (k-1)/(4 n ln 2) subtracted.
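Two of these detectors reduce to short formulas. The sketch below shows an exact upper-tail binomial probability, P(X >= k) for n trials at baseline rate p, and a Jensen-Shannon divergence in bits with the (k-1)/(4 n ln 2) term subtracted. The function names are hypothetical, and taking n as the recent-window sample size and k as the number of categories seen is an assumption, not something the text above confirms.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.

// Exact P(X >= k) for X ~ Binomial(n, p), summed term by term.
// Fine for window-sized n; very large n would underflow the first term.
function binomialUpperTail(k: number, n: number, p: number): number {
  if (k <= 0) return 1;
  if (k > n || p <= 0) return 0;
  if (p >= 1) return 1;
  let pmf = Math.pow(1 - p, n); // P(X = 0)
  let tail = 0;
  for (let i = 0; i <= n; i++) {
    if (i >= k) tail += pmf;
    pmf *= ((n - i) / (i + 1)) * (p / (1 - p)); // P(X = i + 1) from P(X = i)
  }
  return Math.min(1, tail);
}

type CategoryCounts = Record<string, number>;

// Jensen-Shannon divergence (base 2, so bounded by 1) between two count
// tables, minus the finite-sample bias term, clamped to the 0..1 range.
function correctedJsd(recent: CategoryCounts, baseline: CategoryCounts): number {
  const keys = Array.from(new Set(Object.keys(recent).concat(Object.keys(baseline))));
  const nRecent = keys.reduce((sum, key) => sum + (recent[key] ?? 0), 0);
  const nBase = keys.reduce((sum, key) => sum + (baseline[key] ?? 0), 0);
  if (nRecent === 0 || nBase === 0) return 0;
  let divergence = 0;
  for (const key of keys) {
    const p = (recent[key] ?? 0) / nRecent;
    const q = (baseline[key] ?? 0) / nBase;
    const m = (p + q) / 2;
    if (p > 0) divergence += 0.5 * p * Math.log2(p / m);
    if (q > 0) divergence += 0.5 * q * Math.log2(q / m);
  }
  const bias = (keys.length - 1) / (4 * nRecent * Math.LN2);
  return Math.min(1, Math.max(0, divergence - bias));
}
```

For example, 5 failures in a 20-call window against a 5% baseline rate gives an upper-tail probability of roughly 0.003, while 2 failures in the same window is unremarkable.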

The discipline around them

Rule | Behavior
Confirmation | A finding opens only after 2 consecutive out-of-range evaluations; one-sample outliers never open findings. The only immediate path is a Page-Hinkley confirmed sustained shift, which has already integrated evidence across many turns and re-arms after it fires.
Recovery | A drifted feature recovers after 5 consecutive evaluations comfortably back in range, below 0.7x the warning threshold, and emits a recovery alert (severity info).
Frozen baseline | While any finding is open, warning included, the baseline is frozen: anomalous turns cannot poison the learned normal the incident is being measured against.
absorb() | Accepts the recent window as the new normal, per feature or for the whole agent: baselines rebuilt, drift states reset, frozen features unfrozen.
Behavior score | An EWMA-smoothed 0 to 100 score that counts warning-level deviations only: healthy agents read a steady 100 (the benchmark bounds the healthy 5th percentile at 99 or above), and real drift pulls it down within 1 to 2 evaluations.
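The confirmation, recovery, and freeze rules amount to a small per-feature state machine. A minimal sketch follows, assuming each evaluation yields one non-negative deviation score compared against a warning threshold. The type and function names, and the "recovered" label, are hypothetical; only drift.detected appears in the engine's output. The immediate Page-Hinkley path is not modeled.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
interface FeatureDriftState {
  drifted: boolean;
  outOfRange: number; // consecutive evaluations at or above the warning threshold
  calm: number;       // consecutive evaluations below 0.7x the warning threshold
}

type DriftTransition = "drift.detected" | "recovered" | null;

function newDriftState(): FeatureDriftState {
  return { drifted: false, outOfRange: 0, calm: 0 };
}

function evaluateFeature(
  s: FeatureDriftState,
  score: number,
  warning: number,
): DriftTransition {
  if (!s.drifted) {
    // Confirmation: two consecutive out-of-range evaluations open a finding.
    s.outOfRange = score >= warning ? s.outOfRange + 1 : 0;
    if (s.outOfRange >= 2) {
      s.drifted = true;
      s.calm = 0;
      return "drift.detected";
    }
    return null;
  }
  // Recovery: five consecutive evaluations comfortably back in range.
  s.calm = score < 0.7 * warning ? s.calm + 1 : 0;
  if (s.calm >= 5) {
    s.drifted = false;
    s.outOfRange = 0;
    return "recovered";
  }
  return null;
}

// The baseline stops learning while any feature has an open finding.
function baselineFrozen(features: FeatureDriftState[]): boolean {
  return features.some((f) => f.drifted);
}
```

The gap between the warning threshold and the 0.7x recovery level gives hysteresis: a feature hovering right at the threshold cannot flap between drifted and recovered.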

Sensitivity presets

Preset | warning z | critical z | warn JSD | crit JSD | EWMA alpha | PH delta | PH lambda
strict | 2.5 | 3.5 | 0.07 | 0.18 | 0.08 | 0.01 | 50
balanced | 3 | 4.5 | 0.10 | 0.25 | 0.05 | 0.02 | 75
relaxed | 4 | 6 | 0.16 | 0.38 | 0.03 | 0.04 | 110

Pass sensitivity: '<preset>' or a partial SensitivityConfig merged over balanced when one number really must differ. The engine also takes maxAgents (default 1000) as a cardinality guard: observations for agents beyond the cap are ignored and surface as an error telemetry event.
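To show what the PH delta and PH lambda columns control, here is a sketch of a two-sided Page-Hinkley test in its cumulative-sum form with exponential forgetting. It assumes the input is a standardized deviation from the baseline and uses the balanced values as defaults; whether the library applies delta and lambda in these units is an assumption, and the forgetting factor of 0.99 is invented for illustration. All names are hypothetical.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
interface PageHinkleyState {
  up: number;   // accumulated evidence for an upward shift
  down: number; // accumulated evidence for a downward shift
}

type ShiftDirection = "up" | "down" | null;

function stepPageHinkley(
  ph: PageHinkleyState,
  z: number,          // standardized deviation of the newest value (assumed input)
  delta = 0.02,       // tolerated drift per step
  lambda = 75,        // evidence needed to confirm a sustained shift
  forgetting = 0.99,  // hypothetical decay applied to old evidence
): ShiftDirection {
  ph.up = Math.max(0, forgetting * ph.up + z - delta);
  ph.down = Math.max(0, forgetting * ph.down - z - delta);
  if (ph.up > lambda) {
    ph.up = 0;   // re-arm after firing
    ph.down = 0;
    return "up";
  }
  if (ph.down > lambda) {
    ph.up = 0;
    ph.down = 0;
    return "down";
  }
  return null;
}
```

Under these assumptions a steady 2-sigma shift confirms after 48 evaluations, while noise that alternates around the baseline never accumulates. A larger lambda trades detection delay for fewer false confirmations.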

Channels

Thirteen destinations, zero dependencies.

Every channel is fetch-based and universal, except email, which ships a minimal SMTP client for Node. Every credential accepts a TokenSource: a string, a function, or an async function, so secret managers plug in without glue code.

Destination | Factory | Entry | Wire format and auth
Slack | slackChannel | /channels | Incoming webhook.
Discord | discordChannel | /channels | Webhook.
Microsoft Teams | teamsChannel | /channels | Adaptive Card payload.
Google Chat | googleChatChannel | /channels | Webhook.
Telegram | telegramChannel | /channels | Bot token plus chat id.
PagerDuty | pagerdutyChannel | /channels | Events API v2.
Generic webhook | webhookChannel | /channels | JSON POST to any URL.
Notion | notionChannel | /channels | One database page per alert.
Reddit | redditChannel | /channels | Script-app OAuth2.
X | xChannel | /channels | OAuth2 bearer, or full OAuth 1.0a HMAC-SHA1 signed via WebCrypto.
Google Sheets | googleSheetsChannel | /channels | Row append; service-account auth built in.
Google Docs | googleDocsChannel | /channels | Document append; same service-account auth.
Email | emailChannel | /smtp | Minimal SMTP client: STARTTLS, implicit TLS, AUTH LOGIN, dot-stuffing. Node-only.

Google service-account RS256 JWT signing (googleAccessToken) is built in with token caching: no Google SDK anywhere. Channels never throw from send(); outcomes come back as alert.dispatched or alert.failed telemetry.

CLI

One binary, four commands.

help, simulate, inspect, and serve. The serve command is the bridge for Python-first stacks such as Hermes Agent: any process that can POST JSON gets a behavioral fingerprint.

Reproduce a detection, deterministically

npx @takk/behavioralai simulate --turns 160 --warmup 30 --seed 7

simulation: turns=160 warmup=30 drift-at=96 seed=7
turn 31 baseline.ready agent=sim-agent
turn 113 forecast.detected feature=latencyMs latencyMs is trending up and is projected to cross its critical threshold (1438.64) in about 0.78 h (94 observations) if the trend continues
turn 139 drift.detected feature=latencyMs severity=warning score=1.01 behavior=94
turn 140 drift.detected feature=toolFailureRate severity=warning score=3.10 behavior=90
...
--- summary ---
turns: 160
drift injected at turn: 96
first detection: turn 139 (delay 43 turns)
final behavior score: 74

Same seed, same output, every run. The transcript above is real engine output, not marketing copy. Note the order: the forecast fired at turn 113, 17 turns after injection and 26 turns before the first hard detection at turn 139. Catch the drift before the failure, demonstrated by the product's own deterministic demo.

Run the ingestion server

npx @takk/behavioralai serve --port 8787 --host 127.0.0.1 \
  --state .behavioralai/state.json --slack "$SLACK_WEBHOOK_URL"

# from the agent side: one observation per turn (single or array)
curl -X POST http://127.0.0.1:8787/observe \
  -H 'Content-Type: application/json' \
  -d '{"agentId":"hermes-main","latencyMs":1240,"outputTokens":380}'

The server binds to 127.0.0.1 by default, caps request bodies at 1 MB, and also exposes GET /inspect and GET /healthz. Pass --token <secret> to require Authorization: Bearer on every endpoint except /healthz; this is recommended whenever the server binds beyond localhost. Alerts go to --slack or --webhook URLs.
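If the posting process is itself written in TypeScript, the same request can be assembled with a small helper. The sketch below only builds the request; buildObserveRequest is a hypothetical name, and the size guard uses character count as a rough stand-in for the 1 MB body cap, whose exact byte value is an assumption here.

```typescript
// Illustrative sketch only: this helper is not part of @takk/behavioralai.
interface ObserveRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildObserveRequest(
  baseUrl: string,
  observations: object | object[], // a single observation or an array
  token?: string,                  // the secret passed to serve via --token, if any
): ObserveRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) headers["Authorization"] = `Bearer ${token}`;
  const body = JSON.stringify(observations);
  // Conservative guard: character count approximates bytes for ASCII payloads.
  if (body.length > 1_000_000) {
    throw new Error("observation batch too large; split it into smaller posts");
  }
  return { url: `${baseUrl.replace(/\/+$/, "")}/observe`, method: "POST", headers, body };
}

// Usage, in any runtime with a global fetch:
//   const req = buildObserveRequest("http://127.0.0.1:8787", turn, secret);
//   await fetch(req.url, { method: req.method, headers: req.headers, body: req.body });
```

Keeping the builder free of I/O makes it easy to unit-test and to reuse with whichever HTTP client the agent already has.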

Inspect learned fingerprints

# live, while serve is running
curl http://127.0.0.1:8787/inspect

# or offline, from a persisted state file
npx @takk/behavioralai inspect --state .behavioralai/state.json

OpenTelemetry and Hermes Agent

Spans in, fingerprints out.

If your stack already exports OpenTelemetry GenAI semantic-convention spans, Behavioral AI needs no new instrumentation: feed the serialized spans to the mapper and every gen_ai.* producer becomes a behavioral profile.

import { createBehavioralAI } from '@takk/behavioralai';
import { observeSpan, turnFromSpan } from '@takk/behavioralai/otel';

const radar = createBehavioralAI();

// In your OTLP pipeline worker:
for (const span of batch.spans) observeSpan(radar, span);

// Or map manually when you want to inspect or amend the turn first.
const turn = turnFromSpan(chatSpan);

Chat spans map token usage, latency, cost, and finish reasons; tool-execution spans become their own tool:<name> profiles. Spans exported through the community hermes-otel plugin for Hermes Agent map directly.
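To make the mapping tangible, the sketch below turns a simplified span into an observation-like object. It is not the library's turnFromSpan: FlatSpan (attributes flattened into a plain record) and mapSpan are hypothetical, and only the attribute keys (gen_ai.operation.name, gen_ai.tool.name, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons) come from the OpenTelemetry GenAI semantic conventions.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
interface FlatSpan {
  name: string;
  startTimeUnixNano: string; // decimal strings, as in OTLP/JSON
  endTimeUnixNano: string;
  attributes: Record<string, string | number | string[]>;
}

interface MappedTurn {
  agentId: string;
  latencyMs: number;
  inputTokens?: number;
  outputTokens?: number;
  finishReason?: string;
}

// Split seconds from nanoseconds so the subtraction stays exact in doubles.
function durationMs(start: string, end: string): number {
  const secs = Number(end.slice(0, -9)) - Number(start.slice(0, -9));
  const nanos = Number(end.slice(-9)) - Number(start.slice(-9));
  return secs * 1000 + nanos / 1e6;
}

function asNumber(v: string | number | string[] | undefined): number | undefined {
  return typeof v === "number" ? v : undefined;
}

function mapSpan(span: FlatSpan, fallbackAgentId: string): MappedTurn {
  const a = span.attributes;
  const toolName = a["gen_ai.tool.name"];
  const isTool = a["gen_ai.operation.name"] === "execute_tool" && typeof toolName === "string";
  const turn: MappedTurn = {
    // Tool-execution spans get their own tool:<name> profile.
    agentId: isTool ? `tool:${toolName}` : fallbackAgentId,
    latencyMs: durationMs(span.startTimeUnixNano, span.endTimeUnixNano),
  };
  const input = asNumber(a["gen_ai.usage.input_tokens"]);
  const output = asNumber(a["gen_ai.usage.output_tokens"]);
  const reasons = a["gen_ai.response.finish_reasons"];
  if (input !== undefined) turn.inputTokens = input;
  if (output !== undefined) turn.outputTokens = output;
  if (Array.isArray(reasons) && reasons.length > 0) turn.finishReason = reasons[0];
  return turn;
}
```

Note that nothing in the mapped turn carries prompt or completion text, which is consistent with the content-free contract described earlier.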

Hermes v0.13 Tenacity added zombie detection and heartbeat monitoring for its Kanban workers. Behavioral AI extends that instinct to the whole stack: skill behavior fingerprinting, gateway pattern analysis, and MCP server health profiling, by naming convention alone (skill:summarize, gateway:openrouter, mcp:filesystem). For Python-first deployments without OTel, the serve bridge does the same job over plain HTTP.

Compare

Behavioral AI next to the tools you already run.

The other tools in this space solve adjacent problems well. The contrast clarifies where Behavioral AI sits: it is the watchdog layer, not another tracer.

Capability | Behavioral AI | Tracers | LLM metric dashboards | Hand-rolled checks
Core question | what is abnormal now, what crosses next | what happened in this trace | how metrics aggregate over time | whatever you encoded
Learns each agent's normal | yes, per-agent fingerprint | no | no, static thresholds | rarely
Initial configuration | none: switch on, it learns | instrument, then browse | define metrics and alerts | weeks of tuning
Predictive time-to-critical | yes | no | generic forecasting | no
Sees prompt content | never | yes, stores it | depends | varies
Runs in-process | yes, observe() p99 < 1 ms | SDK + backend service | agent + SaaS | yes
Runtime dependencies | 0 | many | n/a (hosted) | varies
OTel GenAI spans as input | yes | produces and stores them | partial | n/a
License | Apache-2.0 | varies | commercial | your call

The honest summary: keep your tracer. Braintrust, Langfuse, LangSmith, LangWatch, Helicone, and Datadog LLM Observability are excellent at showing what happened. Behavioral AI watches the same stream and raises what is abnormal now and what crosses the line next. Same spans in, a different question answered.

Signals

The exact features the engine fingerprints.

Only the dimensions you provide are fingerprinted, with one exception: errorRate is always extracted (absent error means 0), so silent failure onset is always observable. If a number or label is not listed here, the engine never sees it.

Feature | Kind | What it captures
latencyMs | numeric | End-to-end turn latency.
costUsd | numeric | Spend per turn.
inputTokens | numeric | Prompt-side token volume.
outputTokens | numeric | Completion-side token volume.
totalTokens | numeric, derived | Combined token volume per turn.
contextTokens | numeric | Context-window load.
contextSnr | numeric, derived | Output tokens per context token: context signal-to-noise.
retrievalChunks | numeric | Retrieval profile per turn.
toolCallCount | numeric | Tool usage volume.
toolFailureRate | rate (binomial tail) | Share of failing tool calls.
turnIndex | numeric | Position within the task: task-length distribution.
errorRate | rate (binomial tail), always on | Turn-level error onset.
toolSelection | categorical (JSD) | Which tools the agent picks, one sample per call.
finishReason | categorical (JSD) | How turns end.

Everything arrives in one TurnObservation: agentId plus optional timestamp, latencyMs, costUsd, inputTokens, outputTokens, contextTokens, retrievalChunks, toolCalls, turnIndex, taskId, finishReason, error, metadata. Fifteen telemetry kinds report everything the engine does, from observation.recorded to state.persisted.
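The derived and always-on features in the table can be computed from a single observation in a few lines. The sketch below is one reading of those definitions, not the engine's extraction code: the types are trimmed-down stand-ins, deriveFeatures is a hypothetical name, the error field's type is assumed, and totalTokens is assumed to be input plus output tokens.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
interface SketchToolCall {
  name: string;
  ok: boolean;
}

interface SketchTurn {
  agentId: string;
  inputTokens?: number;
  outputTokens?: number;
  contextTokens?: number;
  toolCalls?: SketchToolCall[];
  error?: boolean | string; // assumed shape of the error field
}

interface DerivedFeatures {
  errorRate: number;        // always extracted: an absent error counts as 0
  totalTokens?: number;
  contextSnr?: number;
  toolCallCount?: number;
  toolFailureRate?: number;
}

function deriveFeatures(t: SketchTurn): DerivedFeatures {
  const f: DerivedFeatures = { errorRate: t.error ? 1 : 0 };
  if (t.inputTokens !== undefined && t.outputTokens !== undefined) {
    f.totalTokens = t.inputTokens + t.outputTokens;
  }
  if (t.outputTokens !== undefined && t.contextTokens !== undefined && t.contextTokens > 0) {
    f.contextSnr = t.outputTokens / t.contextTokens; // output tokens per context token
  }
  if (t.toolCalls !== undefined) {
    f.toolCallCount = t.toolCalls.length;
    if (t.toolCalls.length > 0) {
      const failed = t.toolCalls.filter((c) => !c.ok).length;
      f.toolFailureRate = failed / t.toolCalls.length; // share of failing calls
    }
  }
  return f;
}
```

Features whose inputs are missing are simply left out, which matches the rule that only the dimensions you provide are fingerprinted.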

Quality and validation

The receipts behind v1.0.0.

Tests & coverage

201 tests passing across 14 suites under Vitest, including a labeled detection-quality benchmark: 7 deterministic scenarios covering a stationary control, sustained 2.5-sigma and 3.2-sigma shifts, an abrupt 6-sigma regression, an error-rate spike with an anti-poisoning bound, a finish-reason mix shift, and a forecast-before-critical ramp. The mechanism tests prove the math, the benchmark proves detection quality, and both run in CI. Coverage: 94.4% lines, 92.88% statements, 95.51% functions, 85.08% branches, with enforced thresholds of 80/80/80/60. Run pnpm test on a fresh clone to reproduce.

Type safety

TypeScript in maximum strict mode, zero errors. attw green for all 8 entry conditions: dual ESM + CJS with separate .d.ts and .d.cts for all 7 library entries plus the behavioralai CLI bin.

Lint & packaging

Biome clean. publint clean. Built with tsup on pnpm@10.34.1; Node >= 20 with a CI matrix across Node 20, 22, and 24.

Deterministic validation

The CLI simulation is fully deterministic: simulate --turns 160 --warmup 30 --seed 7 prints baseline ready at turn 31, drift injected at turn 96, a forecast at turn 113, and first detection at turn 139, identical on every run. Sibling-integration types are proven in CI against the real published @takk/keymesh and @takk/modelchain 1.0.0 declarations.

CLI end-to-end

Subprocess tests cover the whole binary surface: help, unknown commands, deterministic simulate, inspect, and the serve HTTP surface. The SMTP channel is tested against a scripted local server, including STARTTLS and failure stages.

Supply chain

Committed pnpm lockfile and SLSA provenance attestation on every published version. Verify with npm view @takk/behavioralai@1.0.0 --json | jq .dist.attestations.

Bundle sizes (brotli)

Entry | Size
@takk/behavioralai (core) | 8.88 kB ESM / 9.04 kB CJS
/otel | 805 B
/channels | 3.26 kB
/smtp | 2.05 kB
/integrations | 744 B
/web | 8.26 kB
/edge | 8.26 kB

Roadmap

What is shipped, what is next, what is later.

Now (1.0)

Shipped in v1.0.0

  • Per-agent fingerprints: 12 numeric + 2 categorical features
  • Four detectors with confirmation, freeze, recovery, absorb
  • Trend forecasting with time-to-critical
  • Attribution and the 0 to 100 behavior score
  • 13 alert channels, enrichers, alert governance
  • OTel GenAI ingestion and the serve HTTP bridge
  • Memory and file state, dual ESM + CJS, SLSA provenance

Next (1.1)

Targeted for 1.1

  • Multivariate deviation with estimated covariance
  • Bayesian online changepoint detection as an optional detector
  • Redis/KV state backend for shared baselines across replicas
  • OTLP/HTTP receiver mode for behavioralai serve
  • Per-agent sensitivity overrides and per-feature mute lists
  • Alert templates per channel

Later

On the horizon

  • Compliance pack: continuous behavioral monitoring as evidence for SOC 2, ISO 42001, and EU AI Act workflows
  • Published observe() throughput benchmarks backing the SPEC SLOs
  • Middleware examples for Mastra and the Vercel AI SDK
  • Managed baselines as a possible cloud layer; the open core stays local-first

FAQ

Common questions.

Is Behavioral AI production-ready at 1.0.0?

Yes. 201 tests across 14 suites pass under Vitest with 94.4 percent line coverage; TypeScript maximum strict mode, Biome lint, publint, and attw on all 8 entry conditions are all clean. Every published release carries SLSA provenance produced by GitHub Actions. The deterministic CLI simulation reproduces a full detection end to end: baseline ready at turn 31, drift injected at turn 96, a forecast at turn 113, first detection at turn 139, identical on every run.

How is Behavioral AI different from tracers like Langfuse, LangSmith, or Braintrust?

Tracers record what happened: every span, prompt, and completion, searchable after the fact. Behavioral AI learns each agent's normal and tells you what is abnormal now and what crosses the line next, before a visible failure. It consumes the same OpenTelemetry GenAI spans your tracer already produces, so it complements the tracer instead of replacing it.

Do I have to configure metrics or thresholds?

No. You switch it on, it learns, it alerts. The engine fingerprints every feature you provide, with a 50-observation warmup by default. Three sensitivity presets (strict, balanced, relaxed) replace threshold tuning, and any preset value can be overridden when you really want to.

Does Behavioral AI see my prompts or completions?

Never. The ingestion contract is content-free: numbers, category labels, and caller-chosen identifiers. The engine never receives prompt or completion text at all, and the persisted StateSnapshot holds aggregate statistics only: never credentials, never content.

Can a failing alert channel break my agent?

No. observe() is synchronous and never performs I/O; alert delivery runs asynchronously and surfaces outcomes as telemetry. A channel failure becomes an alert.failed event, and no failure of a channel, enricher, or state backend can ever propagate an exception into the observed agent's call path.

Does this run in Cloudflare Workers, Vercel Edge, Deno, Bun, or browsers?

Yes. The /web and /edge entries ship the core surface for browsers and edge runtimes, and the channels, otel, and integrations entries are universal. Only the file state backend and the SMTP email channel are Node-only, and both load their builtins lazily.

How do I monitor a Python agent such as Hermes Agent?

Run behavioralai serve next to it. The CLI starts an HTTP ingestion server with POST /observe, GET /inspect, and GET /healthz, bound to 127.0.0.1 by default; your agent posts one small JSON observation per turn. If you already export OpenTelemetry GenAI spans, for example through the community hermes-otel plugin, feed them to turnFromSpan or observeSpan instead.

What happens when an agent's behavior changes on purpose?

Call absorb(agentId, feature?). The baseline is rebuilt from the recent window and the new behavior becomes the new normal. Until you do, the baseline stays frozen while any finding is open, warning included, so the incident cannot retrain the learned normal, and a drifted feature recovers on its own after 5 consecutive evaluations comfortably back in range, below 0.7 times the warning threshold.

How do false positives stay under control?

A finding opens only after 2 consecutive out-of-range evaluations, so one-sample outliers never open findings. The only immediate path is a Page-Hinkley confirmed sustained shift, where the detector has already integrated evidence across many turns. Alert delivery adds a per-agent cooldown (5 minutes by default) that only a higher severity bypasses, and canary mode lets you evaluate everything while delivering nothing during tuning.
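The delivery-side rules (severity floor, per-agent cooldown with a higher-severity bypass, canary mode) fit in one decision function. A minimal sketch follows, with hypothetical names and the caller tracking the last delivered alert per agent. How the library treats info-level recovery alerts under a warning floor is not stated, so this sketch simply filters them.

```typescript
// Illustrative sketch only: these names are not part of @takk/behavioralai.
type AlertSeverity = "info" | "warning" | "critical";

const SEVERITY_RANK: Record<AlertSeverity, number> = { info: 0, warning: 1, critical: 2 };

interface AlertGovernance {
  cooldownMs: number;
  minSeverity: AlertSeverity;
  canary: boolean;
}

interface LastDelivered {
  at: number; // epoch milliseconds of the last delivered alert for this agent
  severity: AlertSeverity;
}

function shouldDeliver(
  gov: AlertGovernance,
  last: LastDelivered | undefined,
  severity: AlertSeverity,
  now: number,
): boolean {
  if (gov.canary) return false; // evaluate everything, deliver nothing
  if (SEVERITY_RANK[severity] < SEVERITY_RANK[gov.minSeverity]) return false;
  if (last === undefined) return true;
  if (now - last.at >= gov.cooldownMs) return true;
  // Inside the cooldown window, only an escalation gets through.
  return SEVERITY_RANK[severity] > SEVERITY_RANK[last.severity];
}
```

With a 300,000 ms cooldown, a second warning one minute after the first is suppressed, while a critical alert in the same window is still delivered.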

What is the policy on breaking changes?

Strict SemVer 2.0.0, starting from 1.0.0. The binding stability surface is documented in SPEC.md section 5: every export of the seven library entries, the CLI surface, the StateSnapshot v1 schema, the telemetry kinds, and the channel wire formats. Major bumps require a deprecation cycle; security fixes follow the disclosure flow in SECURITY.md.

Author

Built and maintained by David C Cavalcante.

David C Cavalcante

Founder, Takk Innovate Studio

Product Engineer, ML Engineer, LLM Architect, and researcher. Builder of the @takk family of npm packages: infrastructure for Massive Intelligence (IM) systems and the non-human entities (NHE) they run.

Behavioral AI is the watchdog layer of the @takk portfolio, the third published package after @takk/keymesh and @takk/modelchain: route models with modelchain, govern credentials with keymesh, and observe both with Behavioral AI through the built-in bridges. Adjacent research by the author covers systemic intelligence frameworks (MAIC, HIM, NHE) published independently of this codebase, with research notes on PhilPapers and PhilArchive linked from the repository README.

If Behavioral AI caught a drift before your users did, the most useful thing you can do is open a GitHub issue when you find an edge case the 201 tests missed. The release runbook, the threat model, and the contributor agreement all live in the repository.