@takk/automend - v1.0.0 - Apache-2.0

Catch the agent failure before your users do.

Your agent invents a product id that does not exist. The output looks plausible, so nobody notices for three days, and by then customers have the problem. AutoMend scores the output, detects the failure mode (ungrounded claim, drift, loop, corruption), decides a recovery, and acts, all in the hot path, all recorded in a tamper-evident audit trail.

112 tests passing
93% line coverage
0 runtime deps
SLSA provenance

What it is

One engine, two answers.

In a few words

You give AutoMend the signals you already have about an agent's output: a confidence number, claims classified against your evidence, behavioral metrics, a tool result. It scores them, decides whether something is wrong, and if it is, picks a recovery (rollback, retry with adjusted guardrails, escalate, or ask a human) and runs the callback you provided. Every decision lands in a tamper-evident audit log. Your agent code does not change shape: you wrap a step, you do not rewrite it.

Technically

A deterministic policy and scoring engine that runs a detect, decide, act loop: weighted confidence scoring, GSAR-style typed grounding (four-way claims, asymmetric contradiction penalty, three-tier decision), a Welford drift baseline with z-score detection, heuristic loop and corruption detectors, an ordered recovery policy with safe mode and a retry budget, and a SHA-256 hash-chain audit seal over the Web Crypto API. It never calls a model. Zero runtime dependencies, node-free core.

Before and after

The same silent failure, two timelines.

Take a real production scene: a customer-facing agent answers an order question and confidently cites a product id that does not exist in your catalog. The text reads perfectly.

Without AutoMend

A plausible lie, discovered three days later

  1. Day 0, 09:14 The agent returns a fabricated product id. The output looks correct.
  2. Day 0 The answer ships to the customer. No alarm, the response was well-formed.
  3. Day 1 A handful of customers act on the wrong id. Support tickets trickle in.
  4. Day 2 The tickets get linked. Someone suspects the agent.
  5. Day 3 An engineer reproduces it, finds the hallucination, and starts a manual audit.
  6. Day 3 There is no record of which other answers were affected. The audit is guesswork.

With AutoMend

Caught in the hot path, recovered and recorded

  1. 09:14:01 The claim is classified against your catalog evidence: contradicted.
  2. 09:14:01 Typed grounding scores the output, the tier is replan.
  3. 09:14:01 AutoMend raises a high-severity issue. Safe mode declines to auto-act.
  4. 09:14:01 The recovery decision is escalate; your executor opens a review task.
  5. 09:14:01 The detection, the decision, and the outcome are appended to the audit log.
  6. End of run The log is sealed with a SHA-256 root, verifiable later, intact.
  7. The wrong id never reached the customer. The incident is one entry, not a three-day hunt.

AutoMend does not judge truth on its own. The contradicted classification comes from the evidence you feed it (a catalog lookup, a matcher, or a model); AutoMend turns that signal into a deterministic, audited recovery decision.

Install

Five minutes from install to first recovery.

1. Add the package

# run the one command that matches your package manager
pnpm add @takk/automend
npm install @takk/automend
yarn add @takk/automend
bun add @takk/automend

2. Guard an agent step

Feed AutoMend the signals you already measure. It scores them, decides a recovery, and runs your executor.

import { createAutoMend } from '@takk/automend';

const automend = createAutoMend();

const report = await automend.guard({
  confidence: [{ name: 'self-reported', value: 0.32 }],
  claims: [
    { id: 'c1', type: 'grounded', evidenceType: 'observed' },
    { id: 'c2', type: 'contradicted', evidenceType: 'observed' },
  ],
  executors: {
    // regenerateWithGuardrails and openReviewTask stand for your own functions
    retry: () => regenerateWithGuardrails(),
    escalate: () => openReviewTask(),
  },
});

if (!report.healthy) {
  console.log(report.decision?.strategy); // e.g. 'escalate'
}

3. Detect without a model

The heuristic detectors turn raw output into classifications, so the loop works before you wire in a model.

import { classifyClaims, detectLoop, loopIssue } from '@takk/automend/detectors';

// lexical grounding matcher: text + evidence -> classified claims
const claims = classifyClaims(
  [{ id: 'c1', text: 'the order shipped on monday' }],
  ['observation: the order shipped on monday from the warehouse'],
);

// loop / recursion detector over step fingerprints
const loop = detectLoop(['search', 'search', 'search']);

// automend and executors are the instance and callbacks from step 2
await automend.guard({ issues: [loopIssue(loop)].filter(Boolean), executors });
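
To make "a loop detector over step fingerprints" concrete, here is a minimal sketch that flags a run of identical consecutive fingerprints. It is not the code behind detectLoop: the name findRepeatedRun, the maxRepeats threshold, and the return shape are all assumptions made for illustration.

```typescript
// Illustrative sketch only; not the implementation behind detectLoop.
interface RepeatedRun {
  fingerprint: string;
  repeats: number;
}

// Returns the first fingerprint seen `maxRepeats` times in a row, if any.
function findRepeatedRun(fingerprints: string[], maxRepeats = 3): RepeatedRun | undefined {
  let previous: string | undefined;
  let run = 0;
  for (const fingerprint of fingerprints) {
    run = fingerprint === previous ? run + 1 : 1;
    previous = fingerprint;
    if (run >= maxRepeats) return { fingerprint, repeats: run };
  }
  return undefined;
}

const stuckRun = findRepeatedRun(['search', 'search', 'search']);
const healthyRun = findRepeatedRun(['search', 'fetch', 'search']);
console.log(stuckRun, healthyRun);
```

This version only sees back-to-back repeats. An alternating cycle such as search, fetch, search, fetch would need a windowed or cycle-aware check.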

4. Seal and verify the audit trail

import { createAuditLog } from '@takk/automend/audit';

const log = createAuditLog({ id: 'run-42' });
log.append('detection', 'low confidence', { score: 0.31 });
log.append('decision', 'escalate');

const seal = await log.seal(); // { algorithm: 'sha-256', root: '...', count: 2 }
(await log.verify(seal)).valid; // true; false if any entry is altered

Features

Nine capabilities, one deterministic loop.

Confidence scoring

Aggregate the signals you already measure (self-reported uncertainty, evidence coverage, schema validity) into one weighted score in [0, 1] with a per-signal breakdown and a coarse verdict.

A single, explainable number to threshold on, instead of scattered ad-hoc checks.
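
As a rough picture of what weighted aggregation with a per-signal breakdown can look like, here is a minimal sketch. It is not AutoMend's scoring code: the weight field, the verdict labels, and the thresholds are assumptions chosen for the example.

```typescript
// Illustrative sketch only; weights, verdict labels, and thresholds are assumptions.
interface Signal {
  name: string;
  value: number; // expected in [0, 1]
  weight?: number; // defaults to 1
}

type Verdict = 'high' | 'medium' | 'low';

interface Contribution {
  name: string;
  value: number;
  weight: number;
  contribution: number;
}

interface ConfidenceResult {
  score: number;
  verdict: Verdict;
  breakdown: Contribution[];
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function scoreConfidence(signals: Signal[], thresholds = { high: 0.8, low: 0.5 }): ConfidenceResult {
  const totalWeight = signals.reduce((sum, s) => sum + (s.weight ?? 1), 0);
  const breakdown = signals.map((s) => {
    const weight = s.weight ?? 1;
    const value = clamp01(s.value);
    // Each signal's share of the final score, so the number stays explainable.
    const contribution = totalWeight > 0 ? (value * weight) / totalWeight : 0;
    return { name: s.name, value, weight, contribution };
  });
  const score = clamp01(breakdown.reduce((sum, b) => sum + b.contribution, 0));
  const verdict: Verdict =
    score >= thresholds.high ? 'high' : score >= thresholds.low ? 'medium' : 'low';
  return { score, verdict, breakdown };
}

const confidence = scoreConfidence([
  { name: 'self-reported', value: 0.32, weight: 2 },
  { name: 'evidence-coverage', value: 0.5 },
  { name: 'schema-valid', value: 1 },
]);
console.log(confidence.score.toFixed(3), confidence.verdict);
```

The contributions sum to the score, which is what lets you explain a low number by pointing at the signal that dragged it down.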

Typed grounding (GSAR)

The GSAR scoring core: four-way claim typology, evidence-strength weights, an asymmetric contradiction penalty, and a three-tier decision (proceed, regenerate, replan).

A principled groundedness score that punishes contradictions harder than it rewards support, from a published method.

Drift monitor

A per-metric baseline maintained with Welford's online algorithm, flagging any observation beyond a z-score threshold, with an explicit insufficient-data state.

Catch a slow behavioral shift before it becomes an incident, with no history to store.
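
Welford's algorithm is what makes "no history to store" possible: the baseline is three numbers (count, mean, and a running sum of squared deviations). The sketch below shows the general technique with a z-score check and an explicit insufficient-data state. The function name, the defaults (minSamples of 5, zThreshold of 3), and the result shape are assumptions, not the library's API.

```typescript
// Illustrative sketch only; names, defaults, and result shape are assumptions.
type DriftResult =
  | { status: 'insufficient-data'; count: number }
  | { status: 'ok' | 'drift'; count: number; z: number };

// minSamples should be at least 2 so the sample variance is defined.
function createDriftBaseline(minSamples = 5, zThreshold = 3) {
  let count = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations from the mean

  // Compare an observation against the baseline without changing it.
  function check(x: number): DriftResult {
    if (count < minSamples) return { status: 'insufficient-data', count };
    const std = Math.sqrt(m2 / (count - 1)); // sample standard deviation
    let z: number;
    if (std === 0) z = x === mean ? 0 : Math.sign(x - mean) * Infinity;
    else z = (x - mean) / std;
    return { status: Math.abs(z) > zThreshold ? 'drift' : 'ok', count, z };
  }

  // Welford's online update: O(1) memory, one pass.
  function update(x: number): void {
    count += 1;
    const delta = x - mean;
    mean += delta / count;
    m2 += delta * (x - mean);
  }

  return { check, update };
}

const latency = createDriftBaseline();
const coldStart = latency.check(100);
for (const ms of [100, 102, 98, 101, 99]) latency.update(ms);
const typical = latency.check(101);
const spike = latency.check(140);
console.log(coldStart.status, typical.status, spike.status);
```

Whether a flagged observation is also folded into the baseline is a policy choice. Keeping check and update separate leaves that decision to the caller.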

Recovery orchestrator

An ordered policy (first matching rule wins) maps an issue to a strategy: proceed, retry, rollback, escalate, or ask a human. Safe mode and a retry budget keep it from auto-acting on serious failures.

Recovery decisions are deterministic and repeatable, not a tangle of inline if-statements.

Heuristic detectors

Model-free starters: a lexical grounding matcher turns text plus evidence into classified claims, a loop detector flags repetition, a corruption detector flags malformed output.

The loop works out of the box; swap any detector for a real model when you need more.
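
For a sense of what a lexical grounding heuristic can and cannot do, here is a minimal token-overlap sketch. It is not classifyClaims: the function name classifyByOverlap, the 0.9 coverage threshold, and the output shape are assumptions. Note that pure word overlap can only separate grounded from ungrounded; it cannot tell that a claim contradicts the evidence.

```typescript
// Illustrative sketch only; not the implementation behind classifyClaims.
interface TextClaim {
  id: string;
  text: string;
}

interface OverlapResult {
  id: string;
  type: 'grounded' | 'ungrounded';
  coverage: number;
}

const tokenize = (text: string): string[] => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// A claim counts as grounded when enough of its tokens appear in one evidence item.
function classifyByOverlap(claims: TextClaim[], evidence: string[], threshold = 0.9): OverlapResult[] {
  const evidenceSets = evidence.map((item) => new Set(tokenize(item)));
  return claims.map((claim): OverlapResult => {
    const tokens = tokenize(claim.text);
    let coverage = 0;
    for (const set of evidenceSets) {
      const hits = tokens.filter((token) => set.has(token)).length;
      coverage = Math.max(coverage, tokens.length > 0 ? hits / tokens.length : 0);
    }
    return { id: claim.id, type: coverage >= threshold ? 'grounded' : 'ungrounded', coverage };
  });
}

const classified = classifyByOverlap(
  [
    { id: 'c1', text: 'the order shipped on monday' },
    { id: 'c2', text: 'the order shipped on friday' },
  ],
  ['observation: the order shipped on monday from the warehouse'],
);
console.log(classified);
```

The second claim shares four of its five tokens with the evidence, so a looser threshold would wrongly accept it. That is the gap a real NLI model or a catalog lookup fills.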

Human escalation

An immutable, content-addressed escalation record carries the issue, the decision, and full context the moment automatic recovery is not safe.

When AutoMend pauses and asks, the human gets everything they need in one object.
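
"Content-addressed" usually means the record's id is a hash of its own content, so the same escalation always gets the same id. The sketch below shows one way to do that with Web Crypto. The field names, the key-sorted serialization, and buildEscalation itself are assumptions for illustration, not the record format AutoMend emits.

```typescript
// Illustrative sketch only; field names and the id scheme are assumptions.
interface EscalationInput {
  issue: { kind: string; severity: string };
  decision: { strategy: string; reason: string };
  context: Record<string, unknown>;
}

// Key-sorted JSON so property order does not change the hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).filter((k) => obj[k] !== undefined).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`).join(',')}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}

async function buildEscalation(input: EscalationInput) {
  const id = await sha256Hex(canonicalize(input));
  // Object.freeze is shallow; nested objects would need a recursive freeze.
  return Object.freeze({ id, ...input });
}

const escalationDemo = Promise.all([
  buildEscalation({
    issue: { kind: 'contradicted', severity: 'high' },
    decision: { strategy: 'escalate', reason: 'safe mode' },
    context: { runId: 'run-42', step: 3 },
  }),
  buildEscalation({
    context: { step: 3, runId: 'run-42' },
    decision: { reason: 'safe mode', strategy: 'escalate' },
    issue: { severity: 'high', kind: 'contradicted' },
  }),
]);
escalationDemo.then(([first, second]) => console.log(first.id === second.id, Object.isFrozen(first)));
```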

Tamper-evident audit

Every detection, decision, and outcome is appended to an audit log and sealed with a SHA-256 hash chain over the Web Crypto API. Verify later that nothing was altered.

A verifiable event log that supports the record-keeping EU AI Act Article 12 asks of high-risk systems.

Framework-agnostic and edge

Wrap any step with interceptors, bridge a failed MCP tool call into recovery, and run the node-free core in Node, Cloudflare Workers, Vercel Edge, Deno, Bun, and the browser.

One reliability layer for every agent, wherever it runs, with no framework lock-in.
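
The idea of wrapping a step without changing its shape can be sketched as a higher-order function: the wrapped step keeps the same signature, and a check decides whether the output passes through or goes to a recovery callback. This is a generic illustration, not AutoMend's interceptor API; wrapStep, Check, and the catalog example are hypothetical.

```typescript
// Illustrative sketch only; wrapStep is hypothetical and is not the library's guardStep.
type Check<T> = (output: T) => string | undefined; // an issue description, or undefined if fine

function wrapStep<A extends unknown[], T>(
  step: (...args: A) => Promise<T>,
  check: Check<T>,
  onIssue: (issue: string, output: T) => Promise<T>,
): (...args: A) => Promise<T> {
  return async (...args: A): Promise<T> => {
    const output = await step(...args);
    const issue = check(output);
    return issue === undefined ? output : onIssue(issue, output);
  };
}

const catalog = new Set(['SKU-100', 'SKU-200']);

// Stand-in for an agent call that sometimes invents a product id.
const answerOrder = async (orderId: string): Promise<string> =>
  orderId === 'o-1' ? 'SKU-100' : 'SKU-999';

const guardedAnswer = wrapStep(
  answerOrder,
  (sku) => (catalog.has(sku) ? undefined : `unknown product id ${sku}`),
  async (issue) => `escalated: ${issue}`,
);

const wrapDemo = Promise.all([guardedAnswer('o-1'), guardedAnswer('o-2')]);
wrapDemo.then((answers) => console.log(answers));
```

Callers of guardedAnswer use it exactly as they used answerOrder, which is the property the wrapper is meant to preserve.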

Zero-dependency, SLSA provenance

Zero required runtime dependencies and a core around 4.2 kB brotli. Every published version is signed with npm publish --provenance through GitHub Actions OIDC.

Verify in one command that the tarball you installed was built from the source commit you trust.

Recovery strategies

Five strategies, decided by an ordered policy.

The policy maps an issue to a strategy, first matching rule wins. You supply the executor for each; AutoMend decides which one runs.

Strategy | What it does | When the default policy picks it
proceed | Accept the output and continue. | No issue was raised, or your policy explicitly allows the issue.
retry | Run your retry executor, typically with adjusted guardrails. | Low confidence, ungrounded output, or a tool error, until the retry budget is exhausted.
rollback | Run your rollback executor to undo the step. | A contradicted claim or a detected loop.
escalate | Hand off to your escalation executor with full context. | High or critical severity in safe mode, an exhausted retry, or the fallback.
ask-human | Pause and request a human decision. | Your policy routes an issue to a human in the loop.

Safe mode (on by default) turns any automatic strategy into an escalation for high and critical issues, and the retry budget converts an exhausted retry into an escalation, so the loop never runs away.
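
The interaction between the ordered rules, safe mode, and the retry budget can be sketched as follows. This is a simplified model of the behavior described above, not the library's policy engine: the issue kinds other than contradicted, the option names, and the choice to treat escalate and ask-human as non-automatic are assumptions.

```typescript
// Illustrative sketch only; rule contents, issue kinds, and option names are assumptions.
type Strategy = 'proceed' | 'retry' | 'rollback' | 'escalate' | 'ask-human';
type Severity = 'low' | 'medium' | 'high' | 'critical';

interface Issue {
  kind: string;
  severity: Severity;
}

interface Rule {
  matches: (issue: Issue) => boolean;
  strategy: Strategy;
}

interface Decision {
  strategy: Strategy;
  reason: string;
  attempt: number;
}

interface PolicyOptions {
  safeMode: boolean;
  maxRetries: number;
}

const RETRYABLE = ['low-confidence', 'ungrounded', 'tool-error'];

// Ordered: the first rule whose predicate matches wins.
const rules: Rule[] = [
  { matches: (i) => i.kind === 'contradicted' || i.kind === 'loop', strategy: 'rollback' },
  { matches: (i) => RETRYABLE.includes(i.kind), strategy: 'retry' },
];

function decide(
  issue: Issue | undefined,
  attempt: number,
  options: PolicyOptions = { safeMode: true, maxRetries: 2 },
): Decision {
  if (issue === undefined) return { strategy: 'proceed', reason: 'no issue raised', attempt };

  const rule = rules.find((r) => r.matches(issue));
  if (rule === undefined) return { strategy: 'escalate', reason: 'fallback: no rule matched', attempt };

  const serious = issue.severity === 'high' || issue.severity === 'critical';
  const automatic = rule.strategy !== 'escalate' && rule.strategy !== 'ask-human';
  if (options.safeMode && serious && automatic) {
    return { strategy: 'escalate', reason: `safe mode: ${issue.severity} issue`, attempt };
  }
  if (rule.strategy === 'retry' && attempt >= options.maxRetries) {
    return { strategy: 'escalate', reason: 'retry budget exhausted', attempt };
  }
  return { strategy: rule.strategy, reason: `policy: ${issue.kind}`, attempt };
}

console.log(decide({ kind: 'contradicted', severity: 'high' }, 0).strategy);
console.log(decide({ kind: 'low-confidence', severity: 'low' }, 0).strategy);
console.log(decide({ kind: 'low-confidence', severity: 'low' }, 2).strategy);
```

Under these assumptions the high-severity contradicted claim matches the rollback rule but is escalated by safe mode, which is the path described in the "With AutoMend" timeline.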

Entry points

Import the whole engine, or just one piece.

Entry point | What it exports
@takk/automend | createAutoMend().guard() plus the full re-exported surface.
@takk/automend/confidence | Weighted confidence scoring and verdicts.
@takk/automend/grounding | GSAR typed grounding and the three-tier decision.
@takk/automend/detectors | Lexical grounding matcher, loop detector, corruption detector.
@takk/automend/drift | Welford baseline and z-score drift detection.
@takk/automend/recovery | Ordered recovery policy, decide and run.
@takk/automend/escalation | Immutable escalation record builder.
@takk/automend/audit | Append-only audit log and SHA-256 seal.
@takk/automend/interceptors | guardStep, deterministic clock, issue adapters.
@takk/automend/mcp | MCP tool-error to recovery bridge.
@takk/automend/edge | The full node-free core for edge runtimes and the browser.

CLI

Score, assess, and verify from the terminal.

The automend binary runs the same deterministic engine over JSON files, with sysexits-style exit codes (0 ok, 1 verify failed, 64 usage, 65 bad data, 66 unreadable input).

Score and assess

# score confidence signals (JSON array or { signals: [...] })
npx automend score signals.json

# assess typed grounding for claims (JSON array or { claims: [...] })
npx automend assess claims.json

Inspect and verify an audit log

# summarize an audit log
npx automend inspect audit-log.json

# verify a log against its seal (exit 1 if the seal does not match)
npx automend verify audit-log.json audit-seal.json

Typed results

Every result is a typed object you can branch on.

No string parsing, no untyped events. guard returns a GuardReport; the audit log returns typed entries and a seal. Branch on report.healthy and report.decision?.strategy with full narrowing in TypeScript.

const report = await automend.guard({ /* signals */ });

// GuardReport
// {
//   healthy: false,
//   issues: [{ kind: 'contradicted', severity: 'high', detail: '...' }],
//   grounding: { score: 0, raw: -0.5, tier: 'replan', counts: {...}, total: 2 },
//   decision: { strategy: 'escalate', reason: 'safe mode: ...', attempt: 0, issue: {...} },
//   recovery: { executed: true, result: ... },
// }
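
Branching on a report with full narrowing might look like the sketch below. The GuardReportLike type is a local stand-in that mirrors only the fields shown in the sample above; the real exported types may differ, and nextStep is a hypothetical helper.

```typescript
// Illustrative sketch only; GuardReportLike is a local stand-in, not the exported GuardReport.
type Strategy = 'proceed' | 'retry' | 'rollback' | 'escalate' | 'ask-human';

interface GuardReportLike {
  healthy: boolean;
  issues: { kind: string; severity: string; detail: string }[];
  decision?: { strategy: Strategy; reason: string; attempt: number };
}

function nextStep(report: GuardReportLike): string {
  if (report.healthy || report.decision === undefined) return 'continue';
  const { strategy, reason } = report.decision;
  switch (strategy) {
    case 'proceed':
      return 'continue';
    case 'retry':
      return `retrying (${reason})`;
    case 'rollback':
      return `rolled back (${reason})`;
    case 'escalate':
      return `sent to review (${reason})`;
    case 'ask-human':
      return `waiting on a human (${reason})`;
    default: {
      // Compile-time exhaustiveness check: adding a strategy makes this line an error.
      const unhandled: never = strategy;
      return unhandled;
    }
  }
}

const sampleReport: GuardReportLike = {
  healthy: false,
  issues: [{ kind: 'contradicted', severity: 'high', detail: 'product id not in catalog' }],
  decision: { strategy: 'escalate', reason: 'safe mode', attempt: 0 },
};
console.log(nextStep(sampleReport));
```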

The sealed audit log

const seal = await automend.audit?.seal();
// AuditSeal
// { algorithm: 'sha-256', root: 'a3f1...64 hex chars', count: 4 }

const result = await automend.audit?.verify(seal);
// VerifyResult
// { valid: true } or { valid: false, brokenAt: 2 }

The seal root is a 64-character hex SHA-256 digest of the entry hash chain. Any altered, added, or removed entry flips valid to false on the next verify.
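
The general technique behind a hash-chain seal can be sketched in a few lines: each entry's hash covers the previous hash plus the entry's content, the seal records the last hash and the entry count, and verification recomputes the chain. This is a minimal illustration over Web Crypto, not AutoMend's on-disk format; the genesis value, the serialization, and the function names are assumptions.

```typescript
// Illustrative sketch only; not the library's seal format or hashing input.
interface AuditEntry { kind: string; summary: string; data?: unknown }
interface ChainedEntry extends AuditEntry { hash: string }
interface Seal { algorithm: 'sha-256'; root: string; count: number }
type VerifyResult = { valid: true } | { valid: false; brokenAt: number };

const GENESIS = '0'.repeat(64);

async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}

// Hash of the previous link plus this entry's content.
const link = (prev: string, e: AuditEntry): Promise<string> =>
  sha256Hex(prev + JSON.stringify([e.kind, e.summary, e.data ?? null]));

async function append(log: ChainedEntry[], entry: AuditEntry): Promise<void> {
  const prev = log[log.length - 1]?.hash ?? GENESIS;
  log.push({ ...entry, hash: await link(prev, entry) });
}

function seal(log: ChainedEntry[]): Seal {
  return { algorithm: 'sha-256', root: log[log.length - 1]?.hash ?? GENESIS, count: log.length };
}

async function verify(log: ChainedEntry[], sealed: Seal): Promise<VerifyResult> {
  let prev = GENESIS;
  let index = 0;
  for (const entry of log) {
    const expected = await link(prev, entry);
    if (expected !== entry.hash) return { valid: false, brokenAt: index };
    prev = expected;
    index += 1;
  }
  // Catches entries added or removed after sealing.
  if (log.length !== sealed.count || prev !== sealed.root) {
    return { valid: false, brokenAt: Math.min(log.length, sealed.count) };
  }
  return { valid: true };
}

async function chainDemo(): Promise<VerifyResult[]> {
  const log: ChainedEntry[] = [];
  await append(log, { kind: 'detection', summary: 'low confidence', data: { score: 0.31 } });
  await append(log, { kind: 'decision', summary: 'escalate' });
  const sealed = seal(log);
  const intact = await verify(log, sealed);
  const second = log[1];
  if (second !== undefined) second.summary = 'proceed'; // tamper after sealing
  const tampered = await verify(log, sealed);
  return [intact, tampered];
}

const chainResults = chainDemo();
chainResults.then((results) => console.log(results));
```

As the limits section notes, a chain like this shows that nothing changed after sealing. It says nothing about who produced the log.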

Compare

AutoMend vs the alternatives.

The other tools in this space solve adjacent problems well. The contrast clarifies where AutoMend sits.

Capability | AutoMend | Post-hoc eval | Guardrail libs | Hand-rolled
Category | real-time recovery | offline analysis | inline blocking | your code
When it runs | in-line, real time | after the run | in-line | in-line
Acts on failure (recovery) | orchestrated | analysis only | blocks, no recovery | varies
Tamper-evident audit | SHA-256 seal | hosted logs | varies | no
Calls a model | no, deterministic | yes, judges | sometimes | varies
Runtime dependencies | 0 | hosted SDK | varies | varies
Runs on the edge | yes, node-free | no | varies | varies
SaaS lock-in | none | usually | varies | none
License | Apache-2.0 | commercial / OSS mix | varies | your call

The honest summary: keep your post-hoc evals for offline analysis and your guardrails for blocking. AutoMend is the in-line, deterministic decision and audit layer that turns their signals into a recovery action in the hot path. They compose; AutoMend does not replace your model-based judges.

What it does not do

The honest limits of v1.0.0.

AutoMend is a deterministic policy and scoring engine, not a model. These are the boundaries, stated plainly, with the workaround or the roadmap item for each.

Limit | What it means | Workaround / roadmap
It is not a model | AutoMend scores the classifications you supply. The built-in detectors are lexical heuristics; they do not perform natural-language inference. | Classify with a real NLI model or matcher and feed AutoMend the result.
Determinism is by policy | Same inputs, same decision. AutoMend cannot make a sampled model deterministic; it makes the recovery decision deterministic. | By design. Pair it with record and replay if you need a deterministic run.
The seal proves integrity, not identity | The SHA-256 chain shows an audit log was not altered after sealing. It does not prove who produced it. | Roadmap 1.1: optional Ed25519 signing on top of the seal.
Detection is opt-in | AutoMend scores what you route through it. An output you never instrument is never checked. | Wrap every step you care about with guard or guardStep.
Drift needs a baseline | Below the minimum sample count, a metric is reported insufficient-data and is never flagged. | Feed enough observations before relying on drift detection.
Streams are captured as resolved values | A token stream is scored as its final value, not chunk by chunk. | Roadmap 1.1: streaming capture and scoring.
Not a dashboard or profiler | It produces typed results and a sealed audit log, not hosted analytics. | Pipe the results into your own observability; analytics is a horizon item.

Quality and validation

The receipts behind v1.0.0.

Tests & coverage

112 tests passing across 14 suites under Vitest 4. Coverage: 93.8% lines, 93.7% statements, 98.7% functions, 88.0% branches. Run pnpm test on a fresh clone to reproduce.

Type safety

TypeScript 6 in maximum strict mode (exactOptionalPropertyTypes, useUnknownInCatchVariables, noUncheckedIndexedAccess, noImplicitOverride, noImplicitReturns). Zero errors under tsc --noEmit.

Lint, format, types

Biome 2 clean across src and tests. publint clean and attw clean. The exports map is dual ESM + CJS with separate .d.ts and .d.cts across all eleven entry points.

Zero dependencies and size

Zero required runtime dependencies. The core entry point is about 4.2 kB brotli, enforced by size-limit in CI. The core is node-free, so it runs in Node, edge runtimes, and the browser.

CLI smoke

An 11-assertion smoke test spawns the built automend binary from dist and exercises the ESM and CJS artifacts, asserting score, inspect, and verify behavior plus their exit codes.

Supply chain

Committed pnpm lockfile, a files allowlist that ships only dist and the core docs, and SLSA provenance attestation on every published version. Verify with npm view @takk/automend@1.0.0 --json | jq .dist.attestations.

Roadmap

What is shipped, what is next, what is later.

Now (1.0)

Shipped in v1.0.0

  • Confidence scoring and GSAR typed grounding
  • Heuristic grounding, loop, and corruption detectors
  • Welford drift baseline and z-score detection
  • Ordered recovery policy, safe mode, retry budget
  • Tamper-evident SHA-256 audit seal
  • Eleven entry points plus the CLI
  • Dual ESM + CJS, SLSA provenance

Next (1.1)

Targeted for 1.1

  • Streaming capture and scoring
  • OpenAI and Anthropic provider-shaped detectors
  • Ed25519 signing on top of the audit seal
  • Observability exporters (OpenTelemetry, Langfuse)
  • Input-matched recovery for concurrent fan-out

Later

On the horizon

  • A model-backed grounding detector (optional drop-in)
  • Native bridges to sibling @takk packages
  • Policy-as-data: serialize and version recovery policies
  • Recovery analytics across runs (a hosted surface)

FAQ

Common questions.

Is AutoMend production-ready at 1.0.0?

Yes, for what it claims. 112 tests across 14 suites pass under Vitest 4 with 93.8% line coverage; TypeScript 6 maximum strict mode is clean; Biome 2 lint is clean; publint and attw are clean. The core has zero runtime dependencies and is about 4.2 kB brotli. Every release carries SLSA provenance. Read the honest limits section above before you adopt it: determinism here is by policy, not magic.

Does AutoMend actually heal the agent?

It automates the loop: it detects the failure, decides a recovery, and calls the executor you provided (retry, rollback, escalate, ask a human). The repair action is your code; AutoMend makes the decision deterministic and auditable. It does not invent a fix on its own.

Does it detect hallucinations on its own?

No, and we will not pretend otherwise. AutoMend scores the grounding classifications you supply, using the GSAR method. The built-in classifyClaims is a lexical heuristic to get you started; for production fidelity, classify with a real natural-language inference model and feed AutoMend the result.

Does AutoMend call a model or the network?

Never. AutoMend makes zero outbound calls. It is a deterministic, in-process scoring and policy engine with zero runtime dependencies. The only platform API it uses is Web Crypto, for the audit seal.

How is the GSAR grounding score computed?

Each claim is one of grounded, ungrounded, contradicted, or complementary, weighted by an evidence-type strength. Grounded adds its weight, complementary adds a fraction, contradicted subtracts a multiple (an asymmetric penalty), ungrounded adds nothing, normalized and clamped to [0, 1]. The score maps to a three-tier decision: proceed, regenerate, or replan.
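
The description above can be turned into a small worked sketch. The constants here are assumptions, not the published GSAR values: an observed-evidence weight of 1, a hypothetical inferred weight of 0.6, a complementary fraction of 0.5, tier cut-offs of 0.8 and 0.5, and a contradiction penalty of 2. The penalty was picked because it reproduces the sample report in this document (one grounded and one contradicted claim giving raw -0.5, score 0, tier replan).

```typescript
// Illustrative sketch only; every constant below is an assumption, not the published method.
type ClaimType = 'grounded' | 'complementary' | 'ungrounded' | 'contradicted';
type Tier = 'proceed' | 'regenerate' | 'replan';

interface Claim {
  id: string;
  type: ClaimType;
  evidenceType: string;
}

const EVIDENCE_WEIGHT: Record<string, number> = { observed: 1.0, inferred: 0.6 };
const DEFAULT_WEIGHT = 0.5;
const COMPLEMENTARY_FRACTION = 0.5;
const CONTRADICTION_PENALTY = 2; // contradictions cost more than support earns
const TIER_CUTOFFS = { proceed: 0.8, regenerate: 0.5 };

function assessGrounding(claims: Claim[]): { score: number; raw: number; tier: Tier; total: number } {
  let sum = 0;
  let totalWeight = 0;
  for (const claim of claims) {
    const weight = EVIDENCE_WEIGHT[claim.evidenceType] ?? DEFAULT_WEIGHT;
    totalWeight += weight;
    if (claim.type === 'grounded') sum += weight;
    else if (claim.type === 'complementary') sum += COMPLEMENTARY_FRACTION * weight;
    else if (claim.type === 'contradicted') sum -= CONTRADICTION_PENALTY * weight;
    // 'ungrounded' adds nothing to the sum but still counts in the normalizer
  }
  const raw = totalWeight > 0 ? sum / totalWeight : 0; // normalized
  const score = Math.min(1, Math.max(0, raw)); // clamped to [0, 1]
  const tier: Tier =
    score >= TIER_CUTOFFS.proceed ? 'proceed' : score >= TIER_CUTOFFS.regenerate ? 'regenerate' : 'replan';
  return { score, raw, tier, total: claims.length };
}

const grounding = assessGrounding([
  { id: 'c1', type: 'grounded', evidenceType: 'observed' },
  { id: 'c2', type: 'contradicted', evidenceType: 'observed' },
]);
console.log(grounding); // { score: 0, raw: -0.5, tier: 'replan', total: 2 }
```

With these numbers one contradiction cancels two supported claims of equal weight, which is what "asymmetric" means in practice.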

Does it work in Cloudflare Workers, Vercel Edge, Bun, Deno, or the browser?

Yes. The core and the ./edge entry point are node-free. The tamper-evident seal uses Web Crypto SHA-256, which is available in all of them. Only the CLI is Node-specific.

Is the audit seal a digital signature?

No. sealAuditLog builds a SHA-256 hash chain and verifyAuditLog proves the log was not altered after sealing. It proves integrity, not identity: it tells you the log is intact, not who produced it. Optional Ed25519 signing is on the 1.1 roadmap.

How do I keep secrets out of the audit log?

The audit log records only what you hand it: an entry's summary and its optional data. Omit or redact sensitive values before you pass them in. AutoMend never copies data you did not give it.
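
One way to redact before appending is a small helper that walks the data object and masks values under known-sensitive keys. The helper below is a sketch you would own in your code, not part of AutoMend; the name redact, the key list, and the [REDACTED] marker are assumptions.

```typescript
// Illustrative sketch only; redact is a hypothetical helper, not a library export.
function redact(data: Record<string, unknown>, sensitiveKeys: string[]): Record<string, unknown> {
  const blocked = new Set(sensitiveKeys.map((key) => key.toLowerCase()));

  // Returns a redacted copy; the input object is left untouched.
  const walk = (value: unknown): unknown => {
    if (Array.isArray(value)) return value.map(walk);
    if (value !== null && typeof value === 'object') {
      const copy: Record<string, unknown> = {};
      for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
        copy[key] = blocked.has(key.toLowerCase()) ? '[REDACTED]' : walk(inner);
      }
      return copy;
    }
    return value;
  };

  return walk(data) as Record<string, unknown>;
}

const rawData = { score: 0.31, apiKey: 'sk-123', user: { email: 'a@example.com', id: 7 } };
const safeData = redact(rawData, ['apiKey', 'email']);
console.log(safeData);
// safeData is what you would pass as the optional data argument of log.append
```

Matching on key names is a blunt tool. A secret embedded in a free-text summary needs its own scrubbing before it reaches the log.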

Does it support streaming responses?

In 1.0 a stream is scored as its final resolved value, not chunk by chunk. That is enough to recover on the outcome of a streamed call. First-class streaming capture is on the 1.1 roadmap.
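
In practice, "scored as its final resolved value" means you collect the stream first and run your checks on the complete text. A minimal sketch of that pattern follows; tokenStream is a stand-in for a model's stream and endsCleanly is a hypothetical check, neither of which comes from AutoMend.

```typescript
// Illustrative sketch only; tokenStream and endsCleanly are hypothetical stand-ins.
async function resolveStream(chunks: AsyncIterable<string>): Promise<string> {
  let text = '';
  for await (const chunk of chunks) text += chunk;
  return text;
}

// Stand-in for a model's token stream.
async function* tokenStream(): AsyncGenerator<string> {
  yield 'The order ';
  yield 'shipped on ';
  yield 'monday.';
}

// A check that only makes sense on the complete text.
const endsCleanly = (text: string): boolean => /[.!?]$/.test(text.trim());

const streamDemo = resolveStream(tokenStream()).then((text) => ({
  text,
  complete: endsCleanly(text),
}));
streamDemo.then((outcome) => console.log(outcome));
```

The trade-off is latency: nothing is checked until the last chunk arrives, which is the gap the streaming roadmap item addresses.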

How is this different from post-hoc evaluation tools?

LangSmith, Braintrust, and Langfuse analyze runs after the fact, which is excellent for offline quality work. AutoMend runs in-line and acts on the failure with a recovery decision, recorded in a tamper-evident audit trail. They compose; AutoMend does not replace your evals.

How do I verify a published version's provenance?

Every release is published with npm publish --provenance. Check the attestations with npm view @takk/automend@<version> --json | jq .dist.attestations. The attestation links the tarball you installed to the GitHub Actions workflow that built it from a specific source commit.

What is the policy on breaking changes?

Strict SemVer 2.0.0, starting from 1.0.0. The binding stability surface, including the recovery policy and audit shapes, is documented in SPEC.md. Major bumps require a deprecation cycle; security fixes follow the disclosure flow in SECURITY.md.

Author

Built and maintained by David C Cavalcante.

David C Cavalcante

Founder, Takk Innovate Studio

Product Engineer, AI Engineer, ML Engineer, LLM Engineer, LLM Architect, AI Researcher. Builder of the @takk family of npm packages for Massive Intelligence (IM) and non-human entity (NHE) infrastructure.

AutoMend is the reliability tier of a planned portfolio of npm libraries targeting Massive Intelligence (IM) infrastructure for 2026 to 2030. It sits in the agent stack alongside the rest of the @takk family. Adjacent research by the author covers the systemic intelligence frameworks MAIC (the universe), HIM (the model), and NHE (the agent). That research is published independently of this codebase, with notes on PhilPapers and PhilArchive linked from the repository README.

If AutoMend has caught a hallucination for you before it reached a customer, the most useful thing you can do in return is open a GitHub issue with a case it missed. The releasing runbook, the threat model, and the contributor agreement all live in the repository.