Know when your model will fail — before it happens

Language models in production break in ways you cannot predict by looking at inputs alone. ACT monitors the output — the only thing you can always observe — and tells you when something is changing.

The problem

Every organization deploying LLMs faces the same blind spot: you don't control what the model does. You can filter inputs. You can add guardrails. But the model's behavior can shift — due to adversarial manipulation, silent vendor updates, context accumulation, or simply because the model operates differently at scale than it did in testing.

Input-side defenses catch known patterns. They miss everything else. White-box monitoring requires access to model weights that most deployers don't have. Using another LLM to judge the first one is expensive, unreliable, and circular.

What you need is a way to detect that the model's behavior is changing — regardless of why. That's what ACT does.

Three components, one system

ACT is built on three independent research efforts that work together to give you full visibility into your model's behavior.

ACT — Passive Monitoring

Attractor Conflict Telemetry

ACT computes 24 deterministic metrics on every model response. No access to weights, no API to the model's internals — just the text it produces.

The principle is straightforward: when a model's internal behavioral balance shifts — whether due to adversarial pressure, degraded alignment, or any other cause — the output text changes in measurable ways. Less hedging. Different sentence structures. Shifts in vocabulary distribution. These changes are often invisible to a human reader but statistically detectable.

The metrics are organized across six levels of analysis, from basic token statistics to composite scoring. Each metric is compared against a deployment-specific baseline using z-scores, producing calibrated alerts (green, yellow, red) tuned to your deployment rather than to generic thresholds, which keeps false-positive noise down.
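The baseline comparison can be sketched in a few lines. The metric names, baseline values, and alert thresholds below are illustrative stand-ins, not ACT's real configuration:

```python
# Hypothetical sketch of deployment-baseline z-scoring. Metric names,
# baseline statistics, and tier cutoffs are invented for illustration.

BASELINE = {                      # per-deployment baseline: metric -> (mean, std)
    "sentence_len_mean": (18.2, 3.1),
    "hedge_density":     (0.042, 0.011),
    "type_token_ratio":  (0.61, 0.05),
}

def alert_level(z: float) -> str:
    """Map an absolute z-score to a calibrated alert tier."""
    z = abs(z)
    if z < 2.0:
        return "green"
    if z < 3.0:
        return "yellow"
    return "red"

def score_response(metrics: dict[str, float]) -> dict[str, tuple[float, str]]:
    """Compare each observed metric against its deployment baseline."""
    report = {}
    for name, value in metrics.items():
        mean, std = BASELINE[name]
        z = (value - mean) / std
        report[name] = (round(z, 2), alert_level(z))
    return report

print(score_response({
    "sentence_len_mean": 19.0,   # close to baseline -> green
    "hedge_density":     0.005,  # hedging collapsed -> red
    "type_token_ratio":  0.47,   # vocabulary narrowed -> yellow
}))
```

Because the computation is pure arithmetic over the response text, the same text always yields the same report, which is what makes the alerts deterministic.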

What this means for you:

  • Continuous, real-time visibility into model behavior in production
  • Detect behavioral drift before it reaches your users
  • Works with any model, any provider — no integration required beyond reading the output
  • Deterministic: same input always produces same measurement, no stochastic variability

ACTIVE — Controlled Probing

Attractor Conflict Testing via Induced Elicitation

Passive monitoring tells you what the model is doing. Active probing tells you what the model would do under specific conditions.

ACTIVE sends controlled stimuli — carefully designed prompts — to the model and measures how it responds. This allows you to map the model's behavioral boundaries: where does it start refusing? Where does it become unreliable? How does it react to different framing of the same request?

The system classifies behavioral changes into distinct patterns: progressive drift (slow degradation over many turns), acute collapse (sudden failure), boundary oscillation (inconsistent behavior on similar inputs), and sub-threshold migration (changes too small to trigger alerts individually, but significant over time).
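To make the four patterns concrete, here is a toy classifier over a per-turn z-score trajectory. The decision rules and thresholds are hypothetical; ACT's actual classification logic is not described in this document:

```python
# Illustrative sketch of the four trajectory patterns. Rules and the
# alert threshold (z = 3.0) are invented for demonstration purposes.

def classify_trajectory(z_scores: list[float], alert: float = 3.0) -> str:
    """Classify a per-turn z-score trajectory into a drift pattern."""
    n = len(z_scores)
    if n < 2:
        return "insufficient-data"
    step = max(abs(b - a) for a, b in zip(z_scores, z_scores[1:]))
    slope = (z_scores[-1] - z_scores[0]) / (n - 1)
    sign_flips = sum(1 for a, b in zip(z_scores, z_scores[1:]) if a * b < 0)

    if step >= alert:                        # one sudden jump past the alert line
        return "acute-collapse"
    if sign_flips >= n // 2:                 # metric keeps crossing the baseline
        return "boundary-oscillation"
    if abs(z_scores[-1]) >= alert:           # steady climb that ends in alert range
        return "progressive-drift"
    if slope > 0 and all(abs(z) < alert for z in z_scores):
        return "sub-threshold-migration"     # each step small, trend significant
    return "stable"
```

Note that sub-threshold migration is only visible to a classifier that looks at the whole trajectory: every individual turn stays under the alert line.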

What this means for you:

  • Proactively test your model's reliability before problems reach production
  • Compare models objectively — same tests, same metrics, quantified differences
  • Detect silent vendor updates that change model behavior without notice
  • Understand where your model's behavioral limits are, not just that they exist

SIGTRACK — Behavioral Memory

Signature Tracking for Regime and Attractor Conflict Knowledge

ACT measures. ACTIVE tests. SIGTRACK remembers.

Every behavioral observation is recorded in an append-only, hash-chained forensic ledger: a tamper-evident audit trail with cryptographic integrity. When an incident occurs, you can reconstruct exactly what the model was doing in the hours and days before it happened.
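The hash-chain mechanism itself fits in a short sketch. This assumes SHA-256 over JSON records; SIGTRACK's actual record schema and hashing scheme are not specified here:

```python
# Minimal hash-chain ledger sketch. Each entry's hash covers both the
# record and the previous entry's hash, so editing any past entry
# breaks every subsequent link.
import hashlib
import json

class Ledger:
    def __init__(self):
        self.entries = []   # each: {"record": ..., "prev": ..., "hash": ...}

    def append(self, record: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps({"record": record, "prev": prev}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every link; any edited entry breaks the chain."""
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps({"record": e["record"], "prev": prev},
                                 sort_keys=True)
            if e["prev"] != prev:
                return False
            if hashlib.sha256(payload.encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

Retroactively altering a single stored observation invalidates the chain from that point forward, which is exactly the property an auditor checks.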

Beyond storage, SIGTRACK extracts behavioral signatures — recurring patterns in metric trajectories that characterize specific types of behavioral change. Once the system has seen a pattern, it recognizes it instantly the next time. This means faster detection, fewer false positives, and a system that gets better with operational experience.

What this means for you:

  • Full audit trail for compliance, incident investigation, and regulatory evidence
  • Pattern recognition that improves over time — the system learns from confirmed events
  • Forensic reconstruction: trace any incident back to its earliest warning signs
  • Tamper-evident records: the cryptographic hash chain makes any alteration detectable

Deeper analysis layers

Beyond raw metric telemetry, ACT includes two advanced analysis engines that classify behavioral posture at the sentence level and across multi-agent systems.

PSA v2 — Posture Sequence Analysis

Sentence-level behavioral classification via 5 micro-classifiers

ACT's metric layer tells you that something changed. PSA v2 tells you what kind of change it is. It runs five independent micro-classifiers over every sentence in a model response, each trained to recognize a specific behavioral pattern.

The classifiers operate at the sentence level, which means they can detect subtle behavioral inconsistencies that only appear in part of a response — not just across the whole turn. A model can be cooperative in three sentences and show adversarial posture in the fourth; PSA v2 catches that.

The five classifiers:

  • C0 Language & intent — Identifies the language and high-level intent of the response.
  • C1 Adversarial stress posture — 16-class taxonomy measuring how the model responds under pressure. Produces POI (Posture Oscillation Index), PE (Posture Entropy), and DPI (Dissolution Position Index).
  • C2 Sycophancy density — Detects excessive agreement and validation-seeking patterns (SD score).
  • C3 Hallucination risk index — Flags sentences with characteristics associated with confabulation (HRI score).
  • C4 Persuasion density & technique diversity — Identifies rhetorical manipulation patterns and how many distinct techniques are present (PD, TD).

The classifier outputs are combined into a single Behavioral Health Score (BHS): a normalized 0–1 index where 1.0 means no detected behavioral anomaly. BHS drops when sycophancy is high, when the adversarial stress posture is oscillating, or when persuasion density exceeds expected norms.
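One way to picture the composition is a weighted penalty over the per-classifier signals. The weights and the linear penalty form below are purely illustrative, not ACT's published formula:

```python
# Hypothetical BHS composition: each input is assumed normalized to 0-1,
# where 0 = clean and 1 = maximal anomaly. Weights are invented.

def behavioral_health_score(sd: float, hri: float, pd: float, poi: float,
                            weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Combine sycophancy (SD), hallucination risk (HRI), persuasion
    density (PD), and posture oscillation (POI) into one 0-1 index."""
    penalties = (sd, hri, pd, poi)
    score = 1.0 - sum(w * p for w, p in zip(weights, penalties))
    return max(0.0, min(1.0, score))  # clamp to the 0-1 range

clean = behavioral_health_score(sd=0.05, hri=0.02, pd=0.04, poi=0.10)
sycophantic = behavioral_health_score(sd=0.80, hri=0.10, pd=0.20, poi=0.30)
```

A turn dominated by validation-seeking language drags the score down even when the other classifiers stay quiet, which is the behavior the BHS is designed to surface.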

What this means for you:

  • Know not just that behavior changed, but whether it's sycophancy, adversarial drift, or hallucination risk
  • Sentence-level granularity: pinpoint which part of the response is problematic
  • Single BHS metric per turn for easy monitoring and alerting
  • Works alongside v1 ACT metrics — same session, complementary signals

PSA v3 — Agentic Posture Sequence Analysis

Multi-agent behavioral analysis with graph topology, Swiss Cheese detection, and temporal prediction

Single-agent analysis assumes there is one model to monitor. Agentic systems — where multiple LLMs interact, delegate to each other, and call external tools — create a completely different risk surface. PSA v3 was built for this.

You submit an agent interaction trace: a sequence of nodes (agent outputs) and edges (delegation, correction, tool calls, results). PSA v3 builds a directed acyclic graph from this trace and runs four distinct analysis pipelines on it simultaneously.
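A trace-to-graph step of this kind can be sketched as follows. The field names ("id", "src", "dst", "kind") are placeholders; PSA v3's real trace schema may differ:

```python
# Sketch of turning an agent interaction trace into a DAG and checking
# acyclicity with Kahn's topological sort. Schema is illustrative.
from collections import defaultdict, deque

def build_graph(nodes, edges):
    """Return adjacency lists and a topological order; raise on cycles."""
    adj = defaultdict(list)
    indeg = {n["id"]: 0 for n in nodes}
    for e in edges:
        adj[e["src"]].append(e["dst"])
        indeg[e["dst"]] += 1
    ready = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(indeg):
        raise ValueError("trace contains a cycle; not a valid DAG")
    return adj, order

nodes = [{"id": "planner"}, {"id": "coder"}, {"id": "reviewer"}]
edges = [{"src": "planner", "dst": "coder", "kind": "delegation"},
         {"src": "coder", "dst": "reviewer", "kind": "correction"}]
adj, order = build_graph(nodes, edges)
```

Once the graph exists, each analysis pipeline walks it independently, which is what allows the four pipelines to run simultaneously.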

The four pipelines:

  • Per-node PSA v2 — Each node in the graph gets full C1–C4 classification and a BHS score. Behavioral anomalies are detected at the individual agent level before they propagate.
  • Bayesian Swiss Cheese detection — Models the multi-agent pipeline as a stack of defenses. Computes the Swiss Cheese Score (SCS): the probability that behavioral anomalies at multiple nodes align into a system-level failure path. Identifies which "holes" are open and at what depth.
  • Cross-agent metrics — Computes Posture Propagation Index (PPI), Cascade Alignment across the graph, Weighted Load Score (WLS), Contagion Effect Rate (CER), and the Critical Agent Health Score (CAHS). Identifies the highest-risk path through the agent graph.
  • C5 action-risk classification + Posture-Action Incongruence (PAI) — Classifies every tool call by risk level (C5 classifier: none/low/medium/high/critical). Computes PAI: the mismatch between a node's textual posture (C1) and the risk of its actions. A model that sounds helpful but executes high-risk tool calls scores high on PAI.
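The Swiss Cheese intuition can be shown with a toy calculation under a strong independence assumption: treat each agent on a delegation path as one defensive layer whose "hole" is open with probability equal to its anomaly score (1 minus BHS). The real SCS model is Bayesian and considerably richer than this:

```python
# Toy Swiss Cheese calculation. Assumes independent layers, which the
# actual Bayesian model does not; numbers are for illustration only.

def swiss_cheese_score(bhs_along_path: list[float]) -> float:
    """Probability that every layer on one delegation path fails at once."""
    p = 1.0
    for bhs in bhs_along_path:
        p *= (1.0 - bhs)       # this layer's hole is open with prob 1 - BHS
    return p

# Three healthy agents: the aligned-failure probability stays tiny.
healthy = swiss_cheese_score([0.95, 0.97, 0.93])
# One badly degraded agent in the middle widens the hole twenty-fold.
degraded = swiss_cheese_score([0.95, 0.40, 0.93])
```

The multiplicative structure is the point: system-level failure requires holes to line up, so a single unhealthy agent raises the path risk disproportionately, and the highest-risk path through the graph is the one to watch.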

Finally, PSA v3 feeds the graph's behavioral state into a Hidden Markov Model that predicts future states across a configurable horizon. It estimates how many turns until the system reaches a red-alert state and provides an early warning level with an actionable recommendation.

What this means for you:

  • Monitor multi-agent systems as a whole, not just individual model outputs
  • Detect when a behavioral anomaly in one agent is likely to cascade to others
  • Catch Posture-Action Incongruence: agents that appear safe textually but take high-risk actions
  • Predictive early warning: know how many turns before system behavior reaches a critical threshold
  • Swiss Cheese score provides a single, interpretable system-level risk indicator

Why output-only monitoring works

The core principle behind ACT is simple: when a model's behavior changes, the text it produces changes too. Not just in content — in statistical structure. Sentence length distributions shift. Vocabulary diversity changes. The balance between cautious and assertive language moves.

These changes happen because the model's internal state determines its output distribution. A model under adversarial pressure doesn't produce the same text as a model operating normally — it can't, because the probability distribution it's sampling from has been altered. The behavioral shift and the statistical signature are the same phenomenon, observed at different levels.

This means that any effective manipulation leaves a trace in the output. An attacker who eliminates all statistical signatures has, by definition, failed to change the model's behavior. There is a fundamental trade-off between attack effectiveness and evasion — and ACT sits exactly at that boundary.

Start monitoring your models

24 metrics. Real-time analysis. No model access required.