Observability Is Not Governance

Where your open-source AI stack stops being audit-grade.

Observability tells you what happened. Governance controls what’s allowed to happen. Most teams shipping AI features have the first and assume it covers the second. It doesn’t.

This post makes three moves. First, it separates the two things people conflate. Second, it borrows a test auditors have used for decades to show exactly where an observability log stops being evidence. Third, it names the two ceilings open-source tooling hits, maps them to SOC 2, and tells you what to do at each one.

A trace is a record.

A record is not evidence.

Here’s the setup most AI teams reach for. The app calls an LLM. A guardrail library checks the input and output. A tracing tool records the call. Pick the popular open-source parts and it works: Guardrails AI or NeMo Guardrails for the checks, OpenLLMetry to instrument, Langfuse to store the traces. You can stand the whole thing up in a week.

Then an enterprise buyer asks for your AI audit trail, and the stack quietly fails the question.

It fails because a log is not the same thing as audit-grade evidence, and that distinction is older than LLMs. Financial auditors have a four-part test for whether a record counts as evidence. The record must be contemporaneous (created when the control ran, not reconstructed later), complete (covers the full period, not a sample), attributable (shows who did what, when), and consistent (shows the control runs every time, not once).

Run an LLM trace through that test. It’s contemporaneous: good. It’s attributable, provided the trace records which user or service made the call. But a single trace is one event, and one event is neither period coverage nor proof that the control runs every time. It shows the guardrail fired on Tuesday. It says nothing about day 47 of a 365-day audit window.

A trace proves something happened once. SOC 2 Type 2 asks whether the control operated effectively across the whole audit period. Those are different claims, and only one of them is in your traces.
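To make the gap between one event and period coverage concrete, here is a minimal sketch that lists the days in an audit window with no trace at all. The function name, the sample timestamps, and the one-trace-per-day granularity are assumptions for illustration. A real check would also have to separate days with no traffic from days where logging failed.

```python
from datetime import date, datetime, timedelta, timezone

def uncovered_days(trace_times, window_start, window_end):
    """Return every date in [window_start, window_end] that has no trace.

    Assumes trace_times are timezone-aware datetimes; days are counted in UTC.
    """
    covered = {t.astimezone(timezone.utc).date() for t in trace_times}
    total_days = (window_end - window_start).days + 1
    window = (window_start + timedelta(days=i) for i in range(total_days))
    return [day for day in window if day not in covered]

# Hypothetical data: traces exist on only two days of a five-day window.
traces = [
    datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc),
    datetime(2024, 1, 3, 14, 0, tzinfo=timezone.utc),
]
gaps = uncovered_days(traces, date(2024, 1, 1), date(2024, 1, 5))
print(gaps)  # the three days with no trace: Jan 2, Jan 4, Jan 5
```

An empty result is necessary for completeness, not sufficient: a day with one trace still says nothing about the calls that were never logged.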

Open source has two ceilings.

One is enforcement. The other is evidence.

The parts you can assemble for free run out in two specific places. Knowing where is the whole game.

The enforcement ceiling. Guardrails AI and NeMo Guardrails enforce rules in code. That’s real. But a developer can edit the rule, weaken it, or comment it out, and nothing stops the deploy. Enforcement that any engineer can bypass with a config change is not a control an auditor trusts. It’s a suggestion that usually gets followed. The gap between “we have a guardrail” and “the guardrail is a control” is change management: who can change it, was the change reviewed, is it versioned, is it approved.
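One way to picture that change-management gap is a deploy gate that refuses any guardrail configuration whose hash has no approval record. The sketch below is an assumption-laden illustration, not an API of Guardrails AI or NeMo Guardrails: the config keys, the approvals registry, and the rule that the approver must differ from the author are all hypothetical.

```python
import hashlib
import json

def config_digest(config: dict) -> str:
    """Hash a guardrail config in a canonical form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_approved(config: dict, approvals: dict) -> bool:
    """A config may deploy only if its exact digest was approved by someone
    other than its author (hypothetical policy)."""
    record = approvals.get(config_digest(config))
    if record is None:
        return False
    return record["approver"] != record["author"]

# Hypothetical approved config and registry entry.
approved_config = {"block_pii": True, "max_toxicity": 0.2}
approvals = {
    config_digest(approved_config): {"author": "dev-a", "approver": "sec-b"},
}

# A developer weakens the rule; the digest changes, so no approval matches.
weakened_config = {"block_pii": False, "max_toxicity": 0.2}

print(is_approved(approved_config, approvals))  # True
print(is_approved(weakened_config, approvals))  # False
```

The gate only helps if the deploy pipeline enforces it and the registry itself sits behind access control. Otherwise it is one more config a developer can edit.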

The evidence ceiling. OpenLLMetry emits traces and Langfuse writes them to a database. By default that database is mutable: anyone with write access can update or delete a row. The word “immutable” gets used a lot here, and it’s marketing until you add the boring infrastructure that earns it: append-only storage, signing, retention locks. A trace store is not a compliance vault because you called it one.
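As a rough illustration of what signing adds, here is a minimal sketch of a hash-chained log in which each entry carries an HMAC over its payload and the previous entry's MAC, so an edited row fails verification. The entry layout, the function names, and the in-memory list are assumptions. This is not how Langfuse or OpenLLMetry store traces, and the hard-coded key stands in for real key management.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry

def _mac(key: bytes, prev_mac: str, payload: dict) -> str:
    """HMAC-SHA256 over the previous MAC plus the canonical payload."""
    body = prev_mac + json.dumps(payload, sort_keys=True)
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()

def append(log: list, key: bytes, payload: dict) -> None:
    """Add an entry chained to the one before it."""
    prev_mac = log[-1]["mac"] if log else GENESIS
    log.append({"payload": payload, "mac": _mac(key, prev_mac, payload)})

def verify(log: list, key: bytes) -> bool:
    """Recompute every MAC in order; any edited entry breaks the chain."""
    prev_mac = GENESIS
    for entry in log:
        expected = _mac(key, prev_mac, entry["payload"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev_mac = entry["mac"]
    return True

key = b"demo-key-not-for-production"  # assumption: real keys live in a KMS
log = []
append(log, key, {"trace_id": "t1", "guardrail": "pii", "result": "blocked"})
append(log, key, {"trace_id": "t2", "guardrail": "pii", "result": "passed"})

intact = verify(log, key)               # chain verifies before any edit
log[0]["payload"]["result"] = "passed"  # someone rewrites history
tampered_ok = verify(log, key)          # the edited entry no longer matches
print(intact, tampered_ok)
```

A chain like this detects edits but not deletion of the most recent entries, and it is only as trustworthy as the key. That is where append-only storage and retention locks come in.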

Concern	Observability stack	What governance requires
Guardrail	Checks output in code	Cannot be disabled without review
Logging	Records every call	Tamper-evident, full-period coverage
Control owner	Whoever wrote the code	Named, with approval workflow
Proof to auditor	“Here’s a trace”	“Here’s the control operating across the period”

This is why a regulated AI deployment is a different product category than LLM observability, not a feature you bolt on. The vendors who win combine policy enforcement, monitoring, evidence, and deployment in one place. Prediction Guard sells exactly that as a self-hosted control plane aligned to NIST AI RMF and the OWASP LLM Top Ten — closed source, fixed price, talk to sales. Open source hands you the components. It does not hand you the guarantee.

Map it to the controls.

The auditor already has the questions.

None of this is new to SOC 2. The framework covers AI without naming it, which is the part teams miss.

CC6.1 — logical access and data protection. Your guardrail blocking PII before it leaves the model is a CC6 control. Whether it’s a real control depends on whether it can be bypassed.
CC7.2 — monitoring and anomaly detection. Your traces are the raw material. They become CC7 evidence only when they’re complete over the period and tamper-evident.
CC8.1 — change management. This is the one that turns a guardrail into a control. If a prompt or a validation rule changes, is there a version, a review, an approval? Without it, your enforcement layer is undocumented and an auditor treats it as such.

Forward-leaning auditors are already asking how AI agents are logged and constrained. The Trust Services Criteria they’ll reach for — CC6, CC7, CC8 — are the ones above. The teams that get caught out are the ones who confused having traces with having controls.

Where this breaks.

Tooling organizes the work. It doesn’t certify it.

Be honest about the limit, because the limit is the credibility.

Even a perfect control plane does not make you compliant. Compliance is the control operating over the period, plus evidence that survives the four-part test, plus an auditor’s judgment that it’s true. No tool produces that. The best tools cut the hours and keep the evidence clean. They don’t replace the auditor and they don’t replace the decision to design the control properly in the first place.

There’s a deeper ceiling too. A guardrail enforces a policy. It cannot tell you the policy is right. You can have flawless enforcement of a rule that doesn’t actually protect anyone, logged perfectly, mapped neatly to CC6. The stack guarantees fidelity to the policy you wrote, never fidelity to the outcome you wanted. That gap closes one way only: a human deciding what the rule should be. Tooling can’t reach it.

Apply this.

Start with the two questions you’re conflating.

Split enforce from record. For each AI feature, write two columns: what is enforced (a guardrail blocks it) and what is recorded (a trace logs it). Most teams discover they have a lot in the second column and almost nothing in the first.
Test each guardrail for bypass. Ask: can a developer disable this without a review? If yes, it isn’t a control yet. Put it behind change management — version it, require approval — and now it maps to CC8.
Run your logs through the four-part test. Contemporaneous, complete over the full period, attributable, consistent. Every “no” is a gap. The most common one is completeness: snapshots don’t prove continuous operation.
Decide your buy line. Either configure the open-source stack to its actual ceiling and accept what it can’t do, or buy the control plane that crosses it. Both are fine. Pretending the free parts cross a line they don’t is the only wrong answer, and it’s the one an auditor finds for you.
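The first two steps above can be sketched as a small inventory: one row per AI feature, with a flag for whether anything is enforced, whether anything is recorded, and whether guardrail changes go through review. The feature names, field names, and category labels are hypothetical. The point is only that "control" requires enforcement plus change management.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    enforced: bool         # a guardrail blocks the behavior
    recorded: bool         # a trace logs the behavior
    change_reviewed: bool  # guardrail changes need review and approval

def classify(feature: Feature) -> str:
    """Label a feature by how far it gets toward being an auditable control."""
    if feature.enforced and feature.change_reviewed:
        return "control"
    if feature.enforced:
        return "bypassable guardrail"
    if feature.recorded:
        return "recorded only"
    return "ungoverned"

# Hypothetical inventory for three features.
features = [
    Feature("support-chat", enforced=True, recorded=True, change_reviewed=True),
    Feature("doc-summarizer", enforced=True, recorded=True, change_reviewed=False),
    Feature("email-drafts", enforced=False, recorded=True, change_reviewed=False),
]

report = {f.name: classify(f) for f in features}
for name, label in report.items():
    print(f"{name}: {label}")
```

Everything short of "control" is a gap to close or to accept knowingly when you decide your buy line.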

If you’re shipping AI into a regulated buyer’s environment and you’re not sure where your stack stops being audit-grade, that line is worth finding before a customer’s security review finds it for you. That’s the kind of thing worth a conversation.
