The Audit You Are Not Ready For

There is a particular silence that settles over a model governance team when a supervisory letter arrives asking them to reconstruct a lending decision from fourteen months ago. Not a sampled decision. A specific one — applicant X, branch Y, declined on a date the regulator has already pulled from the borrower's complaint. The team knows the model performed within tolerance. They have dashboards. They have a model card. And almost none of that will answer the question being asked.

The uncomfortable truth is that audit readiness in AI-driven lending has very little to do with documentation and almost everything to do with infrastructure. The questions supervisors are beginning to ask are not "describe your model risk framework." They are "prove, from tamper-evident records, exactly why this borrower was declined, under which model version, with which thresholds active, and what recourse you issued." Most institutions cannot answer that — not because they were negligent, but because their systems were never architected to produce evidence. They were architected to produce decisions.

Section 1: What an AI Lending Audit Actually Interrogates

A single lending decision is not one fact. It is a chain of dependent facts, and a regulator examining it for fairness, reproducibility, or recourse compliance traverses every link:

model_id
  → model_version
    → training_data_snapshot_hash
      → feature_schema_version
        → threshold_configuration_id
          → input_feature_vector
            → prediction_output
              → SHAP_attribution
                → recourse_issued
                  → governance_controls_active

Each node answers a question the supervisor is entitled to ask:

model_id and model_version. Which model, in which state, produced this?
training_data_snapshot_hash. On what data was that version trained, and can you prove the data has not been silently altered?
feature_schema_version. What did each input feature mean at decision time? A feature named income_stability in v3 of the schema may have been computed differently than the identically named feature in v5.
threshold_configuration_id. What cutoff converted the model's probability into an approve/decline?
input_feature_vector and prediction_output. These are the decision itself.
SHAP_attribution. Why? This is the per-feature contribution to this specific score.
recourse_issued. What actionable path to approval was communicated? This is a direct FREE-AI and fair-lending obligation.
governance_controls_active. Which guardrails (bias checks, override logging, human-review gates) were live when this ran?
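One way to picture the chain is as a single immutable record written at inference time. The sketch below is illustrative only: the field names mirror the nodes above, but the types, the canonical_bytes and content_hash helpers, and every sample value are assumptions made for the example, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DecisionRecord:
    model_id: str
    model_version: str
    training_data_snapshot_hash: str
    feature_schema_version: str
    threshold_configuration_id: str
    input_feature_vector: tuple      # (feature_name, value) pairs
    prediction_output: float
    shap_attribution: tuple          # (feature_name, contribution) pairs
    recourse_issued: str
    governance_controls_active: tuple

    def canonical_bytes(self) -> bytes:
        # Sorted keys and fixed separators give a stable serialisation,
        # so the same record always hashes to the same value.
        return json.dumps(asdict(self), sort_keys=True,
                          separators=(",", ":")).encode("utf-8")

    def content_hash(self) -> str:
        return hashlib.sha256(self.canonical_bytes()).hexdigest()


# All values below are hypothetical placeholders.
record = DecisionRecord(
    model_id="credit-risk",
    model_version="7.2.0",
    training_data_snapshot_hash="sha256:<snapshot digest>",
    feature_schema_version="schema-v3",
    threshold_configuration_id="thr-001",
    input_feature_vector=(("income_stability", 0.2), ("utilisation", 0.7)),
    prediction_output=0.61,
    shap_attribution=(("income_stability", -0.24), ("utilisation", -0.15)),
    recourse_issued="Reduce utilisation below 50% and reapply",
    governance_controls_active=("bias_check", "override_logging"),
)

# Changing any single node yields a different record with a different hash.
altered = replace(record, threshold_configuration_id="thr-002")
```

Because every node lives in the same hashed record, a question about any one link (which threshold, which schema) can be answered from the record itself rather than inferred from today's system state.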

Every link must be independently verifiable and tamper-evident. Independently, because a chain that can only be validated as a whole is a chain a regulator must take on trust, and supervisory examination does not run on trust. Tamper-evident, because the entire evidentiary value of the record collapses the moment it becomes possible to alter it without detection. An audit trail you could have edited is, evidentially, an audit trail you did edit.

Now walk the failure modes, node by node, because this is where institutions discover the gap:

Incomplete model_version records. If decisions are logged against "the production model" without binding to a specific, immutable version artifact, you cannot prove which model scored this applicant. When you retrained in May, every decision before and after collapses into the same undifferentiated bucket. The regulator asks "which version declined this borrower?" and the honest answer is "we don't know."
Threshold changes not version-controlled. The model probability was 0.61. Approve or decline? It depends entirely on whether the cutoff was 0.60 or 0.65 on that day. If threshold adjustments live in a config file someone edits in place, or worse, in a business rule changed through a console with no snapshot, the decision is unreproducible. You can rerun the model and get 0.61 forever and still not know the outcome it produced.
SHAP values not stored at decision time. This is the most common failure and the most damaging. The institution recomputes SHAP attributions during the audit, against the current model, and presents them as the explanation for a decision made by a prior model on a prior feature schema. The numbers are fiction: internally consistent fiction, but fiction. The explanation does not correspond to the decision under examination.
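The threshold failure mode can be shown in a few lines. This is a minimal sketch under stated assumptions: the ThresholdConfig type, the decide function, the configuration IDs, and the convention that scores at or above the cutoff are approved are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    threshold_configuration_id: str  # hypothetical ID format
    cutoff: float


def decide(score: float, config: ThresholdConfig) -> str:
    # Assumption: scores at or above the cutoff are approved.
    return "approve" if score >= config.cutoff else "decline"


cfg_a = ThresholdConfig("thr-001", 0.60)
cfg_b = ThresholdConfig("thr-002", 0.65)

score = 0.61  # the model output is perfectly reproducible
outcome_a = decide(score, cfg_a)
outcome_b = decide(score, cfg_b)
```

The same score produces opposite outcomes under the two configurations, which is why the decision record needs to bind the score to the exact threshold_configuration_id that was in force.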

Section 2: The Immutability Problem

Conventional databases are built for the opposite of what an audit requires. They are optimised for UPDATE and DELETE — for the current truth, efficiently maintained. A mutable record is, by construction, evidentially weak: there is no cryptographic obstacle to changing it, and therefore no cryptographic proof that it was not changed. "We have a decisions table" is not the same statement as "we can prove this decision record is exactly as it was written at inference time."

An evidentiary decision ledger has four non-negotiable technical properties.

Append-only storage. Records are written once. There is no update path and no delete path at the application layer; corrections are new records that reference and supersede prior ones, never overwrites. History is accretive, never destructive.
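A minimal in-memory sketch of the append-only idea follows. It is not a storage engine; the AppendOnlyLedger and LedgerEntry names, the supersedes field, and the sample payloads are assumptions for illustration. The point is the shape of the interface: there is an append method and read methods, and nothing else.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    entry_id: int
    payload: dict
    supersedes: Optional[int] = None  # entry_id this record corrects, if any


class AppendOnlyLedger:
    """Exposes append and read operations only; no update, no delete."""

    def __init__(self):
        self._entries = []

    def append(self, payload: dict, supersedes: Optional[int] = None) -> LedgerEntry:
        if supersedes is not None and not 0 <= supersedes < len(self._entries):
            raise ValueError("supersedes must reference an existing entry")
        # Copy the payload so later changes to the caller's dict do not leak in.
        entry = LedgerEntry(len(self._entries), dict(payload), supersedes)
        self._entries.append(entry)
        return entry

    def history(self) -> tuple:
        return tuple(self._entries)

    def current(self, entry_id: int) -> LedgerEntry:
        """Follow supersession links forward to the latest correction."""
        latest = entry_id
        for entry in self._entries:  # corrections always come after originals
            if entry.supersedes == latest:
                latest = entry.entry_id
        return self._entries[latest]


ledger = AppendOnlyLedger()
original = ledger.append({"decision": "decline", "recourse_issued": "none recorded"})
correction = ledger.append(
    {"decision": "decline", "recourse_issued": "letter sent"},
    supersedes=original.entry_id,
)
```

After the correction, ledger.current(0) resolves to the new entry while ledger.history() still contains the original, so the fact that a correction happened is itself part of the record.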

Hash-chaining. Each record carries the cryptographic hash of the record before it, so the ledger forms a chain:

$$ H_n = \text{SHA256}\big(\,\text{content}_n \,\Vert\, H_{n-1}\,\big) $$

Altering any historical record changes its hash, which breaks every downstream link in the chain. Tampering is not merely discouraged — it is detectable, deterministically, by recomputing the chain.
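The chain and its verification can be sketched with the standard library alone. This is an illustrative sketch, not a production ledger: the all-zero genesis value, the dictionary layout of each entry, and the function names are assumptions. The hash of each entry is computed over its content concatenated with the previous hash, as in the formula above.

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for H_0; any fixed constant works


def link_hash(content: bytes, prev_hash: str) -> str:
    # H_n = SHA256(content_n || H_{n-1}), with H_{n-1} as 32 raw bytes
    return hashlib.sha256(content + bytes.fromhex(prev_hash)).hexdigest()


def build_chain(contents):
    chain, prev = [], GENESIS
    for content in contents:
        digest = link_hash(content, prev)
        chain.append({"content": content, "hash": digest})
        prev = digest
    return chain


def first_broken_link(chain):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        if link_hash(entry["content"], prev) != entry["hash"]:
            return i
        prev = entry["hash"]
    return None


chain = build_chain([b"decision-1", b"decision-2", b"decision-3"])
intact = first_broken_link(chain)

# Case 1: edit a historical record and leave its stored hash alone.
chain[1]["content"] = b"decision-2-edited"
naive_tamper = first_broken_link(chain)

# Case 2: the editor also recomputes that record's hash to cover the change.
chain[1]["hash"] = link_hash(chain[1]["content"], chain[0]["hash"])
covered_tamper = first_broken_link(chain)
```

In the second case the edited record verifies on its own, but the next record no longer does, because its stored hash was computed over the original predecessor. Hiding the edit would require rewriting every later hash, which is what publishing or externally anchoring the latest hash is meant to prevent.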

Hash verification at retrieval. Integrity is not assumed because the record sits in a "secure" system. Every read recomputes the hash of the retrieved content and checks it against the stored chain. A record that fails verification is flagged, not served as evidence.

Controlled access with access logs. Who read which decision record, when, and why is itself an auditable event. The access log is part of the evidentiary surface — supervisors examining a discrimination complaint care a great deal about who touched the record after the complaint was filed.

The architectural pattern that makes this clean is content-addressable storage. Each decision record is identified not by an auto-incrementing ID but by the SHA-256 hash of its own content. The identifier is the integrity check. If a single byte of the record changes — a flipped threshold, an edited SHAP value, a backdated timestamp — the hash changes, the address changes, and the alteration is self-evident. You cannot quietly modify a content-addressed record and leave it sitting at the same address; the address would no longer point to it.
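A content-addressed store that verifies on every read can be sketched as follows. The ContentAddressedStore class, the IntegrityError exception, and the sample record bytes are hypothetical names and values chosen for the example; a real deployment would sit on durable storage rather than a dictionary.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their address."""


class ContentAddressedStore:
    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        # The address is derived from the content, not assigned by a counter.
        address = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(address, content)  # write once, never overwrite
        return address

    def get(self, address: str) -> bytes:
        content = self._blobs[address]
        # Verification at retrieval: recompute the hash on every read.
        if hashlib.sha256(content).hexdigest() != address:
            raise IntegrityError(f"record at {address[:12]} failed verification")
        return content


store = ContentAddressedStore()
address = store.put(b'{"score": 0.61, "cutoff": 0.60}')
retrieved = store.get(address)

# A record with a different cutoff is a different record at a different address.
other_address = store.put(b'{"score": 0.61, "cutoff": 0.65}')
```

If the bytes behind an address are modified out of band, get raises IntegrityError instead of serving the altered record as evidence.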

This is the distance between "we have logs" and evidentiary integrity. Logs tell you what the system believes happened. Evidentiary integrity lets you prove, to an adversarial examiner, that what the record says is what occurred and has not been touched since. Most institutions have the former and discover, mid-audit, that the regulator was asking for the latter.

Section 3: The 14-Month Reconstruction Exercise

Make it concrete. The supervisor writes:

"Show me every decision made on self-employed applicants in Tier 2 cities between April and June 2025."

This is not one query. It is a reconstruction across a moving system, and it demands four capabilities at once:

Query across archived model versions. Three different model versions may have been live during that quarter. Each decision must resolve to the version that actually produced it — not to whatever is in production today.
Threshold configurations active during the window. For each decision, the specific threshold_configuration_id in force at that timestamp, retrievable as it existed then, not as it exists now.
Segment labelling consistency across schema changes. "Self-employed" and "Tier 2 city" must mean the same thing across the window — or, where the schema changed, the mapping between definitions must be explicit. If employment_type was re-encoded in May, a naïve segment query silently mixes two populations and produces a fairness statistic that is meaningless.
SHAP values stored per decision. Attributions captured at inference, bound to the exact feature vector and model version, retrievable per decision.
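Two of these capabilities reduce to small lookups: resolving what was in force at a timestamp, and translating segment labels across schema versions. The sketch below assumes hypothetical version IDs, dates, and employment_type encodings; none of them come from a real system.

```python
from bisect import bisect_right
from datetime import datetime


class EffectiveDatedRegistry:
    """Maps effective-from timestamps to artifact IDs (models, thresholds)."""

    def __init__(self):
        self._starts = []
        self._ids = []

    def register(self, effective_from: datetime, artifact_id: str) -> None:
        if self._starts and effective_from <= self._starts[-1]:
            raise ValueError("versions must be registered in chronological order")
        self._starts.append(effective_from)
        self._ids.append(artifact_id)

    def active_at(self, when: datetime) -> str:
        # Rightmost version whose effective-from is at or before `when`.
        i = bisect_right(self._starts, when)
        if i == 0:
            raise LookupError("no version in force at that time")
        return self._ids[i - 1]


# Hypothetical deployment history.
models = EffectiveDatedRegistry()
models.register(datetime(2025, 1, 10), "credit-v7")
models.register(datetime(2025, 5, 12), "credit-v8")

# Hypothetical re-encoding of employment_type between schema versions.
EMPLOYMENT_TYPE_MAP = {
    ("schema-v3", "SE"): "self_employed",
    ("schema-v3", "SAL"): "salaried",
    ("schema-v5", 2): "self_employed",
    ("schema-v5", 1): "salaried",
}


def canonical_employment_type(schema_version: str, raw_value) -> str:
    # A KeyError here is deliberate: an unmapped value should stop the query,
    # not be silently dropped from the segment.
    return EMPLOYMENT_TYPE_MAP[(schema_version, raw_value)]


april_model = models.active_at(datetime(2025, 4, 20))
june_model = models.active_at(datetime(2025, 6, 3))
```

A segment query over the quarter would filter on the canonical label rather than the raw encoded value, so decisions logged under both schemas land in the same population.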

In practice, institutions hit three failure points in this exact exercise, and they are worth naming precisely because they are so predictable:

(a) Model version records not linked to individual decisions. The decisions table records the score and outcome but not the immutable version that produced them. You can no longer say which model scored whom. The segment-level fairness picture the regulator wants becomes unconstructible at the decision level.

(b) Threshold changes undocumented between model versions. The model versions are archived, but the cutoffs were tuned between retrainings — quietly, in production config — and never snapshotted. The outcomes are no longer reproducible even with the right model in hand, because the function that turned probability into decision is gone.

(c) SHAP values computed retroactively instead of stored at inference time. This is the failure that looks like a success and is therefore the most dangerous. The team confidently regenerates explanations and submits them. They are recomputed against today's artifacts. They do not explain the decisions under audit. Retroactive SHAP is the wrong answer delivered with the confidence of the right one — and a competent examiner who asks how the attributions were produced will catch exactly this.
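The gap between stored and retroactive attributions can be illustrated without a SHAP library. For a linear model with independent features, the SHAP value of a feature has a closed form: the weight times the difference between the feature value and its background mean. The sketch below uses that closed form as a stand-in for whatever explainer is actually deployed; the two model versions, their weights, and the applicant values are invented for the example.

```python
def linear_attributions(weights, background_means, x):
    # Closed-form SHAP for a linear model with independent features:
    # phi_i = w_i * (x_i - E[x_i])
    return {name: weights[name] * (x[name] - background_means[name])
            for name in weights}


# Hypothetical model in production on the decision date.
model_at_decision = {
    "weights": {"income_stability": 0.8, "utilisation": -0.5},
    "means": {"income_stability": 0.5, "utilisation": 0.4},
}
# Hypothetical retrained model in production at audit time.
model_at_audit = {
    "weights": {"income_stability": 0.3, "utilisation": -0.9},
    "means": {"income_stability": 0.5, "utilisation": 0.4},
}

applicant = {"income_stability": 0.2, "utilisation": 0.7}

# Captured at inference and written to the ledger with the decision.
stored = linear_attributions(
    model_at_decision["weights"], model_at_decision["means"], applicant)

# Recomputed months later against whatever is deployed now.
retroactive = linear_attributions(
    model_at_audit["weights"], model_at_audit["means"], applicant)

# The feature that pushed the score down hardest under each explanation.
stored_top_reason = min(stored, key=stored.get)
retroactive_top_reason = min(retroactive, key=retroactive.get)
```

With these illustrative numbers the stored explanation names income_stability as the main adverse factor, while the retroactive one names utilisation. Both are internally consistent; only the first describes the decision that was actually made.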

Section 4: Governance Drift as Audit Risk

Governance drift is the slow accumulation of undocumented threshold adjustments, policy overrides, and feature schema changes that cause actual decision behaviour to diverge from documented model behaviour. It is not model drift: your Population Stability Index (PSI) and Kolmogorov-Smirnov (KS) statistics may be perfectly healthy. The data distribution can be stable, the model statistically sound, and the institution still wildly out of compliance, because the gap is between what the documentation describes and what the system actually does.

This is precisely why governance drift creates audit exposure that statistical monitoring is blind to. PSI and KS measure whether the inputs and scores are shifting. They say nothing about whether someone nudged a cutoff from 0.60 to 0.63 for a particular segment in March, applied a manual override policy to a product line in April, and re-derived a feature in May — each change reasonable in isolation, none of them recorded against the governance baseline. The model card now describes a system that no longer exists. Statistically you are fine. Evidentially you are exposed, because you cannot reconcile documented behaviour with observed behaviour, and that reconciliation is exactly what an audit demands.
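A small numerical sketch makes the blind spot visible. It computes PSI over score bins using the usual definition, the sum over bins of (actual share minus expected share) times the natural log of their ratio, and compares it with approval rates before and after a cutoff change. The scores, the four equal-width bins, and the approve-at-or-above convention are assumptions chosen for illustration.

```python
import math


def bin_shares(scores, n_bins=4):
    # Equal-width bins over [0, 1]; a score of exactly 1.0 goes in the last bin.
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    # A small floor avoids log(0) for empty bins.
    return [max(c / len(scores), 1e-6) for c in counts]


def psi(expected_scores, actual_scores, n_bins=4):
    expected = bin_shares(expected_scores, n_bins)
    actual = bin_shares(actual_scores, n_bins)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def approval_rate(scores, cutoff):
    # Assumption: scores at or above the cutoff are approved.
    return sum(s >= cutoff for s in scores) / len(scores)


# Hypothetical segment scores; the model and the population have not changed.
march_scores = [0.35, 0.48, 0.55, 0.61, 0.62, 0.64, 0.70, 0.82]
april_scores = list(march_scores)

score_psi = psi(march_scores, april_scores)

# Someone moved the segment cutoff from 0.60 to 0.63 between the two months.
rate_before = approval_rate(march_scores, 0.60)
rate_after = approval_rate(april_scores, 0.63)
```

The score distribution is identical, so PSI reports no shift at all, yet the approval rate for the segment falls from 62.5% to 37.5%. Nothing in a distribution monitor would have flagged it.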

A governance drift monitoring system closes that gap with three mechanisms:

Automated comparison of threshold configuration snapshots. Every threshold set is snapshotted and versioned; the monitor continuously diffs the live configuration against the last approved baseline and surfaces any divergence as a governance event requiring sign-off. There is no such thing as a silent cutoff change.
Segment-level outcome distribution monitoring. Approval rates, decline reasons, and recourse issuance tracked per segment over time, so a quiet shift in how self-employed Tier 2 applicants are treated raises a flag long before it becomes a complaint and then a supervisory letter.
Policy change versioning. Overrides, business rules, and review gates are versioned artifacts with effective-from timestamps and approver identity — not console edits. Every policy in force at any past moment is reconstructible.
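The first mechanism, diffing the live configuration against the approved baseline, can be sketched as a pure function. The segment names, cutoff values, and the shape of the governance event are hypothetical; a real monitor would also record who made the change and route the event for sign-off.

```python
def diff_thresholds(approved: dict, live: dict) -> list:
    """Return one governance event per segment whose live cutoff differs
    from the approved baseline, including segments added or removed."""
    events = []
    for segment in sorted(set(approved) | set(live)):
        approved_cutoff = approved.get(segment)
        live_cutoff = live.get(segment)
        if approved_cutoff != live_cutoff:
            events.append({
                "segment": segment,
                "approved": approved_cutoff,
                "live": live_cutoff,
            })
    return events


# Hypothetical last signed-off baseline and current production configuration.
approved_baseline = {"default": 0.60, "self_employed_tier2": 0.60}
live_config = {"default": 0.60, "self_employed_tier2": 0.63, "new_to_credit": 0.70}

events = diff_thresholds(approved_baseline, live_config)
```

Here the monitor would surface two events: a changed cutoff for self_employed_tier2 and a new_to_credit segment that has no approved baseline at all. Running the same diff on a schedule turns a silent console edit into a recorded governance event.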

Audit-Ready by Design

The thread running through all of this is a single architectural choice about when governance infrastructure is built. The institutions that fail these tests did not lack governance intent. They built decision infrastructure first and tried to layer evidentiary infrastructure on top of it later — typically in the weeks before an audit, working backward from records that were never designed to be evidence.

Audit-ready by design inverts the order. It means the immutable decision ledger, the at-inference SHAP capture, the versioned thresholds, the content-addressed records, and the drift monitors are provisioned at model deployment time, as part of what it means to put a model into production at all. Every decision is born into a tamper-evident chain. Every explanation is captured against the model and schema that produced it. Every threshold and policy change is a versioned, signed event. The evidence is a byproduct of operating correctly, not an artifact assembled under deadline.

The distinction is not academic. Retrofitted audit trails answer the questions you anticipated. Designed-in evidentiary infrastructure answers the question the regulator actually asks — including the one about a single borrower, fourteen months ago, that you have not thought about since. If reading this produced a quiet inventory of which of these nodes your current stack cannot reconstruct, that recognition is the point. The audit is not a documentation deadline you can meet with effort. It is an infrastructure property you either have or you do not — and you find out which on the day the letter arrives.

Frequently Asked Questions

What is the difference between model drift and governance drift? Model drift describes statistical change in input data or score distributions, measured by metrics like PSI and KS. Governance drift describes the divergence between documented model behaviour and actual decision behaviour caused by undocumented threshold, policy, and schema changes. A system can be statistically stable (no model drift) while being severely exposed to governance drift, which is why distribution monitoring alone does not establish audit readiness.

Why can't a conventional database serve as an audit trail? Conventional databases support in-place updates and deletes, so a record offers no cryptographic proof it has not been altered. Evidentiary integrity requires append-only storage, hash-chaining so that any historical change is detectable, hash verification at retrieval, and access logging. Having logs is not the same as being able to prove those logs are unaltered.

Why must SHAP values be stored at inference time rather than recomputed during an audit? Recomputed SHAP attributions are calculated against the current model and feature schema, not the version that produced the original decision. They do not explain the decision under examination — they explain a different decision the current system would make. Only attributions captured at inference, bound to the exact model version and feature vector, are evidentially valid.

What does "audit-ready by design" mean in practice? It means provisioning evidentiary infrastructure — an immutable decision ledger, at-inference explanation capture, versioned thresholds and policies, and governance drift monitoring — at model deployment time, so that audit evidence is a byproduct of correct operation rather than something reconstructed retroactively before an audit.