
FIELD NOTE

AI Fairness in Credit Scoring: DTPG, RVDR, and CFDI Explained

NomoCrit

Fairness in lending AI is a behaviour, not a property. Learn how three operational metrics (DTPG, RVDR, and CFDI) make it measurable on a LightGBM + HDBSCAN architecture, and why conventional MLOps monitoring misses all three.

The Three Numbers That Tell You If Your Lending Model Is Fair

Most fairness conversations in Indian lending collapse into a binary: the model is fair, or it is not. That framing is operationally useless. A credit decisioning system does not have a fairness property the way it has a file size. Fairness is a behaviour: it emerges from how you segment the portfolio, which thresholds you apply to which borrowers, and whether the recourse advice you gave last quarter still holds after your latest retrain.

This post is about making that behaviour measurable. Fairness in lending AI decomposes into three independent operational quantities, each capturing a different failure mode, each computable from data you already have, and none reducible to the others. We call them DTPG, RVDR, and CFDI. If you can report these three numbers for your current production model, you know more about its fairness posture than most detailed audits will surface.


DTPG

Measures: Profitability gain from dynamic thresholds
Detects: Threshold inefficiency

RVDR

Measures: Recourse degradation after model changes
Detects: Expired borrower guidance

CFDI

Measures: Fairness inconsistency across clusters
Detects: Hidden segment-level bias


DTPG: Dynamic Threshold Profitability Gain

The first question any compliance or model-risk conversation must answer is: does segmenting your portfolio actually change outcomes, or is your static cut-off score working fine? DTPG answers this quantitatively.

Definition

$$\mathrm{DTPG}\,(\%) = \frac{\mathrm{EMP}_{\text{dynamic}} - \mathrm{EMP}_{\text{static}}}{\left|\mathrm{EMP}_{\text{static}}\right|} \times 100$$

where EMP is the Expected Maximum Profit — the peak expected profit over the test population under the given threshold policy. DTPG is the percentage gain in portfolio profit when you replace a single global threshold with a cluster-aware, per-segment threshold matrix, evaluated on the same held-out data.
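As a quick arithmetic check, the definition can be written as a few lines of Python. This is a minimal sketch: the function name `dtpg` is ours, and the inputs are the rounded ₹ Cr figures quoted later in this post, so the result lands close to the reported +217.3% without matching it exactly.

```python
def dtpg(emp_dynamic: float, emp_static: float) -> float:
    """Percentage profit gain of the dynamic policy over the static one."""
    if emp_static == 0:
        # The ratio has no meaning when the static baseline is exactly zero.
        raise ValueError("DTPG is undefined when the static EMP is zero")
    return (emp_dynamic - emp_static) / abs(emp_static) * 100.0


# Rounded EMP figures from this post, in ₹ Cr.
gain = dtpg(emp_dynamic=17.9, emp_static=-15.3)
print(f"DTPG = {gain:+.1f}%")
```

The absolute value in the denominator is what keeps the sign meaningful when the static baseline is a loss: a move from -₹15.3 Cr to +₹17.9 Cr reads as a positive gain, not a negative one.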

What it tells you

A positive DTPG is empirical evidence that your current static threshold is suboptimal: the population you score is genuinely heterogeneous, and a borrower's cluster membership contains information a single number cannot encode. A DTPG of zero means segmentation adds nothing measurable: at the model's current resolution, the population behaves as if it were uniform.

On the CRIF Highmark UP SVCL/SBL dataset (50,000 synthetic borrowers, 30.47% NPA rate), the NomoCrit v2 pipeline — running LightGBM + HDBSCAN — produced a DTPG of +217.3%: dynamic EMP of ₹17.9 Cr versus a static baseline of -₹15.3 Cr, a ₹33 Cr swing in a single scoring period. This compares against v1's +160.9% (XGBoost + KMeans), a gain of +56 percentage points driven by three compounding improvements:

  • LightGBM's APS of 0.9348 (+3.3pp over v1's 0.9017) produces tighter risk-rank ordering, steepening the profit curve per cluster
  • HDBSCAN's 13 density-based segments allow finer threshold granularity: C0 Mid-Risk is assigned t=0.05 (35.3% approval) while C9 High-Risk PriSec gets t=0.16 (1.3% approval) — a calibration impossible with 5 forced KMeans blobs
  • The static floor improves too: LightGBM's higher discrimination means fewer borderline NPAs slip through at t=0.25, shrinking the static loss from -₹25.3 Cr to -₹15.3 Cr

C0 Mid-Risk Portfolio (23.2% of borrowers) generates 92% of total dynamic profit at ₹16.5 Cr — this is the segment that static thresholds perpetually underprice. HDBSCAN's ability to isolate it without forced spherical boundaries is where the extra ₹2.5 Cr over v1 actually comes from.

A critical new finding unique to v2: the 9.8% noise pool (4,889 borrowers, NPA=63.8%). These are genuine outliers that KMeans would have force-assigned to its nearest cluster and potentially mislabeled into a batch approval or rejection. HDBSCAN flags them correctly as unclassifiable — they should be routed to manual underwriting, not auto-decisioned.
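The routing rule described above can be sketched as follows. Several things here are assumptions for illustration: the function name `decide`, the rule that a loan is approved when its predicted default probability falls below the cluster threshold, and the choice to send clusters without a configured threshold to manual review. Only the C0 and C9 thresholds come from this post; HDBSCAN implementations conventionally label noise points as -1.

```python
NOISE_LABEL = -1  # conventional HDBSCAN label for points outside every cluster

# Partial threshold matrix: only the two values quoted in this post.
CLUSTER_THRESHOLDS = {0: 0.05, 9: 0.16}


def decide(p_default: float, cluster: int) -> str:
    """Return 'approve', 'reject', or 'manual_review' for one borrower."""
    if cluster == NOISE_LABEL:
        # Noise-pool borrowers are never auto-decisioned.
        return "manual_review"
    threshold = CLUSTER_THRESHOLDS.get(cluster)
    if threshold is None:
        # Assumption: an unconfigured cluster is treated like noise.
        return "manual_review"
    return "approve" if p_default < threshold else "reject"


print(decide(0.10, cluster=0))   # above C0's cut-off
print(decide(0.10, cluster=9))   # below C9's cut-off
print(decide(0.01, cluster=-1))  # noise pool
```

The same predicted probability produces different decisions in different clusters, which is the whole point of a threshold matrix, while the noise pool bypasses the matrix entirely.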

Interpreting DTPG

  • DTPG = 0 — static and dynamic policies are financially equivalent; segmentation provides no measurable portfolio benefit at the current model's resolution
  • DTPG > 0 — dynamic policy outperforms; the size of the gain reveals how much risk heterogeneity your segments capture
  • DTPG < 0 — your per-cluster threshold calibration is worse than the global baseline, usually indicating insufficient data in some cells or overfitting to training-set EMP
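To see how the three cases arise, here is a toy sketch that fits per-cluster thresholds on training loans and scores them on held-out loans. It is a deliberately simplified stand-in for EMP: profit is realised profit with fixed per-loan payoffs, and the payoff values, threshold grid, cluster names, and data are all hypothetical.

```python
from collections import defaultdict

GAIN, LOSS = 1.0, 5.0  # hypothetical payoff per repaid / defaulted loan


def profit(loans, threshold):
    """Profit from approving every (p_default, is_npa) loan below the threshold."""
    return sum(-LOSS if is_npa else GAIN
               for p_default, is_npa in loans if p_default < threshold)


def fit_thresholds(train, grid):
    """Pick the profit-maximising grid threshold for each cluster on training data."""
    by_cluster = defaultdict(list)
    for cluster, p_default, is_npa in train:
        by_cluster[cluster].append((p_default, is_npa))
    return {c: max(grid, key=lambda t: profit(loans, t))
            for c, loans in by_cluster.items()}


def dynamic_profit(test, thresholds):
    """Profit on held-out loans under the per-cluster threshold matrix."""
    return sum(-LOSS if is_npa else GAIN
               for cluster, p_default, is_npa in test
               if p_default < thresholds[cluster])


GRID = [0.05, 0.10, 0.15, 0.25]
train = [("A", 0.03, False), ("A", 0.08, True), ("A", 0.20, True),
         ("B", 0.07, False), ("B", 0.12, False), ("B", 0.20, True)]
test = [("A", 0.02, False), ("A", 0.09, True),
        ("B", 0.06, False), ("B", 0.11, False), ("B", 0.22, True)]

thresholds = fit_thresholds(train, GRID)
emp_dynamic = dynamic_profit(test, thresholds)
emp_static = profit([(p, y) for _, p, y in test], 0.25)  # one global cut-off
dtpg_pct = (emp_dynamic - emp_static) / abs(emp_static) * 100
print(thresholds, emp_dynamic, emp_static, round(dtpg_pct, 1))
```

Because the thresholds are fitted on one sample and scored on another, nothing guarantees a non-negative result. With thin cells, a threshold that looked optimal in training can lose money on held-out data, which is the DTPG < 0 case.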

RVDR: Recourse Validity Degradation Rate

Under RBI's emphasis on grievance redressal and explainable digital lending, declined borrowers must receive actionable guidance. RVDR measures whether the guidance you issued is still true after a model update. In v2, this metric saw the most dramatic improvement of any in the pipeline.

Definition

$$\mathrm{RVDR}\,(\%) = \frac{\left| R_{\text{valid},v1} \setminus R_{\text{valid},v2} \right|}{\left| R_{\text{valid},v1} \right|} \times 100$$

Concretely: take all counterfactual recourse recommendations that were valid under model version v1. After retraining to v2, re-evaluate every one of those recommendations against the new model and threshold. RVDR is the share that is now invalidated: the borrower followed your instructions and would still be rejected. The results below quote RVDR as a fraction (0.2047) rather than as a percentage (20.47%).
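The set arithmetic in the definition maps directly onto Python sets. A minimal sketch, with hypothetical recourse identifiers and a function name of our choosing; it returns the fraction form rather than the percentage:

```python
def rvdr(valid_v1: set, valid_v2: set) -> float:
    """Fraction of v1-valid recourses that are no longer valid under v2."""
    if not valid_v1:
        raise ValueError("no v1-valid recourses to evaluate")
    # Set difference: valid under v1 but not under v2.
    return len(valid_v1 - valid_v2) / len(valid_v1)


valid_v1 = {"r1", "r2", "r3", "r4", "r5"}
valid_v2 = {"r2", "r5", "r9"}  # r9 is new under v2 and does not affect RVDR

print(rvdr(valid_v1, valid_v2))
```

Recourses that only became valid under v2 sit outside the numerator and the denominator, so RVDR measures broken promises only, not new opportunities.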

The v1 baseline failure

Under v1 (DiCE unconstrained counterfactuals on XGBoost), RVDR was 0.9903 — meaning 99% of issued recourses became invalid upon model update. This is not a statistical anomaly; it is the expected behaviour of unconstrained DiCE perturbation. DiCE finds the nearest decision boundary in feature space with no knowledge of causal structure. When the model retrains, those boundary-adjacent perturbations land in a different region entirely. The borrower follows advice that was geometrically valid yesterday and gets rejected again today.

The v2 improvement: causal DAG-constrained recourse

The v2 pipeline replaces DiCE with a CARLA-style causal recourse engine, constrained by a DAG that locks immutable features (disbursal_year, district_historical_npa_rate, lender_type_encoded) and restricts mutable actions to three levers: ticket_ordinal, concentration_ratio, and product_type_encoded. The pipeline never tells a borrower to move to a different district — a well-documented DiCE failure mode in geographically-structured credit risk.
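The immutable/mutable split can be enforced with a simple check on any proposed counterfactual. This sketch uses the feature names listed above, but the function names and sample values are hypothetical. It validates only which features changed, not the structural equations a full causal DAG would also impose.

```python
IMMUTABLE = {"disbursal_year", "district_historical_npa_rate", "lender_type_encoded"}
MUTABLE = {"ticket_ordinal", "concentration_ratio", "product_type_encoded"}


def changed_features(original: dict, counterfactual: dict) -> set:
    """Names of features whose value differs between the two profiles."""
    keys = set(original) | set(counterfactual)
    return {k for k in keys if original.get(k) != counterfactual.get(k)}


def respects_constraints(original: dict, counterfactual: dict) -> bool:
    """True when every changed feature is one of the three mutable levers."""
    return changed_features(original, counterfactual) <= MUTABLE


borrower = {"disbursal_year": 2021, "district_historical_npa_rate": 0.31,
            "lender_type_encoded": 2, "ticket_ordinal": 1,
            "concentration_ratio": 0.40, "product_type_encoded": 3}

actionable = dict(borrower, ticket_ordinal=2, concentration_ratio=0.25)
move_district = dict(borrower, district_historical_npa_rate=0.10)

print(respects_constraints(borrower, actionable))
print(respects_constraints(borrower, move_district))
```

Checking against the allow-list of mutable features, not a deny-list of immutable ones, means any feature added to the model later is locked by default until someone decides it is actionable.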

| Recourse Metric | v1 (DiCE) | v2 (CARLA DAG) | Δ |
|---|---|---|---|
| Success Rate | 36% | 82% | +46pp |
| Causal Validity (DAG) | N/A | 100% | New |
| Mean L2 Distance | N/A | 0.970 | New |
| RVDR | 0.9903 | 0.2047 | −0.785 |
| RVDR 95% CI | | [0.15, 0.27] | STABLE |

RVDR dropped from 0.99 to 0.205, so causal recourse is invalidated about 4.8× less often across model updates than unconstrained perturbation. The 82% success rate means 82% of sampled small-ticket rejected borrowers have a valid, causally-grounded action path to approval. The 95% Wilson confidence interval of [0.15, 0.27] bounds the sampling uncertainty: even at its upper end, RVDR stays far below v1's 0.99, so the improvement is not a lucky sample result.
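For readers who want to reproduce an interval of this kind, the Wilson score interval needs only the standard library. The counts below (35 invalidated out of 171) are hypothetical: this post does not state the sample size, and they were chosen only because they give a point estimate close to 0.2047.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts, not taken from the pipeline run.
low, high = wilson_interval(successes=35, n=171)
print(f"RVDR = {35 / 171:.4f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples and for proportions near the extremes, which matters for a rate like v1's 0.99.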

What expires recourse

A recourse recommendation is a snapshot of the decision boundary at the moment it was generated. That boundary moves when:

  • The model retrains — new coefficients shift which feature values cross the threshold
  • The threshold recalibrates — a yield or risk adjustment changes the cut-off the counterfactual was computed against
  • Feature engineering changes — redefining a derived variable silently changes what "compliance" means
  • Credit policy amendments — a new hard rule the original counterfactual never encoded

The legal dimension

RVDR is not just a hygiene metric; it is a liability metric. Issuing actionable recourse creates an implicit representation that following the advice will change the outcome. When recourse silently expires and a borrower is declined a second time after complying, the lender has made and broken a promise without notification. That is precisely the kind of consumer-protection complaint that is easiest for a borrower to document and hardest for a lender to rebut. v1's 0.99 RVDR means essentially every borrower who acted on your advice was set up to fail again. v2's 0.205 is still not zero, but it is a governed number with a confidence interval you can defend.


CFDI: Cluster-Fairness Disparity Index

CFDI surfaces fairness failures that aggregate statistics actively hide. A model can pass every population-level fairness check and still exhibit severe within-cluster disparities across the very segments its thresholds target.

Definition

$$\mathrm{CFDI} = \operatorname{std}_{c \in C}\left(\mathrm{EOD}_c\right) + \operatorname{std}_{c \in C}\left(\mathrm{DIR}_c\right)$$

where for each cluster c:

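  • EOD (Equalized Odds Difference): the difference in true positive rates between the protected group and the reference group within that cluster. A gap measured on true positive rates alone is often called the equal opportunity difference; full equalized odds also compares false positive rates.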
  • DIR (Disparate Impact Ratio) — the ratio of approval rates between the protected group and the reference group within that cluster; 1.0 is perfect parity, 0.0 means the protected group receives zero approvals

CFDI is not a per-cluster check — it measures how inconsistently the model treats the protected group as you move from one borrower segment to another.
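A small sketch makes the distinction concrete. The per-cluster rates below are invented, the dictionary keys are our own naming, and we assume the population standard deviation (`statistics.pstdev`), since the definition does not say which estimator is used.

```python
from statistics import pstdev


def cfdi(clusters: dict) -> float:
    """Sum of the across-cluster standard deviations of EOD and DIR."""
    eods, dirs = [], []
    for stats in clusters.values():
        # EOD: gap in true positive rates (protected minus reference).
        eods.append(stats["tpr_protected"] - stats["tpr_reference"])
        # DIR: ratio of approval rates; assumes a non-zero reference rate.
        dirs.append(stats["approval_protected"] / stats["approval_reference"])
    return pstdev(eods) + pstdev(dirs)


uneven = {
    "C0": {"tpr_protected": 0.80, "tpr_reference": 0.90,
           "approval_protected": 0.30, "approval_reference": 0.40},
    "C1": {"tpr_protected": 0.70, "tpr_reference": 0.90,
           "approval_protected": 0.20, "approval_reference": 0.40},
}
# Same severe disparity in every cluster: DIR is 0.5 everywhere.
uniform = {"C0": uneven["C1"], "C1": uneven["C1"]}

print(round(cfdi(uneven), 3))
print(cfdi(uniform))
```

The second portfolio fails a four-fifths DIR check in every cluster yet scores a CFDI of zero, because the disparity is perfectly consistent. CFDI therefore belongs next to per-cluster DIR and EOD checks, not in place of them.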

The v2 CFDI result and what it means

| Fairness Metric | v1 | v2 | RBI Threshold | Status |
|---|---|---|---|---|
| DIR Baseline (Ticket Size) | 0.3634 | 0.3740 | ≥ 0.80 | FAIL |
| DIR Post-Remediation | 0.4242 | 0.4298 | ≥ 0.80 | FAIL |
| Fair-Model AUC | | 0.9881 | | ✓ Stable |
| CFDI | 0.131 | 0.126 | < 0.10 = Green | Amber |

CFDI improved slightly from 0.131 to 0.126, moving toward the 0.10 green threshold but remaining in the AMBER band. The DIR failure, however, is not a model bug and cannot be fixed by threshold tuning or reweighting. It reflects a genuine structural disparity: small-ticket borrowers (approval rate 34.4%) face a business-risk reality that algorithmic remediation cannot overcome because ticket_size is a legitimate risk predictor, not a proxy for a protected demographic attribute.

The underlying CRIF UP portfolio tells you exactly why: NPA rate by count is 30.47% but by value only 6.50%, so NPA risk is concentrated in small-ticket, priority-sector borrowers, not distributed across the book. This count/value divergence is the key credit policy insight: the DIR failure axis is ticket size because that is where actual default risk lives in this portfolio. The worst single product type reaches 47.87% NPA (Business Loan Priority Sector Small Business); the worst district reaches 69.28% NPA.

With HDBSCAN's 13 clusters, v2 can now make per-segment product recommendations that v1 (5 KMeans blobs) could not. The correct regulatory path is not threshold manipulation but product-level redesign for small-ticket segments — group lending mechanisms, guarantee-backed disbursals — which NomoCrit's v2 cluster personas can now prescribe with causal structure intact.


Why Conventional MLOps Platforms Miss All Three

The gap is categorical, not incremental. Standard MLOps monitors model performance — whether the model still predicts what it predicted in training. Governance requires monitoring decision consequences — what the model's outputs do to people and to regulatory exposure. These are different objects.

| Dimension | Performance Monitoring | Governance Monitoring (DTPG / RVDR / CFDI) |
|---|---|---|
| Core Question | Is the model still accurate? | Are decisions still profitable, actionable, and fair across borrower segments? |
| Metrics | AUC, Log Loss, PSI | DTPG, RVDR, CFDI |
| Unit of Analysis | The prediction | The decision and its downstream impact on the borrower |
| Drift Detected | Feature distribution shift (PSI) | Outcome dispersion drift, recourse degradation, and threshold misalignment |
| Segment Awareness | Pooled / Global | Native per-cluster monitoring |
| Threshold Governance | Fixed input, rarely audited | Primary governed object |
| Failure Caught | Model performance degrades silently | Model remains accurate, but decisions become unprofitable, recourse becomes stale, or fairness deteriorates within segments |

The v2 pipeline makes this concrete with a real example: LightGBM holds AUC = 0.9884 and PSI < 0.01 — perfectly green on every MLOps dashboard — while the fairness audit simultaneously shows DIR = 0.374, well below the RBI FRFI threshold of 0.80. Performance monitoring reports clean. The decisions are not.


How NomoCrit v2 Tracks These in Production

NomoCrit v2 treats the threshold configuration and model version as the two governed coordinates, computing all three metrics against every (model_version, threshold_config) pair in production.

DTPG is computed at every threshold recalibration event, comparing the incoming per-cluster EMP matrix against the outgoing baseline on the same held-out evaluation set. The v2 upgrade from +160.9% to +217.3% is itself a DTPG delta — the risk team saw the exact ₹2.5 Cr per-period improvement before the model shipped.

RVDR is maintained as a live ledger of outstanding causal recourse recommendations, each stamped with the model version, threshold configuration, and DAG constraints it was computed against. On every trigger — model redeploy, threshold shift past tolerance, credit policy amendment — NomoCrit re-evaluates the full ledger and flags the exact borrowers whose advice has expired so notification workflows fire before those borrowers reapply. At RVDR=0.205 with CI [0.15, 0.27], the governed exposure is known and bounded.
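One way to picture such a ledger is as a list of stamped entries that gets replayed against the new decision function on every trigger. This is an illustrative sketch, not NomoCrit's implementation: the class, field names, version labels, and the stand-in approval rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RecourseEntry:
    borrower_id: str
    model_version: str      # model the advice was computed against
    threshold_config: str   # threshold matrix in force at the time
    dag_version: str        # causal constraints used to generate the advice
    recommended_features: Dict[str, float]  # profile after following the advice


def expired_borrowers(ledger: List[RecourseEntry],
                      approves: Callable[[Dict[str, float]], bool]) -> List[str]:
    """Borrowers whose recommended profile the current model would still reject."""
    return [e.borrower_id for e in ledger if not approves(e.recommended_features)]


ledger = [
    RecourseEntry("B001", "v1", "t-01", "dag-01",
                  {"ticket_ordinal": 3, "concentration_ratio": 0.20}),
    RecourseEntry("B002", "v1", "t-01", "dag-01",
                  {"ticket_ordinal": 2, "concentration_ratio": 0.35}),
    RecourseEntry("B003", "v1", "t-01", "dag-01",
                  {"ticket_ordinal": 4, "concentration_ratio": 0.10}),
]


def approves_v2(features: Dict[str, float]) -> bool:
    # Stand-in for "redeployed model + new threshold"; not a real scoring rule.
    return features["ticket_ordinal"] >= 3 and features["concentration_ratio"] <= 0.25


expired = expired_borrowers(ledger, approves_v2)
rvdr = len(expired) / len(ledger)
print(expired, round(rvdr, 3))
```

Keeping the stamps on each entry is what makes the replay auditable: for any expired recommendation you can say which model, threshold matrix, and constraint set it was valid under.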

CFDI is recomputed on every scoring batch. With 13 HDBSCAN clusters, EOD and DIR are aggregated per segment, std(EOD) and std(DIR) are summed into the CFDI scalar, and crossing the 0.15 watch band raises a calibration-review event tied to the specific divergent cluster pair. The v2 result of 0.126 puts the portfolio in the AMBER band — monitored, not alerted.
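The banding logic itself is small. In this sketch the 0.10 and 0.15 cut-offs come from the text above, while the band name "ALERT" and the handling of values that fall exactly on a boundary are our assumptions.

```python
GREEN_MAX = 0.10  # below this, CFDI is green
WATCH_MAX = 0.15  # above this, a calibration-review event is raised


def cfdi_band(cfdi: float) -> str:
    """Map a CFDI value to its monitoring band."""
    if cfdi < GREEN_MAX:
        return "GREEN"
    if cfdi <= WATCH_MAX:
        return "AMBER"   # monitored, not alerted
    return "ALERT"       # hypothetical name for the review-triggering band


for version, value in [("v1", 0.131), ("v2", 0.126)]:
    print(version, value, cfdi_band(value))
```

Both versions land in the same band, which is a useful reminder that a band is a coarse signal: the time series of the underlying number carries the trend.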

Across versions, the three numbers form a time series indexed by deployment. The result is an auditable record that answers the regulator's hardest question — show me that this decision was profitable, that the advice you gave was honoured, and that your fairness held within segments — with three numbers and their history, not an argument.


The NomoCrit v2 pipeline ran in 24.3 seconds on 50,000 synthetic CRIF UP borrowers. All metrics above are from that production run. The Taiwan Credit benchmark results from v1 remain valid as a cross-dataset reference point; v2 results cited throughout are from the CRIF UP SVCL/SBL dataset.