Thin-File Borrowers and the Next Frontier of Lending Growth

The most consequential lending opportunity in India is hiding in a definitional gap. A borrower who repays a phone bill on time for thirty-six straight months, files GST returns every quarter, and runs a stable current account is, to a traditional bureau model, indistinguishable from someone with no financial existence at all. Both produce the same output: insufficient data. The opportunity is not to invent a new loan product. It is to deploy — under governance — the alternative data infrastructure India has already built, to make the credit-invisible legible without quietly rebuilding the exclusion we are trying to escape.

1. Defining the Thin-File Problem Technically

A thin-file borrower is best defined not by behaviour but by the geometry of their data. Formally, a borrower is thin-file when their credit-history feature vector $\mathbf{x} \in \mathbb{R}^d$ has an effective dimensionality below the minimum required for a bureau score to achieve predictive validity. In operational terms, the industry threshold is a tradeline history shorter than six months combined with fewer than two active credit facilities. Below this boundary, the feature vector is too sparse to estimate the conditional default distribution $P(\text{default} \mid \mathbf{x})$ with acceptable variance.
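The operational threshold above can be written as a simple predicate. This is a minimal sketch: the `BureauSnapshot` type and its field names are hypothetical, and only the two cut-offs (six months, two facilities) come from the definition in the text.

```python
from dataclasses import dataclass

# Cut-offs taken from the operational definition in the text.
MIN_TRADELINE_MONTHS = 6
MIN_ACTIVE_FACILITIES = 2


@dataclass(frozen=True)
class BureauSnapshot:
    """Hypothetical minimal view of a bureau record; field names are illustrative."""
    tradeline_history_months: int
    active_credit_facilities: int


def is_thin_file(snapshot: BureauSnapshot) -> bool:
    """True when history is under six months AND fewer than two facilities are active."""
    return (
        snapshot.tradeline_history_months < MIN_TRADELINE_MONTHS
        and snapshot.active_credit_facilities < MIN_ACTIVE_FACILITIES
    )
```

Both conditions must hold together: a borrower with three months of history but two active facilities falls outside the definition as stated.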

This produces the cold-start problem in credit. It is critical to separate two quantities that the market routinely conflates: the borrower's true default probability $p^*$, and the estimator's variance $\mathrm{Var}(\hat{p})$ around it. The absence of formal credit history says nothing about $p^*$: a first-time borrower is not inherently riskier. What the absence of history does is inflate $\mathrm{Var}(\hat{p})$, because the model has no observations on which to condition. A risk-averse lender then prices or rejects on the basis of estimator uncertainty rather than estimated risk. The thin-file borrower is penalised for the model's ignorance, not for their own behaviour. This is an epistemic failure dressed up as a credit decision.
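One way to see the difference between estimated risk and estimator uncertainty is a toy Beta-Binomial model. This is an illustration only, not how a bureau score works: the prior `Beta(1, 19)` (mean 5%) and the idea of treating each monthly repayment as one observation are assumptions made for the sketch.

```python
def beta_posterior(defaults: int, observations: int,
                   prior_a: float = 1.0, prior_b: float = 19.0):
    """Posterior mean and variance of a default rate under a Beta prior.

    The prior parameters are illustrative assumptions, not calibrated values.
    """
    if defaults < 0 or observations < defaults:
        raise ValueError("need 0 <= defaults <= observations")
    a = prior_a + defaults
    b = prior_b + (observations - defaults)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# No history at all: the estimate falls back to the prior, with wide variance.
thin_mean, thin_var = beta_posterior(defaults=0, observations=0)

# Thirty-six clean monthly observations: the variance shrinks substantially.
thick_mean, thick_var = beta_posterior(defaults=0, observations=36)
```

In this toy setting the thin-file estimate is not higher because the borrower is riskier; it is simply less certain, which is the distinction the paragraph above draws.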

The scale is national. TransUnion CIBIL's financial-inclusion research found that the credit-served population rose from roughly 91 million in 2017 to 179 million in 2021, still only about 22% of the adult population. The same body of work classified on the order of 480 million adults as credit-unserved and a further ~164 million as credit-underserved. Credit penetration among 18–33-year-olds in rural and semi-urban India sits around 8%. The data rails to reach these people already exist: 56.16 crore PMJDY accounts as of August 2025, two-thirds of them in rural and semi-urban areas and more than half held by women. The credit gap is therefore not a coverage gap in banking; it is a legibility gap in scoring. And it is widening at the margin: new-to-credit originations of consumption-led products fell 21% year-on-year in the quarter ending December 2024, as lenders retreated from precisely the segment that traditional models cannot price.

2. Alternative Data Signals — A Technical Characterisation

India has, almost uniquely, four mature public or quasi-public data layers from which thin-file signal can be engineered. Each must be specified along four axes: raw stream, derivable features, signal evidence, and governance requirement.

2.1 Account Aggregator (AA) — bank transaction data

The AA framework is the richest source. As of 2025 it had over 18 crore linked accounts and 24 crore-plus consent requests, with more than ₹1.6 lakh crore disbursed across 1.8 crore loan accounts through 780+ participating institutions. The raw stream is consented CASA transaction history. From it we engineer features such as avg_monthly_inflow, an income_volatility_index, EMI_payment_regularity, overdraft_utilisation_30d, and savings_rate_3mo.

Feature engineering here is non-trivial. Raw transaction narrations are unstructured; salary credits must be disambiguated from peer transfers, recurring obligations identified by periodicity rather than label. A defensible income-volatility feature is the coefficient of variation of monthly inflow:

$$\text{IVI} = \frac{\sigma(\text{inflow}_{1:n})}{\mu(\text{inflow}_{1:n})}$$
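The coefficient of variation above can be computed directly from a list of monthly inflow totals. A minimal sketch, assuming inflows have already been aggregated per month and that $\sigma$ is the population standard deviation (the text does not say whether population or sample is intended):

```python
from statistics import mean, pstdev
from typing import Sequence


def income_volatility_index(monthly_inflows: Sequence[float]) -> float:
    """Coefficient of variation of monthly inflow: sigma / mu.

    Uses the population standard deviation; swap in statistics.stdev
    if the sample estimator is preferred.
    """
    if len(monthly_inflows) < 2:
        raise ValueError("need at least two months of inflow")
    mu = mean(monthly_inflows)
    if mu <= 0:
        raise ValueError("mean inflow must be positive")
    return pstdev(monthly_inflows) / mu
```

A perfectly flat salary yields an index of zero; the index rises as inflows become more uneven relative to their average.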

Signal evidence is strong: cash-flow regularity is a direct measurement of repayment capacity, not a proxy for it. Governance requirement: AA data is consent-bound and purpose-limited under the DPDP Act, 2023; derived features inherit the consent scope and cannot be silently repurposed across products.

2.2 GST Network — tax filing data

For the self-employed and MSME borrower — who is structurally invisible to salary-based models — the GSTN is the single most informative source. Features include a GST_filing_consistency_score, revenue_growth_trend, seasonal_revenue_stability, and compliance_regularity. Consistent GST filing is a powerful proxy for business health because it is costly to fake and costly to maintain: a business that files accurate returns on schedule, quarter after quarter, is demonstrating both operational continuity and a revealed preference for staying inside the formal system. The filing cadence itself — independent of the revenue figures — carries signal, because it reflects governance discipline at the firm level. Governance requirement: revenue features are commercially sensitive and concentration-prone; sectoral and geographic skew must be tested before deployment (see §3).
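A filing-cadence feature can be sketched without touching the revenue figures at all. The definition below (share of return periods filed on or before the due date) is an assumption made for illustration; the text names `GST_filing_consistency_score` but does not define it, and the `ReturnPeriod` type is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReturnPeriod:
    """Hypothetical record of one GST return period."""
    due_date: date
    filed_date: Optional[date]  # None means the return was never filed


def gst_filing_consistency_score(periods: Sequence[ReturnPeriod]) -> float:
    """Fraction of periods filed on or before the due date (0.0 to 1.0)."""
    if not periods:
        raise ValueError("need at least one return period")
    on_time = sum(
        1 for p in periods
        if p.filed_date is not None and p.filed_date <= p.due_date
    )
    return on_time / len(periods)
```

A production version would probably also weight recent periods more heavily and distinguish late filing from non-filing; both are left out here to keep the sketch minimal.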

2.3 UPI / Digital Payments — transaction patterns

UPI offers transaction_frequency, a merchant_diversity_index, and recurring_payment_regularity. The attraction is volume and freshness. The limitation is proxy risk: UPI velocity measures economic activity, not repayment capacity or intent. High transaction frequency may reflect a thriving micro-merchant — or someone routing flows that never settle as income. UPI is best used as a corroborating signal and a liveness check, not as a primary capacity feature. Treated as the latter, it tends to reward digital intensity, which correlates with urbanity and age — exactly the kind of latent bias §3 addresses.
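As an illustration of how a diversity feature might be built, the sketch below defines `merchant_diversity_index` as normalised Shannon entropy over merchant identifiers. That definition is an assumption; the text names the feature without specifying a formula.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def merchant_diversity_index(merchant_ids: Iterable[Hashable]) -> float:
    """Normalised Shannon entropy of payments across merchants, in [0, 1].

    0.0 means every payment went to one merchant; 1.0 means payments
    were spread evenly across all observed merchants.
    """
    counts = Counter(merchant_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one transaction")
    if len(counts) == 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))
```

Note that the index describes the shape of spending activity only. Nothing in it measures whether the flows settle as income, which is why it suits corroboration rather than capacity estimation.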

2.4 Telecom / Utility — behavioural consistency

Telecom and utility data yield bill_payment_regularity, prepaid_recharge_pattern, and tenure_stability. These are behavioural-consistency signals: a long, stable tenure with regular payments evidences reliability. The constraints are practical and legal. Access is fragmented across operators, data quality is uneven, and there is no unified consent rail equivalent to the AA. Prepaid recharge patterns — the modal case in low-income segments — are noisier than postpaid billing and easily misread as instability when they reflect deliberate cash management.

3. Why Alternative Data Without Governance Is Dangerous

The seductive error is to assume that because alternative data is different from bureau data, it is fairer. It is not, by default. Alternative signals can encode geographic, occupational, and socioeconomic structure that produces disparate impact even when no protected attribute is in the feature set.

The mechanism is proxy discrimination. Define a feature $f$ as a proxy for a protected attribute $p$ when the conditional probability of $p$ given $f$ materially exceeds its population base rate:

$$P(p \mid f) \gg P(p)$$

A merchant_diversity_index may correlate with urban residence; prepaid_recharge_pattern with income tier; GST sectoral codes with caste- or region-linked occupational clustering. None of these is demographic on its face. Each can carry demographic information into the decision. A model can be scrupulously blind to protected attributes and still systematically exclude the groups those attributes describe — the blindness guarantees only that the discrimination is undetectable from the feature list alone.
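The proxy condition can be checked empirically as a lift ratio, $P(p \mid f) / P(p)$. A minimal sketch for a binarised feature and a binary group label; the function name and the input shape are hypothetical, and the text does not state what lift counts as "materially" above the base rate, so no threshold is hard-coded.

```python
from typing import Iterable, Tuple


def proxy_lift(records: Iterable[Tuple[bool, bool]]) -> float:
    """Ratio P(protected | feature present) / P(protected).

    Each record is (feature_present, in_protected_group). A lift well
    above 1.0 suggests the feature carries information about the group.
    """
    rows = list(records)
    if not rows:
        raise ValueError("need at least one record")
    with_feature = [in_group for has_f, in_group in rows if has_f]
    if not with_feature:
        raise ValueError("feature never present; lift is undefined")
    base_rate = sum(in_group for _, in_group in rows) / len(rows)
    if base_rate == 0:
        raise ValueError("protected group absent; lift is undefined")
    conditional_rate = sum(with_feature) / len(with_feature)
    return conditional_rate / base_rate
```

Because the protected attribute is needed to run the test, this check has to happen in a governed audit environment even when the attribute is excluded from the production feature set.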

This is precisely why outcome monitoring at the cluster level, which we track as the Cluster Fairness Disparity Index (CFDI), is load-bearing infrastructure rather than an optional extra. Aggregate fairness metrics hide segment-level harm: a model can satisfy a portfolio-wide parity test while approving one borrower cluster at half the rate of another with identical realised performance. CFDI monitors approval and outcome disparities per cluster, against statistical-significance thresholds, so that exclusion shows up as a measured quantity rather than an after-the-fact discovery. Without it, alternative data can silently reproduce the exact exclusion pattern of the bureau score it was meant to replace while appearing more inclusive, because it touched more applicants. Inclusion measured by reach is not inclusion measured by outcome.
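The text does not give the CFDI formula, so the following is not CFDI itself. It is a sketch of the kind of per-cluster check such an index might build on: a pooled two-proportion z-test on approval rates between two clusters. The function names and the default critical value of 1.96 (a two-sided 5% level) are assumptions.

```python
import math


def two_proportion_z(approved_a: int, total_a: int,
                     approved_b: int, total_b: int) -> float:
    """Pooled two-proportion z statistic for approval rates of clusters A and B."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each cluster needs at least one applicant")
    rate_a = approved_a / total_a
    rate_b = approved_b / total_b
    pooled = (approved_a + approved_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0  # both clusters fully approved or fully rejected
    return (rate_a - rate_b) / se


def approval_disparity(approved_a: int, total_a: int,
                       approved_b: int, total_b: int,
                       z_critical: float = 1.96) -> dict:
    """Approval-rate gap between two clusters and whether it is significant."""
    z = two_proportion_z(approved_a, total_a, approved_b, total_b)
    return {
        "rate_gap": approved_a / total_a - approved_b / total_b,
        "z": z,
        "significant": abs(z) > z_critical,
    }
```

For example, `approval_disparity(600, 1000, 300, 1000)` describes one cluster approved at half the rate of another. Comparing realised repayment performance across the same clusters would be a separate test on outcomes, not shown here.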

4. What Governed Alternative Data Deployment Looks Like

Responsible thin-file lending requires a governance stack that sits downstream of any scoring model and governs the consequences of decisions. Five components are load-bearing.

(a) Feature governance. Every alternative signal is proxy-tested before deployment. For each candidate feature $f$ and each monitored segment, we estimate $P(p \mid f)$ and flag features whose predictive contribution is substantially mediated by their correlation with protected structure. A feature that adds accuracy only through a proxy pathway is removed, not retained.

(b) CFDI monitoring. Segment-level outcome tracking, with significance thresholds, runs continuously in production — not as a quarterly audit. Disparity is detected when it emerges, not when a regulator asks.

(c) Local explainability. Each individual decision carries a SHAP attribution over its alternative features, so that a denial can be decomposed into the specific signals that drove it. This is both a regulatory expectation under the RBI's FREE-AI framework and a precondition for the next component.

(d) Causal recourse. Explanation is necessary but insufficient; the borrower needs actionable guidance. The distinction is causal, not cosmetic. "Improve your GST filing consistency" is valid recourse: the feature is mutable, and the borrower's action plausibly moves the outcome. "Increase your UPI transaction volume" is not valid direct recourse — it is gameable, weakly causal, and inviting it would corrupt the very signal it cites. Recourse must be restricted to features that are both mutable by the borrower and causally connected to creditworthiness.

(e) RVDR tracking. Alternative-data models retrain frequently as new data arrives. Recourse generated under one model version can be silently invalidated by the next — a borrower told to do X may find X no longer changes the decision. The Recourse Validity Decay Rate (RVDR) monitors what fraction of issued recourse remains valid across retraining cycles, so that the promises made to borrowers are tracked as a measurable obligation rather than forgotten at deployment.
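Components (d) and (e) can be sketched together. The code below is an illustration under stated assumptions: the `FeaturePolicy` registry and its flags are hypothetical (only the two example entries mirror the GST and UPI cases in (d)), and the decay rate is taken to be the share of previously issued recourse that the retrained model no longer honours, since the text describes RVDR without giving a formula.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping, Sequence


@dataclass(frozen=True)
class FeaturePolicy:
    """Hypothetical governance flags attached to a feature."""
    mutable_by_borrower: bool
    causally_linked: bool


# Illustrative registry; the two entries mirror the examples in (d).
FEATURE_POLICY: Dict[str, FeaturePolicy] = {
    "GST_filing_consistency_score": FeaturePolicy(True, True),
    "transaction_frequency": FeaturePolicy(True, False),  # gameable, weakly causal
}


def valid_recourse_features(
    adverse_features: Sequence[str],
    policy: Mapping[str, FeaturePolicy] = FEATURE_POLICY,
) -> List[str]:
    """Keep only features that are both mutable and causally linked.

    Features missing from the registry are excluded by default.
    """
    return [
        name for name in adverse_features
        if name in policy
        and policy[name].mutable_by_borrower
        and policy[name].causally_linked
    ]


Profile = Mapping[str, float]


def recourse_validity_decay_rate(
    post_recourse_profiles: Sequence[Profile],
    new_model_approves: Callable[[Profile], bool],
) -> float:
    """Share of issued recourse that the retrained model no longer honours.

    Each profile is the borrower's feature vector assuming the recommended
    action was completed; new_model_approves is the retrained decision rule.
    """
    if not post_recourse_profiles:
        raise ValueError("need at least one issued recourse")
    invalidated = sum(
        1 for profile in post_recourse_profiles
        if not new_model_approves(profile)
    )
    return invalidated / len(post_recourse_profiles)
```

Excluding unregistered features by default is a deliberate choice in this sketch: a feature that has not been through governance review should not be offered to a borrower as something to act on.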

These components, alongside our threshold-parity diagnostics (DTPG), form a layer that is agnostic to the scoring model. NomoCrit does not care whether the decision came from a bureau score, an AA cash-flow model, or an ensemble of all four data layers above. It governs the decision's consequences: who was excluded, whether the explanation is faithful, whether the recourse is actionable, and whether that recourse survives the next retrain.

The Rate-Limiting Factor

It is tempting to frame the thin-file opportunity as a modelling problem — that the unlock is a better classifier on richer data. It is not. The data exists. The models are well understood. What is missing is the infrastructure that lets a lender deploy alternative data and prove it is not reproducing exclusion, and demonstrate to the borrower and the regulator that every adverse decision is explained, contestable, and accompanied by recourse that actually works.

Until that governance layer is standard, every lender faces an asymmetric bet: large upside in reaching 400 million-plus underserved adults, against an unbounded and unmeasured downside of disparate-impact liability and regulatory action under the FREE-AI framework. Rational institutions respond to unmeasured downside by retreating — which is precisely the NTC contraction the market is already showing. Governance does not slow the opportunity down. It is the thing that makes the opportunity bankable. For India's next 300 million borrowers, the binding constraint is not the model. It is the layer that makes the model safe to deploy.

FAQ

What technically defines a thin-file borrower? A borrower whose credit-history feature vector is too low-dimensional for a bureau score to estimate default probability with acceptable variance — operationally, under six months of tradeline history and fewer than two active credit facilities.

Does a lack of credit history mean higher default risk? No. The absence of history does not raise a borrower's true default probability; it raises the variance of the model's estimate of it. Thin-file borrowers are typically penalised for model uncertainty, not measured risk.

Why isn't alternative data automatically fairer than bureau data? Because alternative signals can act as proxies for protected attributes — a feature $f$ is a proxy when $P(p \mid f) \gg P(p)$ — reproducing demographic exclusion even with no demographic features in the model.

What is CFDI and why does it matter for thin-file lending? The Cluster Fairness Disparity Index tracks approval and outcome disparities at the segment level against significance thresholds, catching cluster-level exclusion that portfolio-wide fairness metrics conceal.

Why is governance the constraint rather than the model? The data and models already exist. What gates deployment is the ability to prove an alternative-data model is not reproducing exclusion and that every decision is explained, contestable, and backed by durable recourse — a model-agnostic governance layer, not a better classifier.