From Loss to Profit: How AI Helped an NBFC Predict Loan Defaults with 87% Accuracy
- Loren Cossette
- May 20
In today's financial landscape, non-banking financial companies (NBFCs) play a crucial role in providing credit access to underserved segments. However, with broader lending comes increased risk. This article explores how data science transformed one NBFC's approach to loan default prediction, resulting in significant improvements to their risk assessment capabilities.

The Challenge: Rising Defaults in a Competitive Market
DHDL Ltd., an Indian NBFC, faced a growing challenge: rising loan defaults were eroding profitability. Unlike traditional banks, NBFCs often serve clients with limited credit history or those seeking faster loan processing. This competitive advantage comes with a cost—higher default rates.
With a dataset of 90,000+ past loan records, DHDL needed to answer two critical questions:
Which clients are most likely to default?
What factors most strongly predict default?
The Data Science Approach
Data Exploration Reveals Hidden Insights
Our first step was comprehensive data exploration. The dataset contained diverse information:
Client demographics: Income, job experience, home ownership
Loan characteristics: Amount, term, grade, interest rate, purpose
Credit metrics: Delinquencies, revolving balances, account totals
One striking discovery was that approximately 8% of records had missing information about revolving credit accounts. This seemingly innocuous data gap turned out to be one of the most powerful predictors: clients with missing revolving account information defaulted at a rate of nearly 80%, compared to just 19% for those with complete information.
```python
# Create a binary flag for missing revolving info
df['missing_revolving_info'] = df['total_current_balance'].isnull().astype(int)

# Analyze default rate by missing flag
default_by_missing = df.groupby('missing_revolving_info')['default'].mean()
print(f"Default rate with complete data: {default_by_missing[0]:.2%}")
print(f"Default rate with missing data: {default_by_missing[1]:.2%}")
```
Data Preprocessing: Transforming Raw Data into Model-Ready Features
The preprocessing pipeline included:
Data cleaning: Converting text values to numeric (e.g., "3 years" → 3)
Missing value handling: Creating flags for systematically missing data
Feature engineering: Log-transforming skewed financial variables; Encoding categorical variables; Creating interaction terms
```python
import numpy as np

# Log-transform heavily skewed financial variables, guarding
# against non-positive values before applying log1p
skewed_cols = ['annual_income', 'revolving_balance', 'total_current_balance']
for col in skewed_cols:
    df[f'{col}_log'] = np.where(df[col] <= 0, 0, np.log1p(df[col]))
```
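The cleaning and encoding steps can be sketched the same way. This is a minimal illustration with hypothetical column names and values (the actual field names in DHDL's dataset aren't specified in the article):

```python
import pandas as pd

# Hypothetical sample of the raw fields described above
df = pd.DataFrame({
    "job_experience": ["3 years", "10+ years", "< 1 year"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# Data cleaning: extract the numeric part of "3 years" -> 3
df["job_experience_num"] = (
    df["job_experience"].str.extract(r"(\d+)").astype(float)
)

# Encoding: one-hot encode a categorical column
df = pd.get_dummies(df, columns=["home_ownership"], prefix="home")
```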
Plotting default rate against loan grade clearly showed risk increasing as grades worsened.
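The underlying numbers for such a plot come from a simple groupby. A minimal sketch with made-up rows (the real dataset has 90,000+ records):

```python
import pandas as pd

# Hypothetical mini-sample: a 0/1 default flag per loan
df = pd.DataFrame({
    "loan_grade": ["A", "A", "B", "B", "F", "F", "G", "G"],
    "default":    [0,   0,   0,   1,   1,   0,   1,   1],
})

# Mean of the 0/1 flag per grade = default rate per grade;
# feeding this Series to a bar chart shows risk rising by grade
default_by_grade = df.groupby("loan_grade")["default"].mean()
```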
Machine Learning: Finding the Right Model
We developed multiple models to predict defaults:
1. Logistic Regression: Provided interpretable coefficients
Accuracy: 77.8%
AUC: 0.83
Key insight: Strong coefficient for the missing revolving info flag
2. Random Forest: Emphasized high precision
Accuracy: 82.4%
Precision for defaults: 85%
Useful for conservative lending scenarios
3. XGBoost: Our top performer
Accuracy: 86.9%
AUC: 0.91
F1 Score: 0.74
Balanced precision (70%) and recall (78%)
4. Neural Network: Competitive deep learning approach
Accuracy: 86.6%
Required threshold optimization for best performance
5. Ensemble Model: Combined the strengths of all models
Slightly improved overall robustness
More consistent across different client segments
```python
from xgboost import XGBClassifier

# XGBoost implementation example
model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    # Weight the minority (default) class to handle class imbalance
    scale_pos_weight=(len(y_train) - sum(y_train)) / sum(y_train),
    random_state=42,
)
model.fit(X_train, y_train)
```
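The article doesn't specify how the ensemble in item 5 was built; one common form is soft voting, which averages each model's predicted default probability. A sketch on synthetic data, with scikit-learn estimators standing in for the fitted models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Soft voting averages each model's predicted probabilities,
# smoothing out the individual models' weaknesses
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
proba = ensemble.predict_proba(X_test)[:, 1]
```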
Feature Importance: What Really Drives Defaults?
The top five predictors of loan default:
Missing revolving account information: The strongest single predictor
Loan grade: Grades F and G indicated substantially higher risk
Interest rate: Higher rates strongly correlated with defaults
Debt-to-income ratio: Higher ratios increased default probability
Loan purpose: Certain purposes (especially debt consolidation) showed elevated risk
Below is a simplified feature importance chart from our XGBoost model:
Feature Importance
----------------------------------
missing_revolving_info 29.7%
loan_grade_G 12.3%
loan_grade_F 9.8%
interest_rate 7.6%
debt_to_income 6.4%
loan_purpose_debt_cons 5.2%
...
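A chart like this can be produced from any fitted tree-based model's `feature_importances_` attribute (XGBClassifier exposes the same attribute after fitting). A sketch using a RandomForest on synthetic data, with the feature names above standing in as labels:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan features
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

names = ["missing_revolving_info", "loan_grade_G", "interest_rate",
         "debt_to_income", "loan_purpose_debt_cons"]

# Importances sum to 1; sort descending to mirror the chart above
importance = (
    pd.Series(model.feature_importances_, index=names)
    .sort_values(ascending=False)
)
for name, share in importance.items():
    print(f"{name:<25} {share:.1%}")
```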
Technical Deep Dive: Why XGBoost Performed Best
For the technically inclined, XGBoost's superior performance can be attributed to several factors:
Handling of non-linear relationships: Unlike logistic regression, XGBoost captures complex interactions between variables without explicit feature engineering.
Robustness to multicollinearity: Many financial variables correlate with each other, which challenges linear models but doesn't significantly impact tree-based methods.
Automatic handling of missing values: XGBoost has built-in mechanisms for dealing with missing data that proved effective with our dataset.
Gradient boosting advantage: The sequential improvement of weak learners allows XGBoost to focus on hard-to-classify cases.
Our implementation included careful hyperparameter tuning:
```python
# Hyperparameter grid (simplified)
param_grid = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.1, 0.01],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
```
Implementation and Business Impact
From Model to Business Value
The prediction model was implemented as part of the loan approval workflow, providing:
Probability scores: Each applicant receives a default probability from 0% to 100%
Risk factors: Specific elements in each application that contribute to risk
Visualization tools: Dashboards showing risk distribution across the portfolio
API integration: Seamless connection with existing loan processing systems
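The scoring step behind such a workflow reduces to mapping `predict_proba` output onto a 0-100 scale. A minimal sketch (the function name and fields are illustrative, not DHDL's actual interface):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a fitted production model
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

def default_score(model, applicant_features):
    """Map a model's predicted default probability to a 0-100 risk score."""
    proba = model.predict_proba(applicant_features.reshape(1, -1))[0, 1]
    return round(100 * proba, 1)

score = default_score(model, X[0])
```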
Measurable Business Outcomes
Based on validation data and initial implementation, DHDL can expect:
Reduced default rate: Potentially by 20-30% by adjusting approval thresholds
Improved lending efficiency: 87% accuracy means fewer manual reviews
Better risk-based pricing: More precise interest rates based on actual risk
Portfolio insight: Clear understanding of which factors drive defaults
Recommendations for NBFCs
Based on our findings, here are key recommendations for any NBFC looking to improve default prediction:
Pay attention to missing data patterns: Sometimes what's missing tells you more than what's present.
Combine models for robust predictions: Different models capture different aspects of risk.
Look beyond traditional credit metrics: Loan purpose and application completeness can be powerful predictors.
Implement threshold optimization: The standard 50% threshold is rarely optimal for classification problems with class imbalance.
Balance precision and recall based on business objectives: Conservative lenders might prioritize precision (avoiding bad loans), while growth-focused lenders might emphasize recall (not missing good loans).
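Threshold optimization can be done by sweeping candidate cutoffs on held-out data and keeping the one that maximizes the metric the business cares about. A sketch using F1 as that metric, on an imbalanced synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic dataset: ~20% positives, like a default problem
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# precision_recall_curve evaluates every candidate threshold;
# pick the one that maximizes F1 (the arrays are one longer than
# thresholds, hence the [:-1] alignment)
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best_threshold = thresholds[f1[:-1].argmax()]
```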
The Technical-Business Bridge
What makes this solution powerful is the bridge it creates between technical sophistication and business utility:
Model explainability: Despite using advanced algorithms, results are presented in business terms
Implementation flexibility: The modular design allows for regular updates and improvements
Risk-based insights: Beyond just predicting defaults, the model provides strategic portfolio insights
Conclusion: Data Science as a Competitive Advantage
In the competitive NBFC market, sophisticated default prediction is no longer optional—it's essential for sustainable profitability. By leveraging machine learning effectively, DHDL transformed its approach to risk assessment.
The most valuable insight went beyond the 87% accuracy figure: understanding which factors truly drive defaults enables strategic adjustments to lending policies, potentially transforming a struggling portfolio into a profitable one.
For NBFCs and traditional lenders alike, the message is clear: modern machine learning approaches, properly implemented with domain knowledge, can provide a significant competitive edge in risk assessment and portfolio management.
This article describes an anonymized case study based on real work in the NBFC sector. The solution employed Python with scikit-learn, XGBoost, and TensorFlow to build and validate the prediction models.