Machine learning models are only as good as the data they are trained on. When training data is skewed heavily toward one class, models often struggle to generalize and perform poorly on the minority class. This issue is known as class imbalance.
In real-world scenarios, such as fraud detection, medical diagnosis, and spam filtering, class imbalance is a common issue that requires careful handling to prevent biased predictions.
Class imbalance occurs when the distribution of classes in a dataset is uneven. For example, in a fraud detection dataset, 99% of transactions may be legitimate while only 1% are fraudulent.
If a model naively predicts “non-fraud” every time, it achieves 99% accuracy yet fails at detecting fraud, which is the actual goal.
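To make this concrete, here is a minimal sketch of that naive baseline on a synthetic dataset; scikit-learn's DummyClassifier stands in for a model that always predicts the majority class:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic dataset: ~99% non-fraud (class 0), ~1% fraud (class 1)
X, y = make_classification(n_samples=10000, n_features=10, weights=[0.99, 0.01], random_state=42)
print(Counter(y))

# A "model" that always predicts the majority class
naive = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = naive.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))    # ~0.99
print("Fraud recall:", recall_score(y, y_pred))  # 0.0, no fraud caught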
When datasets are imbalanced, models become biased toward the majority class, overall accuracy becomes a misleading metric, and the minority class, often the one that matters most, goes largely undetected.
Resampling changes the dataset distribution to balance the classes.
Oversampling with SMOTE (Synthetic Minority Over-sampling Technique) increases the size of the minority class by creating synthetic samples interpolated between existing minority-class neighbors.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Generate an imbalanced dataset: roughly 90% majority, 10% minority
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE synthesizes new minority samples along lines between nearest neighbors
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print("After SMOTE:", Counter(y_res))
Note that in practice resampling should be applied to the training split only; resampling before the train/test split leaks synthetic points into evaluation.
Random undersampling reduces the majority class to match the minority class; it is simple but discards potentially useful majority-class data.
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until both classes are equal in size
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("After Undersampling:", Counter(y_res))
Accuracy is not enough. More informative metrics for imbalanced data include precision, recall, the F1-score, ROC-AUC, and the confusion matrix.
Example in scikit-learn:
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A stratified split preserves the class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Reports precision, recall, and F1-score for each class separately
print(classification_report(y_test, y_pred))
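Building on the same model, here is a quick sketch of threshold-independent evaluation; roc_auc_score uses the predicted probabilities of the positive class rather than hard labels:
from sklearn.metrics import confusion_matrix, roc_auc_score

# Probability scores for the positive (minority) class
y_proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))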
Cost-sensitive learning assigns higher misclassification costs to the minority class, encouraging the model to prioritize correct predictions for that class.
In scikit-learn, this can be applied by adjusting the class_weight parameter:
# 'balanced' weights each class inversely to its frequency in the data
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
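Explicit per-class costs can also be passed as a dictionary; the values below are illustrative, chosen as roughly the inverse of each class's frequency in a 90/10 split:
# Penalize mistakes on the minority class (1) nine times more heavily
model = LogisticRegression(class_weight={0: 1, 1: 9})
model.fit(X_train, y_train)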
Ensemble techniques, such as Random Forest, XGBoost, and imbalanced-learn's BalancedBaggingClassifier, often handle imbalanced datasets well, particularly when combined with class weighting or per-bag resampling.
Example with XGBoost:
import xgboost as xgb

# scale_pos_weight boosts the minority (positive) class; a common heuristic
# is the ratio of negative to positive samples (about 9 for a 90/10 split)
model = xgb.XGBClassifier(scale_pos_weight=10)
model.fit(X_train, y_train)
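And a minimal sketch of the BalancedBaggingClassifier mentioned above, from the imbalanced-learn package; by default it randomly undersamples each bootstrap sample before fitting a decision tree to it:
from imblearn.ensemble import BalancedBaggingClassifier

# Each base estimator is trained on a balanced bootstrap sample
bbc = BalancedBaggingClassifier(random_state=42)
bbc.fit(X_train, y_train)
print(classification_report(y_test, bbc.predict(X_test)))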
Sometimes the solution is better data. Collecting more samples of the minority class or creating better features can improve class representation.
We help teams build accurate, fair, and production-ready machine learning models by solving issues like class imbalance, bias, and low recall.
Class imbalance in machine learning poses a significant challenge in building fair and reliable models. Techniques like SMOTE, undersampling, cost-sensitive learning, and ensemble methods can help address the imbalance. Most importantly, evaluating models with proper metrics such as precision, recall, and F1-score ensures that the minority class is not overlooked.
By carefully handling imbalanced datasets, businesses can make more accurate predictions in critical areas, such as fraud detection, medical diagnosis, and security monitoring.