Essential Python Libraries for Validating Machine Learning Models
Chapter 1: Introduction to Model Validation
As data scientists, much of our work involves building machine learning models to interpret data and solve business problems. Whether the model is simple or complex, validation is a critical step: we need to check everything from the input data to the methodology and the model's performance metrics. Numerous techniques exist for validating machine learning models; this article highlights three Python packages that streamline the process.
Chapter 2: Top 3 Python Packages for Machine Learning Validation
Section 2.1: Evidently
Evidently is an open-source Python library designed for analyzing and monitoring machine learning models. This tool focuses on creating an easily observable dashboard for monitoring model performance and detecting data drift. Although it is ideally suited for production environments, it can also be utilized during the development phase.
To validate our machine learning model development with Evidently, we typically work with two datasets: a reference dataset and the current (production) data. For demonstration purposes, we can use the Telecom Churn dataset from Kaggle, which already comes split into 80/20 train and test files.
First, install the Evidently package:
pip install evidently
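A note on versions: the snippets in this section use Evidently's older Dashboard API (the evidently.dashboard and evidently.tabs modules). Newer releases replaced it with a Report-based API, so if the imports below fail, pinning the pre-0.2 series should keep the code working:
pip install "evidently<0.2"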
Once installed, we can check for data drift, a situation where the current data statistically differs from the reference data:
import pandas as pd
train = pd.read_csv('churn-bigml-80.csv')
test = pd.read_csv('churn-bigml-20.csv')
Next, we will preprocess the dataset by retaining only the numerical data:
train.drop(['State', 'International plan', 'Voice mail plan'], axis=1, inplace=True)
test.drop(['State', 'International plan', 'Voice mail plan'], axis=1, inplace=True)
train['Churn'] = train['Churn'].apply(lambda x: 1 if x else 0)
test['Churn'] = test['Churn'].apply(lambda x: 1 if x else 0)
Once the data is prepared, we can create our dashboard to identify any drift using Evidently:
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab
data_drift_report = Dashboard(tabs=[DataDriftTab()])
data_drift_report.calculate(train, test, column_mapping=None)
data_drift_report.save("reports/my_report.html")
You can view the dashboard report in a web browser. The dashboard illustrates the distribution of each feature and statistical tests for data drift. In this example, there was no significant drift detected between the training and testing datasets.
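If the save call complains that the reports folder does not exist, you can create it first and open the generated report straight from Python using only the standard library (the paths here match the example above):
import os
import webbrowser
os.makedirs("reports", exist_ok=True)  # make sure the output folder exists before saving
data_drift_report.save("reports/my_report.html")
webbrowser.open("file://" + os.path.abspath("reports/my_report.html"))  # open the dashboard in the default browser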
Evidently can also be used to develop a classification dashboard to monitor the health of machine learning models. For instance, let’s train a classification model with the data:
from sklearn.neighbors import KNeighborsClassifier
X_train = train.drop('Churn', axis=1)
X_test = test.drop('Churn', axis=1)
y_train = train['Churn']
y_test = test['Churn']
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
X_train['target'] = y_train
X_train['prediction'] = train_predictions
X_test['target'] = y_test
X_test['prediction'] = test_predictions
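Before wiring up the Evidently dashboard, it can help to sanity-check the classifier with plain scikit-learn metrics. This quick check is not part of Evidently; it simply confirms that the predictions computed above look reasonable:
from sklearn.metrics import accuracy_score, f1_score
# Quick sanity check on the predictions before monitoring them with Evidently
print("Train accuracy:", accuracy_score(y_train, train_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))
print("Test F1 score:", f1_score(y_test, test_predictions))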
To monitor the model’s performance, we need to map the columns used:
from evidently.pipeline.column_mapping import ColumnMapping
churn_column_mapping = ColumnMapping()
churn_column_mapping.target = 'target'
churn_column_mapping.prediction = 'prediction'
churn_column_mapping.numerical_features = train.drop('Churn', axis=1).columns.tolist()
Finally, we set up the classifier monitoring dashboard:
from evidently.tabs.base_tab import Verbose
from evidently.tabs import ClassificationPerformanceTab
churn_model_performance_dashboard = Dashboard(tabs=[ClassificationPerformanceTab(verbose_level=Verbose.FULL)])
churn_model_performance_dashboard.calculate(X_train, X_test, column_mapping=churn_column_mapping)
churn_model_performance_dashboard.save("reports/classification_churn.html")
The monitoring dashboard provides insights into the model's performance, helping to identify discrepancies when new data is introduced. For more details on the available dashboards, refer to the documentation.
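Since a Dashboard is simply a collection of tabs, the drift and classification checks can also be combined into a single report. A minimal sketch, assuming the same Evidently version and reusing the objects created above:
# Combine data drift and classification performance in one dashboard
combined_dashboard = Dashboard(tabs=[DataDriftTab(), ClassificationPerformanceTab()])
combined_dashboard.calculate(X_train, X_test, column_mapping=churn_column_mapping)
combined_dashboard.save("reports/drift_and_performance.html")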
Section 2.2: Deepchecks
Deepchecks is another Python library that facilitates validating machine learning models with minimal code. It offers various APIs for detecting data drift, label drift, train-test comparisons, model evaluations, and more. Deepchecks is particularly useful during the research phase and prior to deploying your model.
To generate a complete dataset and model performance report with Deepchecks, we can use its full_suite, which bundles all of the standard checks.
First, install Deepchecks:
pip install deepchecks
Next, prepare the training dataset and machine learning model. In our example, we will use the well-known Iris dataset:
import pandas as pd
from deepchecks.datasets.classification import iris
from sklearn.ensemble import RandomForestClassifier
# Load Data
iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
df_train, df_test = iris.load_data(data_format='Dataframe', as_train_test=True)
label_col = "target"
rf_clf = iris.load_fitted_model()
Deepchecks works best when pandas DataFrames are wrapped in its Dataset objects:
from deepchecks import Dataset
ds_train = Dataset(df_train, label=label_col, cat_features=[])
ds_test = Dataset(df_test, label=label_col, cat_features=[])
Now, run the full suite report:
from deepchecks.suites import full_suite
suite = full_suite()
suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)
The comprehensive report includes metrics like the Confusion Matrix, Model Comparison, Data Drift, and more, all generated with a single command.
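The suite.run call also returns a result object, which is handy when running outside a notebook. Assuming a recent Deepchecks release that provides save_as_html, the report can be exported like this (the filename is just an example):
# Capture the result of the run and export the interactive report to disk
suite_result = suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)
suite_result.save_as_html("full_suite_report.html")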
Section 2.3: TensorFlow Data Validation
TensorFlow Data Validation (TFDV) is a library from the TensorFlow team for tackling data quality issues. It computes descriptive statistics, infers a schema, and detects anomalies in incoming data.
To start, install TFDV:
pip install tensorflow-data-validation
We can import the package and create a statistical object from our CSV data:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(data_location='churn-bigml-80.csv')
tfdv.visualize_statistics(stats)
TFDV not only generates statistical visualizations; it is also effective at detecting changes in incoming data. To validate new data, we first infer a schema from the reference statistics and then check the new statistics against it:
schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)
new_csv_stats = tfdv.generate_statistics_from_csv(data_location='churn-bigml-20.csv')
anomalies = tfdv.validate_statistics(statistics=new_csv_stats, schema=schema)
tfdv.display_anomalies(anomalies)
In this case, no anomalies were detected, indicating that the incoming data closely resembles the reference data.
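Two optional follow-ups round this out: TFDV can render the reference and incoming statistics side by side, and the inferred schema can be saved so that future batches are validated against the same reference. A short sketch, assuming these TFDV utilities are available in the installed version:
# Compare the reference (train) and incoming (test) statistics in one view
tfdv.visualize_statistics(
    lhs_statistics=stats,
    rhs_statistics=new_csv_stats,
    lhs_name='TRAIN (reference)',
    rhs_name='TEST (incoming)'
)
# Persist the schema so later batches can be checked against the same reference
tfdv.write_schema_text(schema, 'churn_schema.pbtxt')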
Chapter 3: Conclusion
Machine learning model projects are ongoing endeavors requiring constant monitoring and validation. To assist in this process, we can leverage the following Python packages:
- Evidently
- Deepchecks
- TensorFlow Data Validation
I hope you find this guide helpful! For further insights into data science or to follow my journey as a data scientist, consider connecting with me on LinkedIn or Twitter. If you appreciate this content and seek deeper knowledge, please subscribe to my newsletter.