Essential Python Libraries for Validating Machine Learning Models
Chapter 1: Introduction to Model Validation
As data scientists, much of our work involves building machine learning models to interpret data and solve business problems. Whether the model is simple or complex, validation is a critical step: we need to check everything from the input data to the methodology and the model's performance metrics. Numerous techniques exist for validating machine learning models; this article highlights three Python packages that streamline the process.
Chapter 2: Top 3 Python Packages for Machine Learning Validation
Section 2.1: Evidently
Evidently is an open-source Python library designed for analyzing and monitoring machine learning models. This tool focuses on creating an easily observable dashboard for monitoring model performance and detecting data drift. Although it is ideally suited for production environments, it can also be utilized during the development phase.
To validate our machine learning model development with Evidently, we typically work with two datasets: a reference dataset and the current (production) data. For demonstration purposes, we can use the Telecom Churn dataset from Kaggle, which already comes split into 80/20 train and test files.
First, install the Evidently package:
pip install evidently
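A note on versions: the snippets in this section use Evidently's older Dashboard API (the evidently.dashboard and evidently.tabs modules). Newer releases replaced it with a Report-based API, so if the imports below fail, pinning the pre-0.2 series should keep the code working:
pip install "evidently<0.2"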
Once installed, we can check for data drift, a situation where the current data statistically differs from the reference data:
import pandas as pd
train = pd.read_csv('churn-bigml-80.csv')
test = pd.read_csv('churn-bigml-20.csv')
Next, we will preprocess the dataset by retaining only the numerical data:
train.drop(['State', 'International plan', 'Voice mail plan'], axis=1, inplace=True)
test.drop(['State', 'International plan', 'Voice mail plan'], axis=1, inplace=True)
train['Churn'] = train['Churn'].apply(lambda x: 1 if x else 0)
test['Churn'] = test['Churn'].apply(lambda x: 1 if x else 0)
Once the data is prepared, we can create our dashboard to identify any drift using Evidently:
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab
data_drift_report = Dashboard(tabs=[DataDriftTab()])
data_drift_report.calculate(train, test, column_mapping=None)
data_drift_report.save("reports/my_report.html")
You can view the dashboard report in a web browser. The dashboard illustrates the distribution of each feature and statistical tests for data drift. In this example, there was no significant drift detected between the training and testing datasets.
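If the save call complains that the reports folder does not exist, you can create it first and open the generated report straight from Python using only the standard library (the paths here match the example above):
import os
import webbrowser
os.makedirs("reports", exist_ok=True)  # make sure the output folder exists before saving
data_drift_report.save("reports/my_report.html")
webbrowser.open("file://" + os.path.abspath("reports/my_report.html"))  # open the dashboard in the default browser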
Evidently can also be used to develop a classification dashboard to monitor the health of machine learning models. For instance, let’s train a classification model with the data:
from sklearn.neighbors import KNeighborsClassifier
X_train = train.drop('Churn', axis=1)
X_test = test.drop('Churn', axis=1)
y_train = train['Churn']
y_test = test['Churn']
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
X_train['target'] = y_train
X_train['prediction'] = train_predictions
X_test['target'] = y_test
X_test['prediction'] = test_predictions
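Before wiring up the Evidently dashboard, it can help to sanity-check the classifier with plain scikit-learn metrics. This quick check is not part of Evidently; it simply confirms that the predictions computed above look reasonable:
from sklearn.metrics import accuracy_score, f1_score
# Quick sanity check on the predictions before monitoring them with Evidently
print("Train accuracy:", accuracy_score(y_train, train_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))
print("Test F1 score:", f1_score(y_test, test_predictions))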
To monitor the model’s performance, we need to map the columns used:
from evidently.pipeline.column_mapping import ColumnMapping
churn_column_mapping = ColumnMapping()
churn_column_mapping.target = 'target'
churn_column_mapping.prediction = 'prediction'
churn_column_mapping.numerical_features = train.drop('Churn', axis=1).columns.tolist()
Finally, we set up the classifier monitoring dashboard:
from evidently.tabs.base_tab import Verbose
from evidently.tabs import ClassificationPerformanceTab
churn_model_performance_dashboard = Dashboard(tabs=[ClassificationPerformanceTab(verbose_level=Verbose.FULL)])
churn_model_performance_dashboard.calculate(X_train, X_test, column_mapping=churn_column_mapping)
churn_model_performance_dashboard.save("reports/classification_churn.html")
The monitoring dashboard provides insights into the model's performance, helping to identify discrepancies when new data is introduced. For more details on the available dashboards, refer to the documentation.
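Since a Dashboard is simply a collection of tabs, the drift and classification checks can also be combined into a single report. A minimal sketch, assuming the same Evidently version and reusing the objects created above:
# Combine data drift and classification performance in one dashboard
combined_dashboard = Dashboard(tabs=[DataDriftTab(), ClassificationPerformanceTab()])
combined_dashboard.calculate(X_train, X_test, column_mapping=churn_column_mapping)
combined_dashboard.save("reports/drift_and_performance.html")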
Section 2.2: Deepchecks
Deepchecks is another Python library that facilitates validating machine learning models with minimal code. It offers various APIs for detecting data drift, label drift, train-test comparisons, model evaluations, and more. Deepchecks is particularly useful during the research phase and prior to deploying your model.
To generate a complete dataset and model performance report with Deepchecks, we can use its full_suite, which bundles all of the standard checks.
First, install Deepchecks:
pip install deepchecks
Next, prepare the training dataset and machine learning model. In our example, we will use the well-known Iris dataset:
import pandas as pd
from deepchecks.datasets.classification import iris
from sklearn.ensemble import RandomForestClassifier
# Load Data
iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
df_train, df_test = iris.load_data(data_format='Dataframe', as_train_test=True)
label_col = "target"
rf_clf = iris.load_fitted_model()
Deepchecks works best when pandas DataFrames are wrapped in its Dataset objects:
from deepchecks import Dataset
ds_train = Dataset(df_train, label=label_col, cat_features=[])
ds_test = Dataset(df_test, label=label_col, cat_features=[])
Now, run the full suite report:
from deepchecks.suites import full_suite
suite = full_suite()
suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)
The comprehensive report includes metrics like the Confusion Matrix, Model Comparison, Data Drift, and more, all generated with a single command.
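The suite.run call also returns a result object, which is handy when running outside a notebook. Assuming a recent Deepchecks release that provides save_as_html, the report can be exported like this (the filename is just an example):
# Capture the result of the run and export the interactive report to disk
suite_result = suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)
suite_result.save_as_html("full_suite_report.html")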
Section 2.3: TensorFlow Data Validation
TensorFlow Data Validation (TFDV) is a library from the TensorFlow team for tackling data quality issues. It computes descriptive statistics, infers a schema, and detects anomalies in incoming data.
To start, install TFDV:
pip install tensorflow-data-validation
We can import the package and create a statistical object from our CSV data:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(data_location='churn-bigml-80.csv')
tfdv.visualize_statistics(stats)
TFDV not only generates statistical visualizations; it is also effective at detecting changes in incoming data. To validate new data, we first infer a schema from the reference statistics and then check the new statistics against it:
schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)
new_csv_stats = tfdv.generate_statistics_from_csv(data_location='churn-bigml-20.csv')
anomalies = tfdv.validate_statistics(statistics=new_csv_stats, schema=schema)
tfdv.display_anomalies(anomalies)
In this case, no anomalies were detected, indicating that the incoming data closely resembles the reference data.
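Two optional follow-ups round this out: TFDV can render the reference and incoming statistics side by side, and the inferred schema can be saved so that future batches are validated against the same reference. A short sketch, assuming these TFDV utilities are available in the installed version:
# Compare the reference (train) and incoming (test) statistics in one view
tfdv.visualize_statistics(
    lhs_statistics=stats,
    rhs_statistics=new_csv_stats,
    lhs_name='TRAIN (reference)',
    rhs_name='TEST (incoming)'
)
# Persist the schema so later batches can be checked against the same reference
tfdv.write_schema_text(schema, 'churn_schema.pbtxt')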
Chapter 3: Conclusion
Machine learning model projects are ongoing endeavors requiring constant monitoring and validation. To assist in this process, we can leverage the following Python packages:
- Evidently
- Deepchecks
- TensorFlow Data Validation
I hope you find this guide helpful! For further insights into data science or to follow my journey as a data scientist, consider connecting with me on LinkedIn or Twitter. If you appreciate this content and seek deeper knowledge, please subscribe to my newsletter.