zhaopinboai.com

Essential Python Libraries for Validating Machine Learning Models

Chapter 1: Introduction to Model Validation

As data scientists, we build machine learning models to interpret data and solve business problems. Whether the model is simple or complex, validation is the step that tells us whether our work can be trusted. That means checking the data, the methodology, and the model's performance metrics. Many validation techniques exist; this article highlights three Python packages that streamline the process.

Chapter 2: Top 3 Python Packages for Machine Learning Validation

Section 2.1: Evidently

Evidently is an open-source Python library for analyzing and monitoring machine learning models. It produces dashboards that make model performance and data drift easy to inspect. Although it is designed mainly for production environments, it is also useful during development.

To validate a model with Evidently, we typically need two datasets: a reference dataset and a current (production) dataset. For this demonstration, we use a churn dataset from Kaggle that is already split into training and test files, treating the training file as the reference.

First, install the Evidently package:

pip install evidently

Once installed, we load the reference and current datasets so that we can check for data drift, a situation where the current data differs statistically from the reference data:

import pandas as pd

train = pd.read_csv('churn-bigml-80.csv')

test = pd.read_csv('churn-bigml-20.csv')

Next, we will preprocess the dataset by retaining only the numerical data:

train.drop(['State', 'International plan', 'Voice mail plan'], axis=1, inplace=True)

test.drop(['State', 'International plan', 'Voice mail plan'], axis=1, inplace=True)

train['Churn'] = train['Churn'].apply(lambda x: 1 if x else 0)

test['Churn'] = test['Churn'].apply(lambda x: 1 if x else 0)
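
As a minimal sketch of the same preprocessing idea, here is a small made-up frame (the rows below are hypothetical, not taken from the churn files). It assumes the target column is boolean, and uses `select_dtypes` to drop non-numeric columns without listing them by name:

```python
import pandas as pd

# Hypothetical miniature of the churn data; the values are made up for illustration.
toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "International plan": ["No", "Yes", "No"],
    "Total day minutes": [265.1, 161.6, 243.4],
    "Churn": [False, True, False],
})

def keep_numeric(df: pd.DataFrame, target: str = "Churn") -> pd.DataFrame:
    """Return a copy with the boolean target encoded as 0/1 and only numeric columns kept."""
    out = df.copy()
    out[target] = out[target].astype(int)  # False -> 0, True -> 1
    return out.select_dtypes(include="number")

toy_numeric = keep_numeric(toy)
print(toy_numeric.dtypes)
```

Working on a copy leaves the original frame untouched, which helps when the raw columns are still needed later for inspection.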

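Once the data is prepared, we can build a dashboard to identify drift. The code below uses Evidently's legacy Dashboard API from early releases; newer releases replace it with a Report-based API, so it may require an older version of the package. The reports directory should exist before saving: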

from evidently.dashboard import Dashboard

from evidently.tabs import DataDriftTab

data_drift_report = Dashboard(tabs=[DataDriftTab()])

data_drift_report.calculate(train, test, column_mapping=None)

data_drift_report.save("reports/my_report.html")

You can view the dashboard report in a web browser. The dashboard illustrates the distribution of each feature and statistical tests for data drift. In this example, there was no significant drift detected between the training and testing datasets.
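
To make the idea of a per-feature statistical drift test concrete, here is a minimal sketch on synthetic data. The two-sample Kolmogorov-Smirnov test and the 0.05 threshold are assumptions chosen for illustration; they are not necessarily what Evidently applies to each feature, and the `drift_table` helper is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def drift_table(reference: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Run a two-sample KS test per shared column and flag p-values below alpha."""
    rows = []
    shared = [col for col in reference.columns if col in current.columns]
    for col in shared:
        result = ks_2samp(reference[col].dropna(), current[col].dropna())
        rows.append({"feature": col, "p_value": result.pvalue, "drift": result.pvalue < alpha})
    return pd.DataFrame(rows).set_index("feature")

# Synthetic stand-ins for reference and current data (not the churn files).
rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "stable": rng.normal(0.0, 1.0, 500),
    "shifted": rng.normal(0.0, 1.0, 500),
})
current = pd.DataFrame({
    "stable": reference["stable"].to_numpy(),   # identical values, so no drift
    "shifted": rng.normal(5.0, 1.0, 500),       # mean moved by five standard deviations
})

report = drift_table(reference, current)
print(report)
```

Only the `shifted` column should be flagged here. With real data the picture is rarely this clean, so the threshold and the choice of test deserve some thought.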

Evidently can also be used to develop a classification dashboard to monitor the health of machine learning models. For instance, let’s train a classification model with the data:

from sklearn.neighbors import KNeighborsClassifier

X_train = train.drop('Churn', axis=1)

X_test = test.drop('Churn', axis=1)

y_train = train['Churn']

y_test = test['Churn']

model = KNeighborsClassifier(n_neighbors=5)

model.fit(X_train, y_train)

train_predictions = model.predict(X_train)

test_predictions = model.predict(X_test)

X_train['target'] = y_train

X_train['prediction'] = train_predictions

X_test['target'] = y_test

X_test['prediction'] = test_predictions

To monitor the model’s performance, we need to map the columns used:

from evidently.pipeline.column_mapping import ColumnMapping

churn_column_mapping = ColumnMapping()

churn_column_mapping.target = 'target'

churn_column_mapping.prediction = 'prediction'

churn_column_mapping.numerical_features = train.drop('Churn', axis=1).columns.tolist()

Finally, we set up the classifier monitoring dashboard:

from evidently.tabs.base_tab import Verbose

from evidently.tabs import ClassificationPerformanceTab

churn_model_performance_dashboard = Dashboard(tabs=[ClassificationPerformanceTab(verbose_level=Verbose.FULL)])

churn_model_performance_dashboard.calculate(X_train, X_test, column_mapping=churn_column_mapping)

churn_model_performance_dashboard.save("reports/classification_churn.html")

The monitoring dashboard provides insights into the model's performance, helping to identify discrepancies when new data is introduced. For more details on the available dashboards, refer to the documentation.
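
The kind of comparison such a dashboard summarizes can be sketched by hand with scikit-learn. The example below uses synthetic data as a stand-in for the churn set, and the `summarize` helper is hypothetical. It illustrates comparing reference and current performance and is not a reproduction of what Evidently computes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the churn dataset.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

def summarize(model, features, labels):
    """Compute accuracy and F1 for one dataset."""
    predictions = model.predict(features)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }

metrics = {
    "reference": summarize(knn, X_tr, y_tr),
    "current": summarize(knn, X_te, y_te),
}
# A large positive gap suggests the model does worse on the newer data.
gaps = {name: metrics["reference"][name] - metrics["current"][name] for name in ("accuracy", "f1")}
print(metrics)
print(gaps)
```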

Section 2.2: Deepchecks

Deepchecks is another Python library that facilitates validating machine learning models with minimal code. It offers various APIs for detecting data drift, label drift, train-test comparisons, model evaluations, and more. Deepchecks is particularly useful during the research phase and prior to deploying your model.
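
To illustrate what label drift means, the sketch below compares class proportions between two label columns with plain pandas. The `label_distribution_gap` helper and the toy labels are hypothetical, and Deepchecks uses its own drift measures rather than this simple gap.

```python
import pandas as pd

def label_distribution_gap(train_labels: pd.Series, test_labels: pd.Series) -> pd.DataFrame:
    """Compare class proportions between two label columns."""
    train_share = train_labels.value_counts(normalize=True)
    test_share = test_labels.value_counts(normalize=True)
    # Classes missing from one side get a share of 0.
    table = pd.concat([train_share, test_share], axis=1, keys=["train", "test"]).fillna(0.0)
    table["abs_gap"] = (table["train"] - table["test"]).abs()
    return table.sort_index()

# Toy labels: 20% positives in train, 50% positives in test.
train_labels = pd.Series([0] * 80 + [1] * 20)
test_labels = pd.Series([0] * 50 + [1] * 50)

gap = label_distribution_gap(train_labels, test_labels)
print(gap)
```

A gap this large between splits would usually be worth investigating before trusting any test-set metric.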

Deepchecks can generate a complete dataset and model performance report through its full_suite function. The steps below walk through it.

First, install Deepchecks:

pip install deepchecks

Next, prepare the training dataset and machine learning model. In our example, we will use the well-known Iris dataset:

from deepchecks.datasets.classification import iris

# Load the Iris data already split into training and test DataFrames

df_train, df_test = iris.load_data(data_format='Dataframe', as_train_test=True)

label_col = "target"

# Load a model that Deepchecks ships pre-fitted on this data

rf_clf = iris.load_fitted_model()

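Deepchecks works best when Pandas DataFrames are wrapped in its own Dataset objects. The import paths here follow an early Deepchecks release; later releases moved the tabular classes under deepchecks.tabular: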

from deepchecks import Dataset

ds_train = Dataset(df_train, label=label_col, cat_features=[])

ds_test = Dataset(df_test, label=label_col, cat_features=[])

Now, run the full suite report:

from deepchecks.suites import full_suite

suite = full_suite()

suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)

The resulting report covers checks such as the Confusion Matrix, Model Comparison, and Data Drift, all generated with a single command.
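
One of the train-test comparisons mentioned earlier is whether the same samples appear in both splits. A minimal sketch of that idea with plain pandas follows. The `shared_rows` helper and the toy frames are hypothetical, and this is not how Deepchecks implements its check.

```python
import pandas as pd

def shared_rows(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Return the distinct rows present in both frames (every shared column must match)."""
    return train_df.drop_duplicates().merge(test_df.drop_duplicates(), how="inner")

# Toy frames with one row deliberately present in both splits.
train_df = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.2], "target": [0, 0, 2]})
test_df = pd.DataFrame({"sepal_length": [6.2, 5.8], "target": [2, 1]})

overlap = shared_rows(train_df, test_df)
leak_ratio = len(overlap) / len(test_df.drop_duplicates())
print(overlap)
print(f"{leak_ratio:.0%} of distinct test rows also appear in train")
```

Rows shared between splits inflate test metrics, because the model has already seen those exact samples during training.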

Section 2.3: TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library from the TensorFlow team for addressing data quality issues. It computes descriptive statistics, infers a schema from them, and identifies anomalies in incoming data.

To start, install TFDV:

pip install tensorflow-data-validation

We can import the package and compute a statistics object from our CSV data:

import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv(data_location='churn-bigml-80.csv')

tfdv.visualize_statistics(stats)

TFDV not only visualizes statistics but can also detect changes in incoming data. To do so, we infer a schema from the reference statistics and then validate the statistics of the new data against it:

schema = tfdv.infer_schema(stats)

tfdv.display_schema(schema)

new_csv_stats = tfdv.generate_statistics_from_csv(data_location='churn-bigml-20.csv')

anomalies = tfdv.validate_statistics(statistics=new_csv_stats, schema=schema)

tfdv.display_anomalies(anomalies)

In this case, no anomalies were detected, indicating that the incoming data conforms to the schema inferred from the reference data.
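
The infer-then-validate workflow can be illustrated with a much simpler hand-rolled version in pandas. Everything in this sketch is hypothetical: the helper names, the toy columns, and the two rules it checks. TFDV's schema and anomaly detection are far richer than this.

```python
import pandas as pd

def infer_simple_schema(df: pd.DataFrame) -> dict:
    """Record whether each column is numeric; for other columns, keep the set of seen values."""
    schema = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            schema[col] = {"kind": "numeric"}
        else:
            schema[col] = {"kind": "categorical", "domain": set(df[col].dropna().unique())}
    return schema

def find_anomalies(df: pd.DataFrame, schema: dict) -> list:
    """List columns that are missing, unexpected, or hold values outside the schema."""
    problems = []
    for col, spec in schema.items():
        if col not in df.columns:
            problems.append(f"{col}: column missing")
        elif spec["kind"] == "numeric":
            if not pd.api.types.is_numeric_dtype(df[col]):
                problems.append(f"{col}: expected numeric values")
        else:
            unexpected = set(df[col].dropna().unique()) - spec["domain"]
            if unexpected:
                problems.append(f"{col}: unexpected values {sorted(unexpected)}")
    problems.extend(f"{col}: column not in schema" for col in df.columns if col not in schema)
    return problems

# Toy reference and incoming data; "trial" never appears in the reference.
reference = pd.DataFrame({"plan": ["basic", "premium", "basic"], "minutes": [120.5, 80.0, 95.2]})
incoming = pd.DataFrame({"plan": ["basic", "trial"], "minutes": [60.0, 70.5]})

schema = infer_simple_schema(reference)
anomalies = find_anomalies(incoming, schema)
print(anomalies)
```

Validating the reference data against its own schema returns an empty list, which mirrors the no-anomalies outcome described above.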

Chapter 3: Conclusion

Machine learning model projects are ongoing endeavors requiring constant monitoring and validation. To assist in this process, we can leverage the following Python packages:

  • Evidently
  • Deepchecks
  • TensorFlow Data Validation

I hope you find this guide helpful! For further insights into data science or to follow my journey as a data scientist, consider connecting with me on LinkedIn or Twitter. If you appreciate this content and seek deeper knowledge, please subscribe to my newsletter.
