
A Comprehensive Overview of Text Cleaning Techniques in Python


Introduction to Natural Language Processing

Text is categorized as unstructured data. According to Wikipedia, unstructured data refers to "information that either does not have a pre-defined data model or is not organized in a pre-defined manner."

Machines, unlike humans, lack the ability to interpret raw text intuitively. When handling textual data, we cannot directly input raw text into a machine learning model. We must first clean the text and then convert it into a machine-readable format.

In this article, we will explore various methods for text cleaning. Further techniques for text encoding will be discussed in a subsequent post.

Normalizing Case

In writing, capitalizing words serves specific purposes, such as starting a sentence or identifying proper nouns.

While humans can discern that "The" at the beginning of a sentence is the same as "the" used later, machines interpret these as distinct tokens. Therefore, normalizing the case ensures that words are treated uniformly, preventing duplication in token processing.

    # Python Example
    text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"

    # Lowercasing the text
    text = text.lower()
    print(text)

Output:

    the uk lockdown restrictions will be dropped in the summer so we can go partying again!
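To make the duplication concrete, the sketch below counts tokens before and after lowercasing. The sample sentence and the naive whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical sample sentence, chosen so that "The" and "the" both appear.
sentence = "The cat chased the dog and the dog chased The cat"

# Naive whitespace tokenization (an assumption; real pipelines often use a tokenizer).
raw_counts = Counter(sentence.split())
lower_counts = Counter(sentence.lower().split())

print(raw_counts["The"], raw_counts["the"])  # 2 2 -> counted as two separate tokens
print(lower_counts["the"])                   # 4   -> merged into a single token
print(len(raw_counts), len(lower_counts))    # 6 5 -> vocabulary shrinks by one
```

A smaller vocabulary means fewer distinct features for whatever encoding step comes next.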

Eliminating Stopwords

In numerous natural language tasks, it is crucial for machine learning models to recognize the key words that contribute meaning to a document. For instance, in sentiment analysis, identifying words that influence sentiment is essential.

In English, as in most languages, very common words such as "the", "is", and "in" carry little meaning on their own. Removing these stopwords can therefore help a model focus on the words that matter.

> Note: Be cautious; removing stopwords may not always be beneficial!

    # Importing necessary libraries
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")

    stop_words = set(stopwords.words("english"))
    print(stop_words)

Example output (truncated; the order of a set varies between runs):

    {'over', 'is', 'than', 'can', 'these', ...}

    # Example text
    text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"

    # Removing stopwords (lowercase first, because the stopword list is lowercase)
    text = " ".join([word for word in text.lower().split() if word not in stop_words])
    print(text)

Output:

    uk lockdown restrictions dropped summer go partying again!

Note that "again!" survives: with the punctuation still attached, it does not match the stopword "again".
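The caution in the note above can be illustrated with negation. The sketch below uses a tiny hand-picked stopword set rather than NLTK's list (an assumption, so the example runs without downloads); NLTK's English list also contains "not". The helper name `remove_stopwords` is hypothetical.

```python
# A tiny hand-picked stopword set (hypothetical; NLTK's English list is much longer
# and also contains negations such as "not").
toy_stop_words = {"the", "was", "is", "not", "this", "a"}

def remove_stopwords(text, stop_words):
    """Lowercase the text and drop every token found in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

positive = "The film was good"
negative = "The film was not good"

print(remove_stopwords(positive, toy_stop_words))  # film good
print(remove_stopwords(negative, toy_stop_words))  # film good

# One option: keep negation words by subtracting them from the stopword set.
keep = {"not"}
print(remove_stopwords(negative, toy_stop_words - keep))  # film not good
```

Both reviews collapse to the same string once "not" is removed, which is exactly the kind of signal a sentiment model needs. Whether to keep such words depends on the task.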

Removing Unicode Characters

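Unicode is a universal character encoding standard that allows characters from virtually every language to be represented. Characters outside the ASCII range, such as invisible zero-width characters picked up from web pages, cannot be represented in ASCII and often act as noise in text data.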

> Note: Example code adapted from Python Guides.

    # Creating a Unicode string (\u200c is a zero-width non-joiner)
    text_unicode = "Python is easy \u200c to learn"

    # Encoding to ASCII, ignoring characters that cannot be represented
    text_encode = text_unicode.encode(encoding="ascii", errors="ignore")

    # Decoding the bytes back to a string
    text_decode = text_encode.decode()

    # Cleaning the text to remove extra whitespace
    clean_text = " ".join(text_decode.split())
    print(clean_text)

Output:

    Python is easy to learn
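One side effect of encoding straight to ASCII is that accented letters are dropped entirely. A possible variation, not part of the adapted example above, is to decompose characters with the standard library's `unicodedata.normalize` first so the base letters survive. The helper name `to_ascii` and the sample text are assumptions for illustration.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD), then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

# Hypothetical sample text: "Café déjà vu", written with escapes for clarity.
sample = "Caf\u00e9 d\u00e9j\u00e0 vu"

# Plain ASCII encoding drops the accented letters entirely.
print(sample.encode("ascii", errors="ignore").decode("ascii"))  # Caf dj vu

# Normalizing first keeps the base letters.
print(to_ascii(sample))  # Cafe deja vu
```

This only helps for characters that decompose into an ASCII letter plus combining marks; other scripts are still removed, so it is a poor fit for multilingual data.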

Stripping URLs, Hashtags, and Punctuation

Depending on the data source, we may encounter various forms of noise, such as hashtags and mentions on social media platforms. If these elements do not aid our analysis, it's better to remove them. Regular expressions (Regex) can assist in identifying these patterns.

    import re

    # Removing mentions
    text = "You should get @BlockFiZac from @BlockFi to talk about bitcoin lending."
    text = re.sub(r"@\S+", "", text)

    # Collapsing the double spaces left behind by the removed mentions
    text = " ".join(text.split())
    print(text)

Output:

    You should get from to talk about bitcoin lending.

Dollar signs, URLs, hashtags, and punctuation can be removed with similar patterns.
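As a rough sketch of those other cases, the snippet below strips URLs, mentions, hashtags, dollar signs, and punctuation from a made-up post. The patterns are deliberately simple assumptions (for example, a URL is anything starting with `http://` or `https://` up to the next whitespace) and will need tuning for real data.

```python
import re
import string

# Hypothetical sample post; the handle, hashtag, ticker and URL are made up.
text = "Wow, loving the new rates from @SomeBank! $BTC to the moon... #crypto https://example.com/rates?id=1"

text = re.sub(r"https?://\S+", "", text)  # URLs (before punctuation is stripped)
text = re.sub(r"@\S+", "", text)          # mentions
text = re.sub(r"#\S+", "", text)          # hashtags, including the tag word
text = re.sub(r"\$", "", text)            # dollar signs only; the ticker text stays

# Remaining ASCII punctuation, using the set from the string module
text = text.translate(str.maketrans("", "", string.punctuation))

text = " ".join(text.split())             # collapse leftover whitespace
print(text)  # Wow loving the new rates from BTC to the moon
```

Order matters here: URLs, mentions, and hashtags are removed before punctuation, because stripping `:`, `/`, `@`, and `#` first would leave their fragments behind as ordinary words.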

Stemming and Lemmatization

In NLP tasks, it may be necessary for the computer to recognize that variations of a word, such as "walked," "walk," and "walking," are forms of the same base word. Stemming and lemmatization are techniques that help achieve this normalization.

Definitions from Wikipedia:

- Stemming: The process of reducing inflected words to their root form.
- Lemmatization: The grouping of inflected forms of a word so they can be analyzed as a single item.

    import nltk
    from nltk.stem.porter import PorterStemmer
    from nltk.stem import WordNetLemmatizer

    # WordNet data is required by the lemmatizer
    nltk.download("wordnet")

    words = ["walk", "walking", "walked", "walks", "ran", "run", "running", "runs"]

    # Example of stemming
    stemmer = PorterStemmer()
    for word in words:
        print(word + " ---> " + stemmer.stem(word))

    # Example of lemmatization
    lemmatizer = WordNetLemmatizer()
    for word in words:
        print(word + " ---> " + lemmatizer.lemmatize(word))

Conclusion

This tutorial has covered methods for cleaning text in Python. Key points discussed include:

- The importance of cleaning text
- Various techniques for text cleaning
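For reference, the individual steps can be chained into a single function. This is a minimal sketch under stated assumptions, not a definitive pipeline: the function name `clean_text` is hypothetical, the stopword set is a small hand-written stand-in for NLTK's list, and stemming or lemmatization is left out to keep it dependency-free.

```python
import re
import string

# Small illustrative stopword set (an assumption; swap in NLTK's list in practice).
STOP_WORDS = {"the", "will", "be", "in", "so", "we", "can", "to", "from", "a", "is"}

def clean_text(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps from this article in sequence."""
    text = text.lower()                                    # normalize case
    text = text.encode("ascii", errors="ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"https?://\S+|@\S+|#\S+", "", text)     # URLs, mentions, hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [w for w in text.split() if w not in stop_words]         # stopwords
    return " ".join(tokens)

raw = "The UK lockdown restrictions will be dropped in the summer so we can go partying again! #freedom @BBCNews"
print(clean_text(raw))  # uk lockdown restrictions dropped summer go partying again
```

Which steps to include, and in what order, depends on the downstream task, so each one is best treated as optional.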

Thank you for reading! Connect with me on LinkedIn and Twitter for updates on Data Science, Artificial Intelligence, and Freelancing.
