How to do Text Preprocessing using Python NLTK

Introduction

NLTK (Natural Language Toolkit) is the most popular and widely used Python library for Natural Language Processing (NLP) and text mining. NLP is an important part of Artificial Intelligence (AI) that focuses on teaching computers how to extract meaning from data.

Due to the rapid growth of Internet usage, huge amounts of data (in the form of text, audio, images, and video) are generated on a daily basis. To derive insights from this data, we first have to preprocess it before passing it to a machine learning model.

Apart from NLTK, there are other Python packages that can be used for NLP, such as spaCy, Gensim, Polyglot, TextBlob, and Pattern.

 

Installation of NLTK

To install the NLTK package, run the following command in your terminal:

$ pip install nltk
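
NLTK itself is only the library; most of its models and corpora are downloaded separately. The examples later in this article use the punkt tokenizer models and the stopwords corpus, so you can optionally fetch them right away from a Python shell (the notes in the relevant sections below also show this step when an error occurs):

import nltk

nltk.download('punkt')      # models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # stop word lists used in the noise removal step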

 

The steps for text preprocessing are:

  1. Convert text into lowercase
  2. Tokenizing
  3. Removing Noise
  4. Stemming

Here is the sample text for preprocessing:

Charles Babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate.

 

1. Convert text into lowercase

This is one of the important steps in Natural Language Processing. In order to treat two variants of a word, such as “nltk” and “NLTK”, as the same token, we first convert the whole text, whatever its original format, into lowercase.

We can simply use the built-in lower() method provided by Python to convert the text into lowercase.

text = "Charles Babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate."

text = text.lower()

print(text)

Output:

charles babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate.

 

2. Tokenizing

Tokenization is the process of splitting text into chunks, either words or sentences, which helps in analyzing the sequence of words in the text.

For sentence-level tokenization, we can use the sent_tokenize function provided by NLTK:

from nltk.tokenize import sent_tokenize

text = "Python is an interpreted high-level general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation."

sentence_tokenize = sent_tokenize(text)

print(sentence_tokenize)

Output:

['Python is an interpreted high-level general-purpose programming language.', "Python's design philosophy emphasizes code readability with its notable use of significant indentation."]

Note: If an error occurs during the execution of this program, you should first download the following model from NLTK:

>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/shiv/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip. 
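
Alternatively, instead of waiting for the error, you can check up front whether the punkt model is already available; a minimal sketch using nltk.data.find:

import nltk

try:
    # nltk.data.find raises LookupError if the resource is not installed yet
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')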

Since the sample text mentioned above contains only one sentence, sentence tokenization is not very useful here. We can apply word-level tokenization to that text instead:

from nltk.tokenize import word_tokenize

text = 'charles babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate.'

tokens = word_tokenize(text)

print(tokens)

Output:

['charles', 'babbage', ',', 'who', 'was', 'born', 'in', '1791', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']

Notice that every word and punctuation mark in the sample text becomes a separate token, and the result is returned as a Python list.

 

3. Removing Noise

Noise removal is the process of stripping out irrelevant characters, called noise in NLP, which provide no meaning when analyzing text. The most common kinds of noise are numbers, punctuation, stop words, and extra whitespace.

Removing Numbers

tokens = ['charles', 'babbage', ',', 'who', 'was', 'born', 'in', '1791', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into',
         'machines', 'that', 'could', 'calculate', '.']


# Removing numbers
remove_numbers = [token for token in tokens if not token.isdigit()]

print(remove_numbers)

Output:

['charles', 'babbage', ',', 'who', 'was', 'born', 'in', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']

Here the number ‘1791’ has been successfully removed from the token list.
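
Note that isdigit() only drops tokens made up entirely of digits. If you also want to remove tokens that merely contain digits (for example “2nd” or “1,791”), a hypothetical variant using a regular expression could look like this:

import re

# hypothetical token list for illustration
tokens = ['charles', 'babbage', 'born', 'in', '1791', '2nd', 'edition']

# keep only tokens that contain no digit at all
remove_numbers = [token for token in tokens if not re.search(r'\d', token)]

print(remove_numbers)

Output:

['charles', 'babbage', 'born', 'in', 'edition']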

 

Removing Punctuation

import string

tokens = ['charles', 'babbage', ',', 'who', 'was', 'born', 'in', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 
         'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']

remove_punctuations = [token for token in tokens if not token in string.punctuation]

print(remove_punctuations)

Output:

['charles', 'babbage', 'who', 'was', 'born', 'in', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate']

Punctuation marks like ‘,’ and ‘.’ are removed from the list by checking each token against the string.punctuation constant.
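
The check above only drops tokens that consist entirely of punctuation. If punctuation is attached to a word, for example when text is split on whitespace instead of tokenized with word_tokenize, a small sketch using str.translate can strip it from inside each token:

import string

# naive whitespace split for illustration, leaving punctuation attached to words
words = 'charles babbage, born in 1791, could calculate.'.split()

table = str.maketrans('', '', string.punctuation)
cleaned = [word.translate(table) for word in words if word.translate(table)]

print(cleaned)

Output:

['charles', 'babbage', 'born', 'in', '1791', 'could', 'calculate']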

 

Removing Stop words

Words like “the”, “and”, “in”, “is”, and “or” provide little information during text analysis, so we remove them to reduce the size of the data and the space needed to process a particular text.

from nltk.corpus import stopwords

tokens = ['charles', 'babbage', 'who', 'was', 'born', 'in', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 
         'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate']

lang_stopwords = stopwords.words("english")

remove_stopwords = [token for token in tokens if token not in lang_stopwords]

print(remove_stopwords)

Output:

['charles', 'babbage', 'born', 'regarded', 'father', 'computing', 'research', 'machines', 'could', 'calculate']

Note: The first time you run a program that uses “stopwords” from NLTK, you have to download the “stopwords” corpus for your NLTK installation:

>>> import nltk
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/shiv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
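
If you are curious which words are filtered out in this step, you can inspect the English stop word list directly (its exact size depends on your NLTK version):

from nltk.corpus import stopwords

lang_stopwords = stopwords.words("english")

print(len(lang_stopwords))   # number of English stop words in your NLTK version
print(lang_stopwords[:10])   # the first few entries, e.g. 'i', 'me', 'my', ...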

 

4. Stemming

Stemming is the process of reducing a word to its root form. During text analysis, an NLP algorithm should treat three different words like “caring”, “cares”, and “careful” as the same word.

For that, we have to reduce those words to their root word, i.e. “care”. We can easily perform stemming using the NLTK library:

from nltk import SnowballStemmer

lang="english"

stemmer = SnowballStemmer(lang)

tokens = ['charles', 'babbage', 'born', 'regarded', 'father',
         'computing', 'research', 'machines', 'could', 'calculate']

stemming_tokens = [stemmer.stem(token) for token in tokens]

print("Original tokens", tokens, sep='\n')

print('---------------------------')

print("Stemming tokens", stemming_tokens, sep='\n')

Output:

Original tokens
['charles', 'babbage', 'born', 'regarded', 'father', 'computing', 'research', 'machines', 'could', 'calculate']
---------------------------
Stemming tokens
['charl', 'babbag', 'born', 'regard', 'father', 'comput', 'research', 'machin', 'could', 'calcul']

Here the word “regarded” is stemmed to “regard”, “computing” to “comput”, “machines” to “machin”, and so on.
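
As the output shows, stems such as “charl” and “babbag” are not real English words. If you need dictionary words instead, lemmatization is a common alternative to stemming. Below is a minimal sketch using NLTK's WordNetLemmatizer; it is not part of the steps above and requires downloading the wordnet corpus first:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data required by the lemmatizer

lemmatizer = WordNetLemmatizer()

tokens = ['machines', 'computing', 'regarded']

# lemmatize() treats words as nouns by default; pass pos='v' to lemmatize as verbs,
# e.g. 'computing' -> 'compute' and 'regarded' -> 'regard'
print([lemmatizer.lemmatize(token) for token in tokens])
print([lemmatizer.lemmatize(token, pos='v') for token in tokens])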

 

Here is the summarized code combining the above steps:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import SnowballStemmer
import string


""""
    Python programm to preprocess text using
    NLTK library
                """

# Method: text_preprocessing
# Input: text
# Output: preprocessed text
def text_preprocessing(text):
    # convert text to lowercase
    text = text.lower()

    # word tokenizing
    tokens = word_tokenize(text)

    # removing noise: numbers, stopwords, and punctuation
    lang_stopwords = stopwords.words("english")
    tokens = [token for token in tokens
              if not token.isdigit()
              and token not in string.punctuation
              and token not in lang_stopwords]
    
    # stemming tokens
    stemmer = SnowballStemmer('english')
    tokens = [stemmer.stem(token) for token in tokens]

    # join tokens and form string
    preprocessed_text = " ".join(tokens)

    return preprocessed_text

# sample text
text = "Charles Babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate."

print("The preprocessed text of sample text is:", text_preprocessing(text), sep='\n')

Output:

The preprocessed text of sample text is:
charl babbag born regard father comput research machin could calcul
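
In a real project you would typically apply text_preprocessing to a whole collection of documents before feeding them to a machine learning model; a small hypothetical usage sketch (the second sentence is made up for illustration):

documents = [
    "Charles Babbage, who was born in 1791, is regarded as the father of computing.",
    "Ada Lovelace wrote what is considered the first computer program in 1843.",
]

preprocessed_documents = [text_preprocessing(doc) for doc in documents]

for doc in preprocessed_documents:
    print(doc)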

 

Conclusion

Hence, in this blog post, we successfully preprocessed the sample text using the Python NLTK library.

We saw how raw text can be converted into a cleaner, more meaningful form so that algorithms can extract insights from it quickly. Text preprocessing is one of the important steps that should be implemented in every NLP project.

If you have any problems, feel free to drop a comment down below.

Happy Coding:-)
