Preprocessing of textual data
A lot of captured data exists in the form of text. Social media posts, reviews, books, captions, transcripts, news articles, completed forms and so on are all examples of text-heavy data that contain potentially useful information. For example, if we want to develop a sentiment analysis program that determines whether a review is positive or negative, we will need to process a lot of text-based reviews. Another example of using text in machine learning solutions is the automatic generation of captions for images. Hopefully these examples convince you of the utility of text data in your machine learning solutions.
Text data is useful for machine learning solutions but also comes with some unique issues. Language is complicated, and so is the text which represents it. We have different tenses of verbs, words which can mean different things depending on the context, and many ways of saying the same thing. On top of this, most machine learning models expect numerical inputs rather than strings. We therefore need to pre-process our input data to make learning easier for machine learning models. One thing that we will ignore in this discussion is textual data which represents categorical variables, such as country names, as categorical data is better dealt with in other ways; see the encoding section for example.
There are a number of packages in Python built for text pre-processing, such as spacy and NLTK. Another key package is the inbuilt re package, which is used to work with regular expressions in Python; regular expressions are essentially a language unto themselves. We won't cover them in depth here, but they are very useful for cleaning text, and websites such as regex101 exist to test out your regular expressions.
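As a small illustration of the kind of job re is often used for in pre-processing, the sketch below lower-cases a string and strips out everything except letters, digits and whitespace; the pattern is just an illustrative choice rather than a recommendation:
import re
text = "Great product!!! Would buy again :) 10/10"
# Lower-case the text and keep only letters, digits and whitespace
cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
print(cleaned)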
Lemmatization
Lemmatization is the process of grouping together different forms of a word into a common term. For example, learn, learning, learns, learned and learnt could all be reduced to "learn". Using the spacy package:
import spacy
# Load the small English pipeline, which includes a lemmatizer component
nlp = spacy.load("en_core_web_sm")
doc = nlp("study studying studies studied")
# Each token exposes its lemma through the .lemma_ attribute
print([token.lemma_ for token in doc])
['study', 'study', 'study', 'study']
Stemming
Stemming is similar to lemmatization, but instead of reducing words to a common dictionary form we simply chop off the potentially variable endings. An example of stemming would be reducing the list study, studying, studies, studied to "studi". For example, using the PorterStemmer from the NLTK package:
from nltk.stem import PorterStemmer
# The Porter stemmer applies a fixed set of suffix-stripping rules
ps = PorterStemmer()
print(ps.stem('studying'))
print(ps.stem('studies'))
print(ps.stem('study'))
print(ps.stem('studied'))
studi
studi
studi
studi
Most stemmers, such as the Porter stemmer used above, are rule-based and prone to both over- and under-stemming. For instance, if instead of study we use buy for our example, then the equivalent output would be:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem('buying'))
print(ps.stem('buys'))
print(ps.stem('buy'))
print(ps.stem('bought'))
buy
buy
buy
bought
Here, the irregular past tense of the verb means that "bought" isn't stemmed back to the common "buy".
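For comparison, a lemmatizer that uses part-of-speech information and exception tables can usually map such irregular forms back to their base form. A minimal sketch with the spacy pipeline from earlier, assuming the tagger labels "bought" and "buys" as verbs:
import spacy
nlp = spacy.load("en_core_web_sm")
# The lemmatizer uses part-of-speech tags and exception tables,
# so an irregular form like "bought" can still map back to "buy"
doc = nlp("She bought some books and buys more every week.")
print([(token.text, token.lemma_) for token in doc if token.pos_ == "VERB"])
# typically prints [('bought', 'buy'), ('buys', 'buy')]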
Removal of stop words
Stop words are words so common that they carry little information, such as "the", "and", "or" and "for". Removing these words allows a system to focus on the more information-rich parts of a sentence rather than on terms which appear everywhere but hold little contextual information.
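A minimal sketch of stop word removal, using NLTK's built-in list of English stop words (the example sentence is just an illustration):
import nltk
from nltk.corpus import stopwords
# Fetch the stop word list the first time this is run
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
words = "this is an example of a sentence with some stop words".split()
# Keep only the words that do not appear in the stop word list
filtered = [w for w in words if w not in stop_words]
print(filtered)
['example', 'sentence', 'stop', 'words']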
Tokenization
Tokenization is the process of breaking a string into smaller, hopefully meaningful, parts. A standard approach is to split on whitespace and punctuation. An NLTK example is:
import nltk
from nltk.tokenize import word_tokenize
# word_tokenize relies on the pre-trained Punkt models
nltk.download("punkt")
tokenized_words = word_tokenize("This is an example sentence. This is the second sentence. Hello Mr. Bond.")
print(tokenized_words)
['This', 'is', 'an', 'example', 'sentence', '.', 'This', 'is', 'the', 'second', 'sentence', '.', 'Hello', 'Mr.', 'Bond', '.']
Of course, you can amend this tokenization process to split on different characters or patterns, as sketched below.
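For instance, a minimal sketch using NLTK's RegexpTokenizer, which here keeps only runs of word characters (the pattern is just an illustrative choice):
from nltk.tokenize import RegexpTokenizer
# Match runs of word characters, dropping punctuation entirely
tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize("This is an example sentence. Hello Mr. Bond."))
['This', 'is', 'an', 'example', 'sentence', 'Hello', 'Mr', 'Bond']
You can also tokenize your string into sentences and other structures with the right configuration. An example of a sentence tokenizer using the spacy package is: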
from spacy.lang.en import English
# Blank English pipeline with a rule-based sentence boundary detector
nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is an example sentence. This is the second sentence. Hello Mr. Bond.")
for sent in doc.sents:
    print(sent)
This is an example sentence.
This is the second sentence.
Hello Mr. Bond.
You can see here that only 3 sentences have been identified, even though there is a full stop inside "Mr. Bond". This is because the tokenizer recognizes certain uses of full stops, such as in titles like "Mr." and abbreviations for countries like "U.K.", and keeps them as part of the token rather than treating them as sentence-ending punctuation.