Natural Language Processing (NLP) is a field of Artificial Intelligence whose purpose is finding computational methods to interpret human language as it is spoken or written. The idea of NLP goes beyond a mere classification task, which could be carried out by ML algorithms or Deep Learning NNs. Indeed, NLP is about interpretation: you want to train your model not only to detect frequent words, count them, or strip out noisy punctuation; you want it to tell you whether the mood of a conversation is positive or negative, whether the content of an e-mail is mere advertising or something important, whether reviews of thriller books in recent years have been good or bad.

The good news is that, for NLP, we are provided with a variety of APIs, first among them the Google APIs for speech and text recognition. Furthermore, there are interesting libraries, also available in Python, that offer pre-trained models able to analyze written text. One of these libraries is TextBlob.

In this article, I’m going to dwell on some functionalities of TextBlob and show how you can combine them with useful functions of your own. To do that, I will run my Python code in a Jupyter Notebook.

from textblob import TextBlob
text = TextBlob("This is my first message. I will use it as an example. There are so many examples.")

text.tags

Output: [('This', 'DT'),
         ('is', 'VBZ'),
         ('my', 'PRP$'),
         ('first', 'JJ'),
         ('message', 'NN'),
         ...]

Our first operation was a kind of grammatical analysis (part-of-speech tagging) of our text: it returns a list of tuples of the kind (word, tag). Namely, it recognizes ‘is’ as a verb (VBZ), ‘message’ as a noun (NN) and so forth.
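Since the tags are plain (word, tag) tuples, they are easy to filter. As a minimal sketch, relying on the Penn Treebank convention that noun tags start with ‘NN’, we can keep only the nouns:

#keep only the nouns, i.e. the words whose tag starts with 'NN'
[word for word, tag in text.tags if tag.startswith('NN')]

On our example text, this should return something like ['message', 'example', 'examples'].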

We can also perform some operations to modify our text. Let’s examine some handy interventions we can make:

#we can split our text into sentences

text.sentences

Output: [Sentence("This is my first message."),
         Sentence("I will use it as an example."),
         Sentence("There are so many examples.")]

#we can easily access each sentence as a list and then print its words

text.sentences[0].words[4] #accessing the fifth word of the first sentence

Output: 'message'

#we can apply grammatical transformations to single words

text.sentences[0].words[4].pluralize() #convert singular to plural

Output: 'messages'

#we can translate our text (note: translate() and detect_language() call the Google Translate API and have been deprecated in newer TextBlob releases) ...

text.sentences[0].translate(to='it')

Output: Sentence("Questo è il mio primo messaggio")

#... and detect its language 

text.sentences[0].detect_language()

Output: 'en'


#we can correct our sentence

mistake = TextBlob("This mesage contains erors. Lots of erors.")
mistake.correct()

Output: TextBlob("His message contains errors. Lots of errors.")

#and even get suggestions for spelling corrections

from textblob import Word
w = Word('erors')
w.spellcheck()

Output: [('errors', 1.0)] #the second element of the tuple indicates the confidence of the correction, in this case 100%


There are plenty of analyses you can run on your text. Now let’s examine some further functions you can build to interrogate it.

The functions I’m talking about are Term Frequency (TF) and Inverse Document Frequency (IDF), whose product returns the so-called TF-IDF score.

  • TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
  • IDF(t) = log(Total number of documents / Number of documents with term t in it).

It’s a way to score the importance of a word in a document based both on how often it appears in that document and on how rare it is across a collection of documents. These two functions are combined since there are two factors you have to take into consideration:

  • If a word appears frequently in a document, it’s important: the TF will give it a high score, since the numerator is high.
  • If a word appears in many documents, it’s not a unique identifier: the IDF will give it a low score.

Therefore, common words like ‘the’ and ‘for’, which appear in many documents, will be penalized. Words that appear frequently in a single document will be rewarded.
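To make the formulas concrete, here is a toy calculation with made-up numbers: suppose the word ‘magic’ appears 3 times in a 100-word document and shows up in 2 documents of a 10-document corpus.

import math

#toy numbers, just to illustrate the two formulas above
tf_magic = 3 / 100                  #TF = 0.03
idf_magic = math.log(10 / 2)        #IDF = log(5), roughly 1.61
tfidf_magic = tf_magic * idf_magic  #TF-IDF, roughly 0.048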

Let’s see how to define those functions and apply them to a book review (I’ve used one of the reviews of the first Harry Potter book) split into three documents:

from textblob import TextBlob
import math

review_1 = TextBlob("""The first book in Rowling’s seven book wizardry
extravaganza is quite undeniably one of the most popular books to have
ever been written. For anyone who has been living under a rock for the
past fifteen years the Harry Potter books tell the story of orphan Harry
James Potter and the discovery of his secret magical powers and the role
he plays in the safety of the hidden world of witchcraft and wizards.
The first book, Harry Potter and the Philosopher’s Stone (known as the
Sorcerer’s Stone in the United States) begins on a seemingly ordinary
night on a quiet street in Surrey, England. Three people gather, an
elderly man, a stern faced woman and a huge bearded motorbiker, and they
talk about a strange and confusing series of events, including tragedy
and murder, and why this means that they must leave their charge – a
sleeping babe wrapped in blankets – on the doorstep of one extremely
regular house on that extremely regular street.""")
review_2 = TextBlob("""This tiny sleeping child is Harry Potter, whose
parents supposedly died in a car crash, leaving him with a lightening
bolt shaped scar across his forehead. He is raised by his mother’s
sister, Petunia and her husband Vernon an office worker in a drill
company. However on his eleventh birthday a series of very bizarre
events lead to the discovery of Harry’s true identity, he’s a wizard.""")
review_3 = TextBlob("""From there he learns that his parents were
murdered by an evil and power hungry psychopath named Lord Voldermort
and that Harry’s true place is at Hogworts School of Witchcraft and
Wizardry, a magical castle hidden somewhere in the UK. After a fantastic
journey on a huge red steam train from a hidden on secret platform at
London’s King Cross station Harry finally begins to feel at home and
accepted at Hogworts, finding friendship in two fellow students Ron
Weasley and Hermione Granger.""")

def tf(word, blob):
    #relative frequency of the word within a single document
    return blob.words.count(word) / len(blob.words)


def n_containing(word, bloblist):
    #number of documents containing the word, plus one as a smoothing term
    #(membership is checked against the tokenized words, not raw substrings)
    return 1 + sum(1 for blob in bloblist if word in blob.words)


def idf(word, bloblist):
    #the numerator gets the same +1, so a word present in every document scores 0
    return math.log((1 + len(bloblist)) / n_containing(word, bloblist))


def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

#let's check the scores of the word 'the'


bloblist = [review_1, review_2, review_3]
tf_score = tf('the', review_1)
idf_score = idf('the', bloblist)
tfidf_score = tfidf('the', review_1, bloblist)
print(tf_score, idf_score, tfidf_score)

I want to show how the word ‘the’ is treated. If we print its TF and IDF scores, we obtain the following:

tf_score

Output: 0.07926829268292683

idf_score

Output: 0.0

So our TF score rewards the word ‘the’, since it is commonly used. Nevertheless, we know that this information is trivial: ‘the’ is an article and is bound to be frequent regardless of its importance in the document. And this is exactly what the IDF score captures: since ‘the’ appears in all three reviews, n_containing returns 4 (3 documents plus the smoothing term) and idf computes log(4/4) = 0. Hence, the word ‘the’ turns out not to be important at all, which is true.
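Before moving on, note that the same functions can be reused to rank the most characteristic words of each review. Here is a minimal sketch of mine (top_words is not part of TextBlob), built on the tfidf function and the bloblist defined above:

#rank the words of a review by their TF-IDF score and keep the top n
def top_words(blob, bloblist, n=5):
    scores = {word: tfidf(word, blob, bloblist) for word in set(blob.words)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

top_words(review_2, bloblist)

By construction, ubiquitous words like ‘the’ sink to the bottom of this ranking, while terms specific to a single review float to the top.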

The final property I want to show is sentiment analysis: it returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], from very negative to very positive, while the subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
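As a quick sanity check before looking at our reviews, you can feed the analyzer a toy sentence of mine whose mood is obvious:

#an enthusiastic toy sentence: both scores should come out close to the upper end of their ranges
TextBlob("What a wonderful, amazing book!").sentiment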

Let’s analyze our first review:

review_1.sentiment

Output: Sentiment(polarity=0.07222222222222223, subjectivity=0.45247863247863246)

If you think about these results, you can see they make sense. Indeed, the review we analyzed, more than a judgment, was a kind of summary: positive, but not too subjective. The polarity is correctly greater than 0 but not by much, while the subjectivity sits roughly in the middle of its range.

Those are just a small sample of the functionalities you could implement in your analysis, yet they should give you an idea of how some ‘entities’ that live in your smartphones (I’m talking about Siri, Alexa, Google Home) are able to tell you a joke if you ask them, or (maybe a bit more useful in real life) to run a search for you based on the question you formulated.

And if you think that some advertisements you see on your devices are there just because you’ve been talking (yes, only talking) about them in the last few minutes…well, all of this involves NLP methods.
