Natural Language Processing (NLP) is a field of Artificial Intelligence whose purpose is to find computational methods for interpreting human language as it is spoken or written. NLP goes beyond a mere classification task that could be carried out by ML algorithms or Deep Learning NNs. Indeed, NLP is about interpretation: you want to train your model not only to detect frequent words, count them, or strip out noisy punctuation; you want it to tell you whether the mood of a conversation is positive or negative, whether the content of an e-mail is mere advertising or something important, whether the reviews of thriller books in recent years have been good or bad.
The good news is that, for NLP, a variety of APIs are available, first among them the Google APIs for speech and text recognition. Furthermore, there are interesting libraries, also available in Python, that offer pre-trained models able to analyze written text. One of these libraries is TextBlob.
In this article, I’m going to dwell on some functionalities of TextBlob and show how you can combine it with useful custom functions. To do that, I will run my Python code in a Jupyter Notebook.
from textblob import TextBlob

text = TextBlob("This is my first message. I will use it as an example. There are so many examples")
text.tags

Output: [(u'This', u'DT'), (u'is', u'VBZ'), (u'my', u'PRP$'), (u'first', u'JJ'), (u'message', u'NN')] ...
Our first operation was a kind of grammatical analysis (part-of-speech tagging) of our first sentence: it returns a list of tuples of the kind (word, tag). Namely, it recognizes ‘is’ as a verb (VBZ), ‘message’ as a noun (NN) and so forth.
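The tag abbreviations come from the Penn Treebank tagset. As a quick reference, here is a small lookup table (my own snippet, not part of TextBlob) that decodes the tags we just saw:

```python
# A few Penn Treebank tags from the output above, mapped to plain names.
PENN_TAGS = {
    "DT":   "determiner",
    "VBZ":  "verb, 3rd person singular present",
    "PRP$": "possessive pronoun",
    "JJ":   "adjective",
    "NN":   "noun, singular",
}

# The (word, tag) pairs returned by text.tags for our first sentence.
tags = [("This", "DT"), ("is", "VBZ"), ("my", "PRP$"), ("first", "JJ"), ("message", "NN")]
for word, tag in tags:
    print(f"{word:10s} {tag:5s} {PENN_TAGS[tag]}")
```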
We can also perform some operations to modify our text. Let’s examine some nice interventions you can make:
#we can split our text into sentences
text.sentences

Output: [Sentence("This is my first message."), Sentence("I will use it as an example."), Sentence("There are so many examples.")]

#we can easily access each sentence as a list and then print its words
text.sentences[0].words[4] #accessing the fifth word of the first sentence

Output: 'message'

#we can extract some grammatical information
text.sentences[0].words[4].pluralize() #convert singular to plural

Output: 'messages'

#we can translate our text ...
text.sentences[0].translate(to='it')

Output: Sentence("Questo è il mio primo messaggio")

#... and detect its language
text.sentences[0].detect_language()

Output: u'en'

#we can correct our sentence
mistake = TextBlob("This mesage contains erors. Lots of erors.")
mistake.correct()

Output: TextBlob("His message contains errors. Lots of errors.")

#and even get a suggestion about spelling correction
from textblob import Word
w = Word('erors')
w.spellcheck()

Output: [(u'errors', 1.0)] #the second element indicates the confidence of the correction, in this case 100%
There are plenty of analyses you can run on your text. Now let’s examine some further functions you can build to interrogate it.
The functions I’m talking about are Term Frequency (TF) and Inverse Document Frequency (IDF), whose product returns the so-called TF-IDF score.
- TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
- IDF(t) = log(Total number of documents / Number of documents with term t in it).
It’s a way to score the importance of a word in a document based on how often it appears in that document and how rarely it appears across the other documents. These two functions are combined since there are two factors you have to take into consideration:
- If a word appears frequently in a document, it’s important: the TF will give it a high score, since the numerator is high.
- If a word appears in many documents, it’s not a unique identifier: the IDF will give it a low score.
Therefore, common words like ‘the’ and ‘for’, which appear in many documents, will be penalized. Words that appear frequently in a single document will be rewarded.
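Before applying these definitions to real reviews, it may help to run them by hand on a toy corpus. The following sketch (pure Python, with made-up example sentences) implements exactly the two formulas above and shows how ‘the’ scores zero while a rarer word gets a positive score:

```python
import math

# Toy corpus: three tiny "documents" (invented for illustration).
docs = [
    "the wizard reads the book".split(),
    "the book is on the shelf".split(),
    "the cat sleeps".split(),
]

def tf(word, doc):
    # (occurrences of word in doc) / (total words in doc)
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total documents / documents containing the word)
    n = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'the' appears in all three documents -> idf = log(3/3) = 0, so tf-idf = 0
print(tfidf("the", docs[0], docs))
# 'wizard' appears in only one document -> positive tf-idf
print(tfidf("wizard", docs[0], docs))
```

Note that the implementation applied to the reviews below adds 1 to both the numerator and the document count in the IDF, a common smoothing trick that avoids division by zero for unseen words; on words present in the corpus, the behavior is analogous.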
Let’s see how to define those functions and apply them to a book review (I’ve used one of the reviews of the first Harry Potter book), split into three documents:
from __future__ import division
from textblob import TextBlob
import math

review_1 = TextBlob("The first book in Rowling’s seven book wizardry extravaganza is quite undeniably one of the most popular books to have ever been written. For anyone who has been living under a rock for the past fifteen years the Harry Potter books tell the story of orphan Harry James Potter and the discovery of his secret magical powers and the role he plays in the safety of the hidden world of witchcraft and wizards. The first book, Harry Potter and the Philosopher’s Stone (known as the Sorcerer’s Stone in the United States) begins on a seemingly ordinary night on a quiet street in Surrey, England. Three people gather, an elderly man, a stern faced woman and a huge bearded motorbiker, and they talk about a strange and confusing series of events, including tragedy and murder, and why this means that they must leave their charge – a sleeping babe wrapped in blankets – on the doorstep of one extremely regular house on that extremely regular street.")

review_2 = TextBlob("This tiny sleeping child is Harry Potter, whose parents supposedly died in a car crash, leaving him with a lightening bolt shaped scar across his forehead. He is raised by his mother’s sister, Petunia and her husband Vernon an office worker in a drill company. However on his eleventh birthday a series of very bizarre events lead to the discovery of Harry’s true identity, he’s a wizard.")

review_3 = TextBlob("From there he learns that his parents were murdered by an evil and power hungry psychopath named Lord Voldermort and that Harry’s true place is at Hogworts School of Witchcraft and Wizardry, a magical castle hidden somewhere in the UK. After a fantastic journey on a huge red steam train from a hidden on secret platform at London’s King Cross station Harry finally begins to feel at home and accepted at Hogworts, finding friendship in two fellow students Ron Weasley and Hermione Granger.")

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return 1 + sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(float(1 + len(bloblist)) / float(n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

#let's check the scores of the word 'the'
bloblist = [review_1, review_2, review_3]
tf_score = tf('the', review_1)
idf_score = idf('the', bloblist)
tfidf_score = tfidf('the', review_1, bloblist)
print(tf_score, idf_score, tfidf_score)
I want to show how the word ‘the’ is treated. If we print its TF and IDF scores, we obtain the following:
tf_score

Output: 0.07926829268292683

idf_score

Output: 0.0
So our TF score rewards the word ‘the’, since it is commonly used. Nevertheless, we know that this information is trivial, since ‘the’ is an article and is bound to be frequent regardless of its importance in the document. And this is captured by the IDF score, equal to zero. Hence, the word ‘the’ turns out not to be important at all, which is true.
The final property I want to show is sentiment analysis: it returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], while subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
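Under the hood, TextBlob’s default analyzer is lexicon-based: it looks words up in a sentiment lexicon and averages their polarity and subjectivity scores. As a very rough sketch of that idea in plain Python, here is a naive version; the mini-lexicon and its values are invented for illustration and are far simpler than the real one:

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
LEXICON = {
    "good":     ( 0.7, 0.60),
    "great":    ( 0.8, 0.75),
    "terrible": (-1.0, 1.00),
    "boring":   (-0.6, 0.80),
}

def naive_sentiment(text):
    # Average the (polarity, subjectivity) of the lexicon words found;
    # return (0.0, 0.0) if no word matches.
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return (0.0, 0.0)
    polarity = sum(h[0] for h in hits) / len(hits)
    subjectivity = sum(h[1] for h in hits) / len(hits)
    return (polarity, subjectivity)

# "great" (+0.8) and "boring" (-0.6) average out to a mildly positive score.
print(naive_sentiment("a great book with a boring middle"))
```

The real analyzer also handles negations, intensifiers (“very good”) and multi-word entries, but the averaging principle is the same.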
Let’s analyze our first review:
review_1.sentiment

Output: Sentiment(polarity=0.07222222222222223, subjectivity=0.45247863247863246)
If you think about these results, you can see they make sense. Indeed, the review we analyzed was more of a summary than a judgment: positive, but not too subjective. The polarity is correctly greater than 0 but not by much, while the subjectivity sits below 0.5, consistent with a fairly objective text.
These are just a small sample of the functionalities you could implement in your analysis, yet they should give you an idea of how some ‘entities’ living in your smartphones (I’m talking about Siri, Alexa, Google Home) are able to tell you a joke if you ask them, or (maybe a bit more useful in real life) are able to run a search for you based on the question you formulated.
And if you think that some advertisements you see on your devices are there just because you’ve been talking (yes, only talking) about them in the last few minutes… well, all of this involves NLP methods.