Python is a general-purpose language and, as such, it offers a great number of packages, ranging from scientific computing to data visualization, from statistical tools to machine learning.

It is almost impossible to know every available package; however, a few of them are pivotal if your task consists of analyzing data and building machine learning models on it.

Hence, in this article I’ll dwell on 5 main packages which will make your Python extremely versatile:

  • Numpy
  • Pandas
  • Matplotlib
  • Scikit-learn
  • Seaborn

So let’s start!

Numpy

Numpy is a package for scientific computing. It allows you to perform all kinds of mathematical and statistical operations. In particular (and this is the reason why it is fundamental in Machine Learning), it allows you to perform N-dimensional computations very quickly and easily. Anytime you need to manipulate vectors and matrices, you know Numpy is the tool for that purpose.

Let’s now see some examples.

Numpy’s main object is the homogeneous multidimensional array, which might look like either a vector (if its dimensions are n,1) or a matrix (if its dimensions are n,m). Let’s create a first array containing 1, 2, 3:

import numpy as np
a=np.array([1,2,3])
a

Output: array([1, 2, 3])

We can check some properties of this array with the following methods:

a.shape
Output: (3,)

a.dtype
Output: dtype('int32')

type(a)
Output: numpy.ndarray
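Note that the array is homogeneous: every element shares the same dtype. If needed, you can also set the dtype explicitly at creation; a quick sketch:

a_float = np.array([1, 2, 3], dtype=np.float64)
a_float.dtype  # dtype('float64')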

We can also initialize arrays specifying the number of components and the shape. Namely, if I want to create a 3×4 matrix with numbers from 0 to 11, I will write:

b=np.arange(12).reshape(3,4)
b

Output: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

You can also convert objects that are not arrays, such as lists, into arrays. This is very useful whenever you have to make computations with the items of lists. Let’s say you want to subtract these two lists:

list_1=[1,2,3]
list_2=[4,5,6]
list_2-list_1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-4c6e19f7626c> in <module>
      1 list_1=[1,2,3]
      2 list_2=[4,5,6]
----> 3 list_2-list_1

TypeError: unsupported operand type(s) for -: 'list' and 'list'

As you can see, an error is raised since you cannot subtract two lists. But if you use this trick:

np.asarray(list_2)-np.asarray(list_1)

Output: array([3, 3, 3])

The problem is bypassed!
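More generally, once your data live in arrays, operations become element-wise and matrix algebra is one method call away. A minimal sketch (the values below are made up purely for illustration):

A = np.arange(6).reshape(2,3)  # a 2x3 matrix
B = np.ones((3,2))             # a 3x2 matrix of ones

A * 2              # element-wise multiplication by a scalar
A + 10             # broadcasting: 10 is added to every entry
A.dot(B)           # matrix product, shape (2, 2)
A.T                # transpose, shape (3, 2)
A.mean(), A.std()  # basic statistics over all the entries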

Pandas

Pandas provides data structures and data analysis tools which are essential for cleaning your data and employing it in machine learning tasks.

The main objects in pandas are DataFrames, which are nothing but structured datasets that can be easily modified and accessed. You can either create your dataframe or import it (from the web, csv files, text files…).
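For instance, importing a dataframe from a csv file takes a single line (the file name below is just a placeholder for your own file):

import pandas as pd

df = pd.read_csv('my_data.csv')  # 'my_data.csv' is a hypothetical file path
df.head()                        # inspect the first 5 rows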

Let’s create one from scratch:

import pandas as pd 

data = [['alex', 10], ['tom', 15], ['jim', 14]] 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 
 
df 

We can access the elements of this df as if it were a matrix:

df[:1] #showing only the first row
df.iloc[:,1] #selecting the second column (Age)

Output: 
0    10
1    15
2    14
Name: Age, dtype: int64

All the columns of a pandas dataframe are Series objects:

type(df['Age'])
Output: pandas.core.series.Series
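A Series behaves much like a one-dimensional labelled array, so the usual aggregations are directly available; a quick sketch:

df['Age'].mean()          # average of the Age column
df['Age'].values          # the underlying numpy array
df['Age'].value_counts()  # frequency of each value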

We can also append new columns to our dataset, as well as set indexes:

g=['M','M','M']
df['Gender']=g
df.set_index('Name',inplace=True)
df

Pandas is fundamental whenever you deal with huge amounts of data, since it is also able to summarize relevant information (like the presence of missing values, outliers, means and frequencies and so forth).

df.isnull().sum() #for missing values

Output: 
Age       0
Gender    0
dtype: int64
df.describe() #summary statistics of the numerical columns
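Here no value is actually missing, but if there were some, pandas offers one-liners to handle them; a minimal sketch on the same toy dataframe:

df.dropna()                         # drop the rows containing missing values
df['Age'].fillna(df['Age'].mean())  # or fill them, e.g. with the column mean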

Matplotlib

Matplotlib offers different tools for data visualization. It is not the only visualization package available in Python; nevertheless, it is the most intuitive to use and it produces very nice results.

Let’s see how to plot different graphs:

import matplotlib.pyplot as plt
import numpy as np


labels = ['G1', 'G2', 'G3', 'G4', 'G5']  # example group labels
men_means = [20, 34, 30, 35, 27]
width = 0.35  # width of the bars

x = np.arange(len(labels))  # the label locations

fig, ax = plt.subplots()
ax.bar(x - width/2, men_means, width, label='Men')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('Men Means')

We can also show multiple bars in the same graph:

import matplotlib
import matplotlib.pyplot as plt
import numpy as np


labels = ['G1', 'G2', 'G3', 'G4', 'G5']  # example group labels
men_means = [20, 34, 30, 35, 27]
women_means = [25, 32, 34, 20, 25]
width = 0.35  # width of the bars

x = np.arange(len(labels))  # the label locations

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, men_means, width, label='Men')
rects2 = ax.bar(x + width/2, women_means, width, label='Women')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
ax.set_title('Men and Women Means')

Now let’s draw a histogram of samples from a normal random variable, together with the corresponding normal density curve:

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)  # 1000 random samples
count, bins, ignored = plt.hist(s, 30, density=True)  # normalized histogram
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='r')
plt.show()

Now imagine we want to plot the results of a survey where people were asked to name their favorite Italian pasta. The results of the survey are:

import matplotlib.pyplot as plt


labels = 'Gnocchi', 'Tortellini', 'Spaghetti', 'Penne'
sizes = [15, 30, 45, 10]
explode=(0,0,0,0) #no slice is pulled out yet

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  #equal aspect ratio ensures the pie is drawn as a circle

plt.show()

You can also emphasize the most popular answer with the explode option:

explode=(0,0,0.1,0)

As a data scientist, data visualization is pivotal, since you will always have to present your results in an intuitive and powerful way. Furthermore, relevant graphs are often useful for identifying patterns in data even before you start building models, hence they might suggest which kind of analysis to run.

Scikit-Learn

This is probably the most important package for machine learning, since it provides all the main algorithms, ranging from supervised to unsupervised, from classification to regression. Plus, it includes evaluation metrics such as ROC, MSE, R-squared and so forth, which can be easily computed after each training run of your algorithm.

Let’s see a very easy example of an ML task, using the Boston House Price dataset and trying to model the price with respect to just one variable, so that we can visualize it. Since this is a regression task (the target variable ‘price’ is continuous), we will use a Simple Linear Regression:

import pandas as pd 
from sklearn.datasets import load_boston

dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target

df.head()

Since we want to build a simple linear regression (only one feature), we need to reduce dimensionality from 13 to 1, and to do so without losing relevant information we run a Principal Component Analysis:

from sklearn.decomposition import PCA   
pca = PCA(1)    
projected = pca.fit_transform(dataset.data)  
print(dataset.data.shape)  
print(projected.shape) 

Output:
(506, 13)
(506, 1)
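To check how much of the original information the single component actually retains, you can inspect the explained_variance_ratio_ attribute of the fitted PCA object; a quick sketch:

print(pca.explained_variance_ratio_)  # fraction of the total variance captured by the first component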

Nice, now let’s import and train our model:

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(projected, dataset.target, random_state=0)
from sklearn.linear_model import LinearRegression  
lm = LinearRegression()  
lm.fit(X_train,y_train)  
y_pred = lm.predict(X_test)

#let's visualize the results

plt.scatter(X_test, y_test,  color='black')  
plt.plot(X_test, y_pred, color='blue', linewidth=3)  
plt.xticks(())  
plt.yticks(())  
plt.xlabel('First Principal component')  
plt.ylabel('price')   
plt.show()

We can also ask for feedback on the performance of our algorithm:

from sklearn.metrics import mean_squared_error, r2_score  
print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))  
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))

Output:

MSE: 73.04
R2: 0.11

As you can see, with a few lines of code (and less than 2 minutes) we trained an ML model without any manual computation.
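If you want a more robust estimate than a single train/test split, scikit-learn also provides cross-validation utilities; a minimal sketch reusing the projected data from above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = cross_val_score(LinearRegression(), projected, dataset.target, cv=5, scoring='r2')  # 5-fold cross-validated R2
print(scores.mean())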

Seaborn

Like matplotlib, seaborn is a Python package for data visualization. However, it is meant to be particularly useful for statistical representations, and it conveys more relevant information about your data.

In particular, it is very handy for showing possible correlations among data: with pairplot() and heatmap() you can have a first, significant glimpse of relationships among all the features (and targets):

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")  #pairwise relationships, colored by species

plt.figure()  #open a new figure for the heatmap
sns.heatmap(df.drop(columns='species').corr(), annot=True)  #correlations among the numeric features

We can also visualize the joint distribution of data (let’s say, of the two features sepal_length and sepal_width):

sns.jointplot(x='sepal_length', y='sepal_width', data=df, height=5)

Finally, let’s have a look at the distribution of sepal_length values for each species:

ax=sns.boxplot(x='species',y='sepal_length',data=df)
ax=sns.stripplot(x='species',y='sepal_length',data=df,jitter=True,edgecolor='gray') #overlay the individual observations

Seaborn is an extremely quick and powerful way to display relevant information and, if you are performing exploratory analysis, it might save you a lot of time by suggesting clues about the best algorithm you could pick.

Needless to say, covering all the potential of these packages would be almost impossible. However, it is important to know which tools you need and how to deploy them during your analysis. As a good practice, remember that whatever kind of computation you need for your analysis, Python provides a quick and smart way to do it: discovering by doing is a very good strategy to explore these tools.

If you are interested in learning more about these packages, the official documentation of each one is the best place to start.
