Whenever you start inspecting and managing a dataset, one of the first inconveniences you may run into is the presence of missing values. Together with possible outliers, they can affect the robustness of your Machine Learning model, so it is worth spending some extra time during your cleaning procedure to investigate the nature of those missing values.
Indeed, there are different ways to treat those values, depending mainly on the reason why they are missing. In general, data can be missing in three different ways:
- At Random: whether an entry is missing does not depend on its own value, but on the values of other (observed) entries
- Completely at Random: whether an entry is missing is related neither to that entry nor to any other entry
- Not at Random: the absence of a value is related to the missing value itself, or to factors that were never recorded (an ‘out-of-data’ hypothesis), so it cannot be fully explained by the observed variables.
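To make the three mechanisms concrete, here is a minimal simulation sketch using the usual statistical definitions, where not-at-random means the chance of a gap depends on the unobserved value itself. The toy columns (age, score), the probabilities and the seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Toy data: `age` is always observed, `score` is the column that loses values.
toy = pd.DataFrame({
    "age": rng.integers(20, 76, size=n),
    "score": rng.integers(0, 11, size=n).astype(float),
})

# Completely at random: every entry has the same 20% chance of being dropped.
mcar_mask = rng.random(n) < 0.20

# At random: the chance depends on another, observed column (age),
# not on the score itself.
mar_mask = rng.random(n) < np.where(toy["age"] > 60, 0.60, 0.05)

# Not at random: the chance depends on the very value that goes missing
# (here, low scores vanish more often).
mnar_mask = rng.random(n) < np.where(toy["score"] < 3, 0.60, 0.05)

for name, mask in [("mcar", mcar_mask), ("mar", mar_mask), ("mnar", mnar_mask)]:
    # Series.mask replaces the flagged entries with NaN.
    toy[f"score_{name}"] = toy["score"].mask(mask)

# Share of missing values in each simulated column.
print(toy.isnull().mean())
```

Comparing the missing share of score_mar across age groups, or of score_mnar across score levels, shows the dependence that the first mechanism lacks.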
Intuitively, understanding the nature of missing values makes a difference in terms of the information they might or might not bring to your analysis. Namely, a value missing at random (or completely at random) is hard to interpret, since you have no clue about the reason it is missing or the kind of information it would have carried. Hence, deleting it from your dataset is probably the least ‘painful’ intervention you could take, keeping in mind that deletion is only free of bias when values are missing completely at random.
On the other hand, if a value is missing because of its relationship with some external or internal factor, you may want to investigate that relationship and, possibly, use it to impute a meaningful substitute for that value.
For example, imagine you are running a survey about the domestic use of technological tools. The people you are going to interview are aged between 20 and 75, and the output of your survey will be, for each interviewee, a score from 0 (for no use of technology at all) to 10 (for technology addiction). You decide to run the survey via e-mail and, once you have collected all the answers, you notice that there are plenty of missing values from people older than 60. You might interpret this as a sign that older people tend not to rely on tech devices as much as younger people do. Hence, you could impute a low score to your missing values, which is consistent with your hypothesis.
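Here is a rough sketch of how that survey scenario could look in pandas. The data are simulated, and the response probabilities, the age bands and the low score of 2 are hypothetical choices made only to illustrate the reasoning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

survey = pd.DataFrame({
    "age": rng.integers(20, 76, size=n),
    "score": rng.integers(0, 11, size=n).astype(float),
})

# Assumed response pattern: people over 60 skip the e-mail survey far more often.
no_answer = rng.random(n) < np.where(survey["age"] > 60, 0.70, 0.05)
survey.loc[no_answer, "score"] = np.nan

# Share of missing scores per age band: this is how the pattern would show up.
age_band = pd.cut(survey["age"], bins=[19, 40, 60, 75],
                  labels=["20-40", "41-60", "61-75"])
missing_by_band = survey["score"].isnull().groupby(age_band, observed=True).mean()
print(missing_by_band)

# Hypothesis-driven imputation: fill the gaps of the over-60 group with a low score.
LOW_SCORE = 2.0  # illustrative value, not a recommendation
over_60_gap = survey["score"].isnull() & (survey["age"] > 60)
survey.loc[over_60_gap, "score"] = LOW_SCORE
```

Gaps among younger respondents are left untouched here, since the hypothesis says nothing about them.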
Of course, if there is no evident correlation which can explain the absence of data, there are ‘safer’ ways to impute missing values, like the mean or median of the variable.
In this article, I’m going to handle the missing values of the Titanic Dataset, using missingno (you can install it via pip) to detect and visualize them.

    import pandas as pd
    import missingno as msno

    # Load the Titanic training set and count missing values per column
    df = pd.read_csv('train.csv')
    df.head()
    df.isnull().sum()

As you can see, the two variables Age and Cabin have a bunch of missing values which need to be handled.

    import missingno as msno

    # Nullity matrix: one row per record, white gaps mark missing entries
    msno.matrix(df)

The picture above shows the amount and position of missing values. The idea is to capture not only the missing values, but also data sparsity. It is very intuitive to interpret, and it gives you an idea of how missing values are distributed within a variable and across variables. You can get an even more direct view of your missing values with this bar graph:
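A minimal sketch of how such a bar graph can be drawn, together with the same information as plain numbers. The helper names (missing_share, plot_missing_bar) and the tiny stand-in frame are my own assumptions; with the Titanic data you would pass df instead.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of missing entries per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)

def plot_missing_bar(frame: pd.DataFrame):
    """Draw the missingno bar chart (needs missingno and matplotlib installed)."""
    import missingno as msno  # imported lazily so missing_share works without it
    return msno.bar(frame)

# Tiny stand-in frame, only meant to show the shape of the output.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_share(sample))
```

Calling plot_missing_bar(df) produces the chart, while missing_share(df) gives the exact proportions behind each bar.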
Thanks to that representation, we can see that the ‘Cabin’ variable is missing roughly 77% of its values (687 of 891 rows), while ‘Age’ is missing just under 20%.
Furthermore, we might be interested in investigating the possible correlation among those missing values:
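A sketch of how that correlation can be inspected, both as a missingno heatmap and as a plain table. The helper names and the small stand-in frame are assumptions; the table is simply the Pearson correlation between the 0/1 missing flags of each pair of columns.

```python
import numpy as np
import pandas as pd

def nullity_corr(frame: pd.DataFrame) -> pd.DataFrame:
    """Correlation between the missing-value flags of each pair of columns."""
    null_flags = frame.isnull()
    # Columns that are never (or always) missing have no variance, so drop them.
    varying = null_flags.loc[:, null_flags.nunique() > 1]
    return varying.astype(int).corr()

def plot_nullity_heatmap(frame: pd.DataFrame):
    """Draw the missingno heatmap (needs missingno and matplotlib installed)."""
    import missingno as msno
    return msno.heatmap(frame)

# Stand-in frame; with the Titanic data you would pass `df` instead.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan, np.nan, "E46"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})
print(nullity_corr(sample))
```

A value close to 1 means two columns tend to be missing in the same rows, a value close to -1 means one tends to be present when the other is missing.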
As you can see, there seems to be a slightly positive correlation between the absence of values in Cabin and Age. However, examining the previous picture, we can see that there are 687 missing cabins against 177 missing ages…there is a clear imbalance in our data, hence the positive correlation might be trivial. Let’s have a deeper look at this correlation with another kind of visualization:
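A sketch of how this visualization can be produced, plus a small SciPy version of the same idea: cluster the columns by the similarity of their missing-value patterns. The helper names, the stand-in frame and the average-linkage choice are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

def nullity_linkage(frame: pd.DataFrame) -> np.ndarray:
    """Hierarchically cluster columns by their missing-value patterns."""
    # One row per column of the frame, one 0/1 flag per record.
    flags = frame.isnull().astype(int).to_numpy().T
    return hierarchy.linkage(flags, method="average")

def plot_nullity_dendrogram(frame: pd.DataFrame):
    """Draw the missingno dendrogram (needs missingno and matplotlib installed)."""
    import missingno as msno
    return msno.dendrogram(frame)

# Stand-in frame; with the Titanic data you would pass `df` instead.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan, np.nan, "E46"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "Ticket": ["A1", "B2", "C3", "D4", "E5", "F6"],
})
links = nullity_linkage(sample)
print(links)  # each row: the two clusters merged, their distance, new cluster size
```

Columns with identical missing patterns, such as the two complete columns in the stand-in frame, merge at distance zero.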
This is a dendrogram, which uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation (measured in terms of binary distance). If you want to learn more about hierarchical clustering, you can read my former article here.
So, how can we deal with our missing values in columns ‘Age’ and ‘Cabin’? Well, as shown in the pictures above, the Cabin variable is missing roughly 77% of its values, hence replacing them would mean relying on a feature which is mostly guessed. Furthermore, it wouldn’t make sense to compute the mean or median or any approximation derived from the non-missing values: indeed, to have a consistent approximation we should have a map of the boat to check all the cabin numbers. Namely, if we know that there are 3 cabins, but our data only shows cabins 1 and 2, it wouldn’t make sense to impute a value of 2.5 to the missing entry. Hence, this might be a case where deleting an entire feature wouldn’t affect the reliability of our potential algorithm that much.
What about the column ‘Age’? In that case, it seems more reasonable to impute a value, for instance the average age:
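Dropping such a sparse feature could look like the sketch below. The 0.7 threshold and the stand-in frame are assumptions; on the Titanic data, dropping the ‘Cabin’ column of df directly would do the same job.

```python
import numpy as np
import pandas as pd

# Stand-in frame; with the Titanic data you would use `df` instead.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

MAX_MISSING_SHARE = 0.7  # illustrative threshold, tune it to your own data

# Columns whose share of missing values exceeds the threshold.
too_sparse = sample.columns[sample.isnull().mean() > MAX_MISSING_SHARE]

# drop returns a new frame, so the original is left untouched.
reduced = sample.drop(columns=too_sparse)
print(list(reduced.columns))
```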
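A minimal sketch of that mean imputation with pandas, shown on a small stand-in frame (an assumption for illustration); the same two lines apply to the ‘Age’ column of df.

```python
import numpy as np
import pandas as pd

# Stand-in frame; with the Titanic data you would use `df` instead.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Series.mean skips NaN entries by default, so only observed ages are averaged.
mean_age = sample["Age"].mean()

# Replace every missing age with that average.
sample["Age"] = sample["Age"].fillna(mean_age)

# Check that no gaps are left in the column.
print(sample["Age"].isnull().sum())
```

Using the median instead of the mean would be a one-word change and is less sensitive to extreme ages.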
Now, if we have a look once more at the bar graph of missing values:
We can see that all the entries are filled.
In this article, I’ve been dwelling mainly on the visual representation of missing values and their easiest imputation strategy. However, there are plenty of further and deeper analyses you could perform on your columns. To give an example, we could have managed our Age feature by building a regression model to predict the missing ages, or even a K-NN algorithm, which weights samples using the mean squared difference on features for which two rows both have observed data (we will see an implementation later on).
If you want to dig deeper into these more complex imputation techniques, I recommend these two further readings about some useful Python tools (the imputers in scikit-learn and the fancyimpute package):
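As a rough idea of the K-NN approach, here is a sketch based on scikit-learn's KNNImputer, which measures distances only on the features that both rows have observed. The stand-in frame, the choice of columns and n_neighbors=2 are assumptions for illustration, not a tuned setup.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Stand-in frame with numeric columns only; one age is missing.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Pclass": [3, 1, 3, 1, 1, 3],
})

# Each gap is filled with the average of that feature over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(sample),
    columns=sample.columns,
    index=sample.index,
)
print(filled)
```

Since the distances mix features on different scales, standardizing the columns before imputing is usually worth considering.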