Most people (including me) tend to think that the core activity of a Machine Learning project is, without doubt, building the algorithm itself: working with actual data, inferring their structure and making predictions. Yet a survey from some years ago found that data scientists normally spend 80% of their time inspecting, cleaning and organizing their data, devoting only the remaining 20% to building the actual algorithm. This situation, known as the 80/20 rule, is something practitioners are now trying to break in order to improve efficiency, yet data preparation still represents more than half of the work.
The science of preparing your data so that your algorithm can perform at its best is called feature engineering. It includes not only adjustments to existing features, but also the creation of new ones which might contain new, relevant information.
Throughout this article, I’m going to examine some practices which might help you select the most meaningful features. Then I’ll dwell on some techniques to make your selected features work better for your algorithm.
So let’s start with some strategies:
- Features must be related to the objective. The art of summarizing relevant information plays an important role in data science too. It would be counterproductive to keep redundant or meaningless features in our model: they would make it far too complex and heavy to train. Of course, what makes a feature relevant depends on the objective. For instance, let’s say we want to build a model which predicts the probability of having a specific disease. We have thousands of patients’ medical records, but do we need everything in them? If we are interested in lung diseases, we won’t look at a knee MRI scan… but if we are looking for indicators of obesity, it might become relevant.
- Features must be known at prediction time. It might sound obvious, but it often happens that, while collecting data, we forget that predictions will be made at a later moment, hence we have to make sure the data we collected will also be available at that time. Let’s consider this example: you want to build a model which predicts the dividend your company is going to pay to its shareholders. You will use as features the number of employees, the hours worked and the profit from the last balance sheet. If prediction time is at the end of the year, you are lucky, since your company’s profit will be published with its balance sheet. However, if you want to repeat your analysis 6 months later, you will have only 2 features out of 3, since the profit for that period won’t be available yet.
- Features should be numeric with a meaningful magnitude. Algorithms are mathematical models and, as such, consist of a series of mathematical operations. Consequently, they won’t be able to train on features which are not numeric. This does not mean that categorical features can’t be used in training: they simply need to be transformed through a process called one-hot encoding, which I’m going to explain later on. Furthermore, features should have a meaningful magnitude: if one observation exhibits a value of 1 and another a value of 2, the latter is twice as important (or big) as the former. Some variables, like order IDs, are numeric but do not have a meaningful magnitude (if an order has an ID of 120, it doesn’t mean it is twice as important as the order with an ID of 60).
- Features should have enough examples. The idea is that, if a feature’s values are poorly assorted and unbalanced, then our algorithm will be biased. Imagine we want to predict, from a picture, which fruit we are looking at: among our features we selected ‘color’, but it turns out that this feature exhibits only ‘green’ values. The model won’t be able to generalize its predictions once applied to new data, hence we’d rather exclude this feature from our analysis.
All of these approaches refer to the pre-processing phase: in this step, you are hardly touching your data, you are only deciding, based on some criteria, which features to pick.
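Some of the criteria above can be turned into a quick screening pass over your columns. Here is a minimal sketch in plain Python, under a few assumptions of mine: the toy records, the column names and the rule that identifier columns end in `_id` are all made up for illustration, and relevance to the objective or availability at prediction time still has to be judged by you.

```python
# Hypothetical toy records; column names and values are made up for illustration.
rows = [
    {"order_id": 60, "color": "green", "weight_g": 150.0},
    {"order_id": 120, "color": "green", "weight_g": 118.5},
    {"order_id": 121, "color": "green", "weight_g": 131.2},
]


def screen_features(rows):
    """Return a dict mapping each column name to a list of warnings."""
    report = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        warnings = []
        if not all(isinstance(v, (int, float)) for v in values):
            warnings.append("non-numeric: needs encoding")
        if len(set(values)) == 1:
            warnings.append("constant: only one distinct value")
        if name.endswith("_id"):
            # Assumed naming convention: IDs are numeric but carry no magnitude.
            warnings.append("identifier: no meaningful magnitude")
        report[name] = warnings
    return report


report = screen_features(rows)
for name, warnings in report.items():
    print(name, warnings)
```

In this toy data, ‘color’ is flagged twice (it is categorical and it only ever takes the value ‘green’), `order_id` is flagged as an identifier, and `weight_g` passes.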
Now let’s see some modifications you can apply to your features.
- Imputing values. While collecting data, you may end up with some missing values. In this case, you can either delete all the entries which exhibit one or more NA values or impute some new, relevant values. A first approach might be replacing them with the mean or median of the feature. Note that, sometimes, a subtle situation might occur: if you are facing a feature with 0-1 binary values, it is possible that values which are not available are represented with a 0. The risk is treating that 0 as an actual value and not as a missing one. So, in this case, a good approach is creating a new column (again, a 0-1 column) where 1 means the value of the former column is available, whereas 0 means it is not.
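Both ideas, filling the gaps with the median and keeping a 0-1 availability column, can be sketched in a few lines of plain Python. The `ages` values and the function name `impute_with_indicator` are hypothetical, and I am assuming missing entries are stored as `None`; a data-frame library would offer equivalent operations.

```python
from statistics import median


def impute_with_indicator(values):
    """Replace None with the median of the observed values.

    Also returns a 0-1 list where 1 means the original value was available.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    available = [0 if v is None else 1 for v in values]
    return imputed, available


# Hypothetical feature with two missing entries.
ages = [34, None, 29, 51, None]
ages_imputed, ages_available = impute_with_indicator(ages)

print(ages_imputed)    # [34, 34, 29, 51, 34]
print(ages_available)  # [1, 0, 1, 1, 0]
```

Keeping the indicator column means the model can still tell an imputed 34 apart from an observed 34.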
- Dealing with categorical values. As anticipated, we cannot feed our algorithm with non-numeric values. Assigning them numerical classes, though, might be useless too if those numbers do not have a meaningful magnitude.
Suppose we have a feature called ‘Color’ which exhibits three values: Green, Yellow and Red. These values are categorical, and we might decide to encode them as 1, 2 and 3. However, that would be meaningless, since it is not true that Yellow (equal to 2) is two times Green (equal to 1). Alternatively, we can employ what I mentioned above, one-hot encoding, which gives each category its own 0-1 column.
By doing so, we are creating three new columns, and now each value is represented by a vector, where the position of the number 1 indicates which category it is. There is no hierarchical relation between [1 0 0] and [0 1 0].
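To make the encoding concrete, here is a minimal sketch in plain Python. The order of the columns (Green, Yellow, Red) is my assumption, chosen so that Green maps to [1 0 0] and Yellow to [0 1 0]; the sample `colors` list is hypothetical.

```python
def one_hot(values, categories):
    """Encode each value as a 0-1 vector with a single 1 at its category's position."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Assumed column order for the three new columns.
categories = ["Green", "Yellow", "Red"]

# Hypothetical observations of the 'Color' feature.
colors = ["Green", "Red", "Yellow", "Green"]
encoded = one_hot(colors, categories)

for color, vector in zip(colors, encoded):
    print(color, vector)
# Green [1, 0, 0]
# Red [0, 0, 1]
# Yellow [0, 1, 0]
# Green [1, 0, 0]
```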
- Handling outliers. Outliers are values which are far away from the mean. They can be univariate (when an observation represents an anomalous value for one feature only) or multivariate (when multiple features of that observation are far from the mean). The first step you can take to identify outliers is visualizing your data: boxplots, scatterplots and histograms are useful tools to do so. A bit more complicated is investigating the cause of the anomalous value, yet this is the information you need in order to decide whether to keep or drop your outlier. Indeed, the anomaly might be caused by an error while gathering data. For instance, if you collected some physical features of men working in company X, and you observe in your ‘height’ feature a value of 7 meters… well, you can assume that this is an error. On the other hand, if your observation exhibits a genuine anomaly, you might be interested in keeping this information so that the model can incorporate it and detect future anomalies. Some techniques you can use are log-transformation (so that you reduce the variance and do not undermine your model) or, with more effort, extracting all of your outliers and building a specific model on them.
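Besides plotting, you can flag candidate outliers numerically. The sketch below uses the interquartile-range rule, which is the same rule a boxplot typically uses to draw its whiskers; note that this is a technique I am swapping in, since it measures distance from the quartiles rather than from the mean. It then applies a log-transformation to a skewed feature. The `heights` and `incomes` values are hypothetical, and the threshold `k=1.5` is a common convention, not a requirement.

```python
import math
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical heights in meters; 7.0 is the kind of entry error described above.
heights = [1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.82, 7.0]
suspects = iqr_outliers(heights)
print(suspects)  # [7.0]

# Hypothetical, strictly positive feature with one genuine extreme value.
# The natural log compresses large values, so the extreme one weighs less.
incomes = [20_000, 35_000, 50_000, 1_000_000]
log_incomes = [math.log(v) for v in incomes]
print(log_incomes)
```

The rule only tells you which values to look at: whether 7.0 is dropped as an error or kept as a genuine anomaly is still your call.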
- Feature crosses. This technique is a bit less intuitive than the previous ones, yet it’s incredibly powerful for gathering and summarizing relevant information. The idea of feature crosses is that from existing variables we can create new features (as a transformation of existing features) which contain much more information and, specifically, information which is more relevant for the task. Let’s walk through an example.
Suppose we plot two classes, blue dots and red dots, against two features x1 and x2, and the two classes form two well-separated clusters. In such a situation, we can easily separate our two clusters linearly (note: when I say linearly separable, I mean that there exists a linear combination of my two features which segregates my classes well). Now consider a different layout: the blue dots lie where x1 and x2 have the same sign, and the red dots lie where they have different signs.
Here, it is not possible to find a linear combination of x1 and x2 such that blue and red dots are segregated. But what happens if we introduce a new feature x3, defined as the product of x1 and x2? Well, we are now able to define a rule which can segregate our classes.
We can conclude that, if the sign of x3 is positive (so x1 and x2 have the same sign) the class is blue, whereas if the sign of x3 is negative (so x1 and x2 have different signs) the class is red.
As you can see, we transformed a non-linear task into a linear task by introducing a new feature, and we came up with an incredibly simple decision rule.
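The decision rule is short enough to write down directly. In this sketch the four points are hypothetical, placed so that blue dots have x1 and x2 of the same sign and red dots have opposite signs; the function names `cross` and `predict` are mine.

```python
# Hypothetical points: (x1, x2, true class).
points = [
    (1.0, 2.0, "blue"),
    (-1.5, -0.5, "blue"),
    (2.0, -1.0, "red"),
    (-0.5, 3.0, "red"),
]


def cross(x1, x2):
    """The crossed feature x3, defined as the product of x1 and x2."""
    return x1 * x2


def predict(x1, x2):
    """Classify using only the sign of x3."""
    return "blue" if cross(x1, x2) > 0 else "red"


predictions = [predict(x1, x2) for x1, x2, _ in points]
print(predictions)  # ['blue', 'blue', 'red', 'red']
```

A single threshold on x3 does the job that no straight line in the (x1, x2) plane could do.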
- Binning. This procedure can be seen as a ‘clustering’ of your features’ values. If a feature exhibits continuous values but its contribution does not depend on every single value, but rather on a group of values, it can be efficient to aggregate, or bin, them into larger classes. Let’s consider the ‘Temperature’ feature of R’s built-in ‘airquality’ dataset.
We might decide that small variations in temperature do not affect our label much, so grouping those data into four classes would not cost the model much information, while making the Temperature feature far easier to handle.
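Here is what binning into four classes could look like in plain Python. The cut points, the class labels and the sample temperatures are all assumptions of mine (the values are illustrative readings in degrees Fahrenheit, not actual rows of the dataset); in practice you would pick edges that make sense for your label.

```python
from bisect import bisect_right

# Hypothetical cut points (degrees Fahrenheit) and class labels.
# Resulting bins: below 65, 65 to below 75, 75 to below 85, 85 and above.
EDGES = [65, 75, 85]
LABELS = ["cool", "mild", "warm", "hot"]


def bin_temperature(temp):
    """Map a continuous temperature to one of four classes."""
    return LABELS[bisect_right(EDGES, temp)]


# Illustrative temperatures, not the real dataset rows.
temps = [56, 67, 74, 81, 90, 97]
binned = [bin_temperature(t) for t in temps]
print(binned)  # ['cool', 'mild', 'mild', 'warm', 'hot', 'hot']
```

Since the four classes are ordered, you can keep them as ordinal values; if you prefer to treat them as plain categories, one-hot encoding applies here too.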
- Scaling. In many previous articles, I’ve stressed the importance of working with values which lie on the same scale. Indeed, we might have a feature with decimal values which will be handled together with features whose values are of the order of thousands. Hence, a good practice is normalizing or standardizing your data before processing them, so that their different scales will not affect the accuracy of your model (note that scaling your features can also help ‘keep quiet’ possible outliers). The difference between the two is this: normalization rescales each value to the range between 0 and 1 by subtracting the feature’s minimum and dividing by the difference between its maximum and minimum, while standardization subtracts the feature’s mean and divides by its standard deviation, so that the result has mean 0 and standard deviation 1.
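The two rescalings can be compared side by side with a minimal sketch. I am assuming the usual definitions here: normalization as min-max scaling to the [0, 1] range, and standardization as subtracting the mean and dividing by the (population) standard deviation. The `salaries` values are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # A constant feature (hi == lo) would divide by zero and needs special handling.
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: the result has mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    # As above, a constant feature has sigma == 0 and cannot be standardized.
    return [(v - mu) / sigma for v in values]


# Hypothetical feature with values of the order of thousands.
salaries = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
normalized = normalize(salaries)
standardized = standardize(salaries)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)
```

When you train a model, compute the minimum, maximum, mean and standard deviation on the training data only, then reuse those same numbers to scale new data at prediction time.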
This is not an exhaustive list of all the feature manipulations you can carry out. While inspecting your data, you will realize that this list is potentially limitless, since you can be extremely creative and adjust your features according to your task. What is important to keep in mind is that, with simple and quick changes, you can greatly improve your model and, most importantly, make your task far easier to handle.