When you are dealing with a Machine Learning task, you have to properly identify your problem so that you can pick the most suitable algorithm. As a first step, you can categorize your task as either supervised or unsupervised and, if supervised, as either classification or regression (you can read more about it here).

However, this does not lead to a unique solution, since multiple algorithms exist for each category of learning. That’s why we need to evaluate the performance of different models, compare them and pick the algorithm with the highest score.

To do so, we have to decide how each model is trained and scored. There are multiple ways to train an algorithm on our data and evaluate it, some less reliable than others. Hence, in this article I’m going to dwell on three of them, from the least reliable to the most reliable, which is, as the title gave away, Cross-Validation.

So let’s see all of them.

Training the algorithm on the entire dataset

This is by far the least reliable way of training. Since the main goal of ML is to correctly predict new, unseen data, training and scoring our algorithm on the entire dataset tells us nothing about how well the model generalizes. The model is tested on the same data on which it was built, which sounds like ‘cheating’: the algorithm already knows the answers to those questions, so it will get a high score, but that score says little about how it will perform on new data.

As the picture above shows, our whole dataset (the blue rectangle) ends up being the train set, and our model (in this example, a linear regression) is trained on all of it.
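To make the ‘cheating’ concrete, here is a minimal sketch in plain Python. It is not the linear regression from the picture: it swaps in a 1-nearest-neighbour classifier, because that model simply memorizes its training data and so makes the effect easy to see. The toy data and the helper names (make_points, predict_1nn, accuracy) are assumptions made up for this illustration.

```python
import random

def make_points(n, rng):
    """Toy data: label is 1 when x > 0.5, with 20% of labels flipped as noise."""
    points = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        points.append((x, label))
    return points

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, points):
    """Share of `points` whose label the 1-NN model predicts correctly."""
    return sum(predict_1nn(train, x) == y for x, y in points) / len(points)

rng = random.Random(0)
dataset = make_points(200, rng)   # used for training AND scoring
unseen = make_points(200, rng)    # drawn the same way, never shown to the model

train_score = accuracy(dataset, dataset)
new_score = accuracy(dataset, unseen)
print(f"score on the data used for training: {train_score:.2f}")
print(f"score on new data:                   {new_score:.2f}")
```

The first score is perfect by construction, since every point is its own nearest neighbour, while the score on fresh data is lower. The gap is exactly what scoring on the training data hides.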

Splitting data into train set and test set

This technique is a great step forward. The idea is to use a portion of our dataset (commonly around 75%) to train our algorithm, and the remaining portion to test the performance of the resulting model. By doing so, we are testing our model on data that it has never seen before. In this way, the evaluation score will penalize those models which, even if they performed well on the train set, made poor predictions on the test set. It is worth stressing this point: we are not interested in a model with 100% accuracy on the train set if it is unable to correctly predict new data.

Now, the model is trained and built only on the green dots, while it is tested on the orange ones.
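A split like this can be sketched in a few lines of plain Python. The snippet below is only an illustration under stated assumptions: the synthetic data, the 75/25 proportion, and the helper names (split_train_test, fit_line, r2_score) are invented for this example and do not come from any library.

```python
import random

def split_train_test(data, train_fraction=0.75, seed=0):
    """Shuffle a copy of `data` and cut it into a train part and a test part."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2_score(model, points):
    """Coefficient of determination of the fitted line on `points`."""
    slope, intercept = model
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot

# Synthetic data (an assumption): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)))

train, test = split_train_test(data)   # 75 points to train, 25 to test
model = fit_line(train)                # the model only ever sees `train`
test_r2 = r2_score(model, test)        # the score comes from unseen points
print(f"train size: {len(train)}, test size: {len(test)}")
print(f"R^2 on the test set: {test_r2:.3f}")
```

The important detail is that fit_line receives only the train part, so the reported score measures predictions on points the model was not built on.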

This method is a great improvement with respect to the previous one, but it is not the most reliable. It does not take into account the fact that we are choosing only one of the many possible ways in which our dataset can be split: how can we be sure that the score we got is not just the result of a lucky (or unlucky) split? Plus, how can we be sure that the train set we picked is representative of the whole dataset?
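A quick way to see the problem is to repeat the same 75/25 split with different shuffles and compare the results. The sketch below does this on a small synthetic dataset; the data, the through-the-origin model, and the helper names (fit_slope, mean_abs_error) are assumptions chosen to keep the example short. Note that it reports an error, so lower is better.

```python
import random

def fit_slope(points):
    """Least-squares slope of a line through the origin, y = slope * x."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

def mean_abs_error(slope, points):
    """Average absolute prediction error of y = slope * x on `points`."""
    return sum(abs(y - slope * x) for x, y in points) / len(points)

# Small, noisy synthetic dataset (an assumption): y = 2x plus Gaussian noise.
rng = random.Random(1)
data = []
for _ in range(40):
    x = rng.uniform(1, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 3.0)))

errors = []
for seed in range(10):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)      # a different split each time
    train, test = shuffled[:30], shuffled[30:]  # 75% train, 25% test
    errors.append(mean_abs_error(fit_slope(train), test))

print(f"lowest test error:  {min(errors):.2f}")
print(f"highest test error: {max(errors):.2f}")
```

Same data, same model, same proportions: only the shuffle changes, yet the test error moves. A single split gives us one of these numbers without telling us where it sits in that range.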

Actually we can’t. That’s why we need to introduce a third procedure.

Cross-Validation

The idea of cross-validation arises because of the caveat explained above. Its aim is to make the score of our model depend as little as possible on the way we picked the train and test set.

It works as follows. It splits our dataset into K folds, then the model is trained on K-1 folds and tested on the remaining one, for K iterations. At each iteration a different fold plays the role of the test set, so the model is trained and tested on a new composition of data every time, and every observation ends up in the test set exactly once.
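The rotation of the folds can be sketched with indices alone. The helper below (k_fold_splits, a name made up for this illustration, not a library function) is one simple way to build K nearly equal folds and yield the K train/test combinations.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        # Everything that is not in the current test fold is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for round_number, (train, test) in enumerate(k_fold_splits(n_samples=10, k=5), start=1):
    print(f"round {round_number}: test on {sorted(test)}, train on {len(train)} samples")
```

With 10 samples and K = 5, each round tests on 2 samples and trains on the other 8, and across the 5 rounds every sample is tested exactly once.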

This gives us a score which is far less affected by the initial splitting decision, and therefore a more trustworthy estimate of how well the model generalizes. If there are iterations where the model performed poorly (which, had we relied on that single split, would have led us to underestimate the model’s robustness), they are balanced by iterations where the model performed well (which, taken alone, would have led us to overestimate it). The output of our validation is a vector of K scores, and the final score of the model is simply their mean.
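Putting the pieces together, the sketch below computes the vector of K scores for two candidate models and compares their means. Everything in it is an assumption made for illustration: the synthetic data, the two toy models (a constant baseline and a least-squares line), the choice of R^2 as the score, and the helper name cross_val_scores, which is written here from scratch and is not a library call.

```python
import random
from statistics import mean

def fit_constant(points):
    """Baseline model: always predict the mean of the training targets."""
    avg = mean(y for _, y in points)
    return lambda x: avg

def fit_line(points):
    """Least-squares line y = slope * x + intercept."""
    mean_x = mean(x for x, _ in points)
    mean_y = mean(y for _, y in points)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return lambda x: slope * x + (mean_y - slope * mean_x)

def r2(predict, points):
    """Coefficient of determination of `predict` on `points`."""
    mean_y = mean(y for _, y in points)
    ss_res = sum((y - predict(x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot

def cross_val_scores(fit, data, k=5, seed=0):
    """Train on K-1 folds, score on the held-out fold, for each of the K folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(r2(fit(train), test))
    return scores

# Synthetic data (an assumption): y = 3x + 2 plus Gaussian noise.
rng = random.Random(7)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + 2.0 + rng.gauss(0, 2.0)))

for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    scores = cross_val_scores(fit, data)
    print(f"{name}: mean R^2 over {len(scores)} folds = {mean(scores):.3f}")
```

Each model is judged by the average of its K held-out scores rather than by one split, so the comparison between the two does not hinge on which observations happened to land in a single test set.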

Model selection is a crucial part of an ML pipeline, and it is important to pick the model which offers good performance on new data.


Published by valentinaalto

I'm a 22-years-old student based in Milan, passionate about everything related to Statistics, Data Science and Machine Learning. I'm eager to learn new concepts and techniques as well as share them with whoever is interested in the topic.
