On the Importance of Data in Training Machine Learning Algorithms — Part Two

Karthik M Swamy · Published in Analytics Vidhya · Jul 2, 2021 · 5 min read

In part one of this blog series, we discussed the motivation for exploring the characteristics of data, introduced the BestBuy dataset that we are going to work with, and looked at its distribution and characteristics.

Experimental Setup

In this post, we will delve a bit deeper into the effect of training with increasingly more records. As discussed earlier, we will keep the classifier and the test records the same while we vary the number of records used for training.

Once we have done this for one classifier, we will change the classifier and repeat the steps. This ensures we can confirm the impact of the number of records available for training on the final metrics across different algorithms.

Given that we have ~48000 records in our BestBuy dataset, we will shuffle the dataset and keep ~8000 records for testing. We can easily achieve this using pandas as below:

"""
Splits a panda dataframe in two. Test split remains a constant.
"""
dataset = dataset.sample(frac=1.0, random_state=1729)test_dataset = dataset[40000:]train_dataset = dataset[:num_train]

By setting random_state to a constant, we ensure that the test set is shuffled but also identical every single time the dataset is loaded by this method. Do you have any suggestions to do this differently?
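One alternative, if you prefer not to slice by index, is scikit-learn's train_test_split with a fixed random_state. The sketch below is only illustrative; the test size of 8000 mirrors the split above and num_train is assumed to be defined as before.

from sklearn.model_selection import train_test_split

# Hold out a fixed, reproducible test set of ~8000 records
train_pool, test_dataset = train_test_split(
    dataset, test_size=8000, shuffle=True, random_state=1729
)
# Vary only the number of training records drawn from the remaining pool
train_dataset = train_pool[:num_train]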

Training on the Dataset

As mentioned in the previous post, we will use TensorFlow Decision Forests to train different algorithms: Random Forest, Gradient Boosted Decision Trees and a CART model. Let’s quickly look at what TensorFlow Decision Forests has to offer:

TensorFlow Decision Forests (TF-DF) is a collection of Decision Forest (DF) algorithms available in TensorFlow. Decision Forests work differently than Neural Networks (NN): DFs generally do not train with backpropagation, or in mini-batches. Therefore, TF-DF pipelines have a few differences from other TensorFlow pipelines.
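For reference, the three algorithms compared in this series are exposed as Keras-style model classes in TF-DF. A minimal sketch of instantiating them with default hyperparameters, purely for illustration:

import tensorflow_decision_forests as tfdf

# The three decision-forest learners compared in this series
rf_model = tfdf.keras.RandomForestModel()
gbdt_model = tfdf.keras.GradientBoostedTreesModel()
cart_model = tfdf.keras.CartModel()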

One of the key features of the TF-DF framework is that it is not training a neural network. What this naturally implies is that it does not fall into the trap of overfitting to the data due to overparameterization. A consequence of this is that the data need not be split into train, validation and test sets; train and test sets suffice. A validation set is not required because the framework does not have to monitor whether it might overfit to the data on which it is being trained.

Bear in mind that if you are doing hyperparameter optimisation yourself using the TF-DF algorithms, then it might still be reasonable to split the data into train, validation and test sets.

The next feature of this framework is that there is no requirement to shuffle the dataset or meddle with batch sizes. Unlike conventional neural networks, these algorithms do not learn from the data in batches but instead read the complete dataset into memory. While this might be an issue for larger datasets, it offers considerable advantages in training time for datasets that fit into memory.
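In practice this means the train and test DataFrames from the split above can be converted and fed to the model whole, with no batch_size, shuffle or validation_data arguments. A minimal sketch, assuming a label column named "category" (the actual label column of the BestBuy dataset may differ):

import tensorflow_decision_forests as tfdf

# Convert the whole pandas DataFrames to TF datasets; no manual batching or shuffling
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_dataset, label="category")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_dataset, label="category")

model = tfdf.keras.RandomForestModel()
model.compile(metrics=["accuracy"])
model.fit(train_ds)  # no validation_data needed
accuracy = model.evaluate(test_ds, return_dict=True)["accuracy"]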

For training on large datasets, the best practice suggested by the authors of TF-DF is to first use a subset of the training data to see the impact on performance and memory. The rationale behind such an approach is the diminishing returns of increasing the size of the dataset.
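This is essentially what our experiment does. Below is a rough sketch of the loop used here, not the exact code behind the results (that comes in part three); the subset sizes, the timing approach and the "category" label column are illustrative assumptions.

import time
import tensorflow_decision_forests as tfdf

results = []
for num_train in [1000, 2000, 4000, 8000, 16000, 32000]:  # illustrative sizes
    subset_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
        dataset[:num_train], label="category"  # assumed label column, as above
    )
    model = tfdf.keras.RandomForestModel()  # swap in GBDT or CART to repeat the study
    model.compile(metrics=["accuracy"])

    start = time.time()
    model.fit(subset_ds)
    train_time = time.time() - start

    # test_ds is the fixed test set from the earlier sketch
    accuracy = model.evaluate(test_ds, return_dict=True)["accuracy"]
    results.append((num_train, train_time, accuracy))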

Some of the other points noted by the authors are the following:

  • Do not transform the data with feature columns
  • Do not preprocess the features
  • Do not normalise numerical features
  • Do not encode categorical features
  • Do not replace missing features by magic values

While the points above might seem completely contradictory to what is taught in machine learning courses, the reason the authors give is that the framework has in-built capabilities to perform all the transformations, normalisations and preprocessing required by each algorithm, making your own preprocessing steps redundant.
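In other words, raw columns can be passed straight through. The toy frame below (made-up column names, not the BestBuy schema) sketches this: unscaled numerics, string categories and a missing value all go in as-is.

import pandas as pd
import tensorflow_decision_forests as tfdf

# Made-up example frame: nothing is normalised, encoded or imputed
df = pd.DataFrame({
    "price": [12.99, 499.0, None],            # raw numeric with a missing value
    "department": ["audio", "tv", "mobile"],  # raw string categories
    "popular": [1, 0, 1],                     # illustrative binary label
})

raw_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="popular")
tfdf.keras.RandomForestModel().fit(raw_ds)    # TF-DF handles the rest internally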

TL;DR — Show me the Results

If you have managed to read this post up to this point, then congratulations. You have waited and shall be rewarded. We will look at the results and then see how the algorithms were built to achieve them.

The first comparison that we will see is the impact of the number of records on each of the three algorithms that can be trained with TF-DF.

Impact on Train Time with Number of Records

In the plot above, we see that as the number of training records grows, Gradient Boosted Decision Trees take significantly longer to train, while the Random Forest model's training time remains reasonable and grows roughly linearly. CART, on the other hand, has a very low training time, making it a viable candidate against neural networks if its performance metrics are also competitive enough.

Impact on Accuracy of Model with Number of Records

In the plot above, we see the comparison of the metric (here accuracy) against the number of records. It is fascinating to note how similarly all three algorithms perform when the number of records is quite low, giving the CART model quite an edge given how quickly it can be trained. As the number of records increases, we can see that GBDT and RF begin to pull ahead. With 32000 records, Random Forest emerges as the winner when both the accuracy metric and the required training time are considered.

Building the Models and Plots

We have looked at the plots and scores of the different models with the BestBuy dataset in this post. In the next post, we will look at how to build these models and the plots. We will also come up with a new metric in order to evaluate how sustainable a machine learning model is when training with such datasets.

Do you have any ideas already? Would you like to look at different plots to evaluate performance of these algorithms? Let me know what you think.

What Next?

I will share the code and how I structured these experiments in my next blog. If you are interested in looking at that, you may head over to part three of this series.
