Effects of copy-pasting training data for training deep neural networks

Karthik M Swamy
3 min read · Oct 31, 2021

Introduction

One of the most important aspects of training a deep neural network (DNN) is the data available for training. Typically, the data science team tells users that a large dataset is needed, but fails to clarify what "large" actually means. In this post, we will see what it means, what users typically do to work around the requirement, and the implications of such an action.


Background

Deep learning algorithms learn from large datasets by consuming them in mini batches. Typically, these mini batches are sized in multiples of 32, up to a maximum of around 1024 instances, and the instances in each mini batch are chosen at random from the training set.
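
As a minimal sketch of this, assuming TensorFlow and a synthetic dataset (the feature and label arrays here are made up for illustration), mini batches of 32 randomly shuffled instances can be drawn as follows:

```python
import numpy as np
import tensorflow as tf

# Hypothetical dataset: 5000 instances with 20 features and binary labels.
features = np.random.rand(5000, 20).astype("float32")
labels = np.random.randint(0, 2, size=(5000,))

# Shuffle the full dataset and draw mini batches of 32 instances each;
# reshuffling every epoch keeps the batch composition random.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=5000, reshuffle_each_iteration=True)
    .batch(32)
)

for batch_features, batch_labels in dataset.take(1):
    print(batch_features.shape)  # (32, 20)
```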

When the data available is split into train, validation, and test sets, this is most often done with an 8:1:1 ratio. For example, a dataset of 5000 instances yields 4000 instances for training, 500 for validation, and 500 for test. With only 4000 training instances and a large number of parameters, a DNN can easily overfit to the available training data.
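
As a small sketch, assuming scikit-learn and a synthetic dataset of 5000 instances, the 8:1:1 split can be produced like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 5000 instances (features only, for illustration).
data = np.random.rand(5000, 20)

# First split off 20% as a holdout, then divide the holdout evenly into
# validation and test, giving the 8:1:1 ratio: 4000 / 500 / 500 instances.
train, holdout = train_test_split(data, test_size=0.2, random_state=42)
validation, test = train_test_split(holdout, test_size=0.5, random_state=42)

print(len(train), len(validation), len(test))  # 4000 500 500
```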

What do users do?

When data science teams suggest having a large dataset of around 10000 instances for training, users often take this as the literal number of instances required to train a DNN. Hence, users often end up copy-pasting data in order to fulfil such a requirement.

Impact of copy-pasting training data

When the data science team says that DNNs learn better with large training data, what they actually mean is a dataset with many distinct instances. The reason this works is that distinct instances provide new information for the neural network to exploit and learn from.

However, when users simply copy-paste training data, they add the same instances again and provide no new information for the neural network to learn from. Training a DNN on copy-pasted data therefore only makes the network look at the same data more times each epoch, with nothing new to learn. The result is a DNN that trains for longer with no improvement.

In the data split example above, if a user copy-pastes the 4000 training instances 2.5 times to reach 10000 instances, the DNN looks at the same data 2.5x more each epoch. In other words, the DNN would effectively be training 2.5 epochs every epoch, resulting in little or no further improvement in model metrics. A slight improvement could be seen owing to the sheer number of parameters available for the DNN to exploit better features.
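
A minimal sketch of this effect, assuming NumPy and treating the 4000 training instances as ids, shows that "copy-pasting" up to 10000 instances simply repeats each instance 2 or 3 times, 2.5 times on average:

```python
import numpy as np

# Hypothetical train set of 4000 instance ids, "copy-pasted" 2.5x to reach 10000.
train_ids = np.arange(4000)
duplicated = np.concatenate([train_ids, train_ids, train_ids[:2000]])

print(len(duplicated))  # 10000

# Each original instance now appears 2 or 3 times, so one pass over the
# duplicated set shows the model the same examples ~2.5 times on average,
# without contributing any new information.
counts = np.bincount(duplicated)
print(counts.min(), counts.max(), counts.mean())  # 2 3 2.5
```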

Similarly, copy-pasting would result in no improvement for algorithms that are not DNN-based. Examples of such methods include Random Forests, Decision Trees, XGBoost, CatBoost and so on. These methods, unlike DNN-based methods, do not use mini batches for training.

How can business or data teams evaluate what a large dataset is?

The age-old question one often deals with when working with DNNs is how many instances are required to train the DNN to good performance metrics. A good ballpark answer is around 1000 instances per category being trained. This means that if we are training a classifier to detect 10 classes, we need a minimum of 10000 training instances, with each class contributing at least 1000 instances.
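
As a quick sketch of applying this ballpark, assuming NumPy and hypothetical labels for a 10-class problem, the per-class counts can be checked as follows:

```python
import numpy as np

# Hypothetical labels for a 10-class problem; the ballpark above asks for
# at least 1000 instances per class, i.e. around 10000 in total.
num_classes = 10
labels = np.random.randint(0, num_classes, size=(10000,))

per_class_counts = np.bincount(labels, minlength=num_classes)
enough_data = (per_class_counts >= 1000).all()

print(per_class_counts)
print("Meets the ~1000-per-class ballpark:", enough_data)
```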

Summary

Copy-pasting training data leads to no significant improvement in the model's performance metrics and only increases the training time. While copy-pasting might seem to solve the short-term issue of insufficient data for training DNNs, it provides no added value towards training a better model.

The best possible way to improve the performance metrics, and thus train a better model, is to collect training data that is genuinely new to the DNN.


Karthik M Swamy

Sr. Data Scientist at SAP, Google Developer Expert in Machine Learning