Is data the most important part of your machine learning model?

Alifia Ghantiwala
6 min read · Apr 21, 2022


Introduction

A machine learning system is made up of three main components:

  1. Data
  2. Model/code
  3. Hyperparameters

In most cases, especially in research and academia, we tend to keep our data fixed and try out different models to find the one that works best on that data. This is the model-centric approach.

Another approach, sometimes used in live production systems, is the data-centric approach: hold the model fixed and iteratively improve the quality of the data, which in turn improves model performance.

In this article, we will discuss how to improve model performance through data augmentation, error analysis, and labeling consistency. In other words, we will follow the data-centric approach: improve the data while keeping the model and hyperparameters fixed.

Improving model performance through data augmentation

What is data augmentation?

It is an approach for increasing the volume and diversity of your data. Let me explain.

Suppose you have an input image as below:

The image is of Mr. APJ Abdul Kalam, former president of India.

Using data augmentation, we can create several different images from this single image, as shown in the next example.

I used different rotation angles, zoomed in and out, and flipped the image horizontally to create multiple images from a single one.

This can be very useful: with faces, your model may receive photos taken from different angles, and if it has been trained on such data, prediction becomes easier.

Data augmentation is a boon when you do not have very large datasets at your disposal. Even a simple deep learning model needs a good amount of input data, and data augmentation is an easy way to get it.

For the above transformation, I have used the data augmentation methods available in Keras.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# height and width are the target image dimensions used throughout this example
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal",
                          input_shape=(height, width, 3)),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ]
)

In the above code, RandomFlip with the argument “horizontal” flips the image horizontally, and RandomRotation rotates the image: you pass a negative factor to rotate clockwise and a positive factor to rotate anti-clockwise.

RandomZoom zooms the input.
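As a quick sanity check, you can apply this pipeline repeatedly to a single image and look at the variations it produces. This is only a sketch: it assumes data_augmentation, height, and width are defined as above, and the image path is a hypothetical placeholder.

import tensorflow as tf
import matplotlib.pyplot as plt

# Load one sample image (hypothetical path) and add a batch dimension,
# since the Keras preprocessing layers expect batched input.
image = tf.keras.utils.load_img("sample_face.jpg", target_size=(height, width))
image = tf.expand_dims(tf.keras.utils.img_to_array(image), 0)

# Generate and plot nine random variants of the same image.
plt.figure(figsize=(8, 8))
for i in range(9):
    augmented = data_augmentation(image, training=True)  # training=True forces the random transforms
    plt.subplot(3, 3, i + 1)
    plt.imshow(augmented[0].numpy().astype("uint8"))
    plt.axis("off")
plt.show()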

You do need to be cautious with data augmentation, though. For a face recognition system, if you know you will never receive an image that is flipped upside down, you should not train on vertically flipped images.

Similar to the image data we discussed above, we can augment audio. Say you are building a model that identifies bird sounds: you can mix different background noises, for example traffic or people talking, into the original bird recordings to increase your input data, while keeping the augmentations realistic.
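As a rough illustration of the idea (not part of the original case study), the sketch below mixes a background-noise clip into a bird recording using NumPy and SciPy; the file names and the 0.3 noise level are hypothetical choices, and mono 16-bit audio is assumed.

import numpy as np
from scipy.io import wavfile

# Load the original bird call and a background-noise clip (hypothetical files, assumed mono int16).
rate, bird = wavfile.read("bird.wav")
_, noise = wavfile.read("traffic.wav")

def add_background_noise(signal, noise, noise_level=0.3):
    """Mix a noise clip into a signal to create one augmented copy."""
    signal = signal.astype(np.float32)
    noise = np.resize(noise.astype(np.float32), signal.shape)  # tile/trim noise to match length
    augmented = signal + noise_level * noise
    return np.clip(augmented, -32768, 32767).astype(np.int16)

augmented_bird = add_background_noise(bird, noise)
wavfile.write("bird_with_traffic.wav", rate, augmented_bird)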

A Python example of data augmentation improving model performance

In a case study I was working on, which involved recognizing faces, I collected sample images of 6 individuals, cleaned the images, and trained a convolutional neural network on them.

from tensorflow.keras.models import Sequential

num_classes = len(class_names)  # class_names comes from the dataset, e.g. train_ds.class_names

model = Sequential([
    layers.Rescaling(1./255, input_shape=(height, width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

On doing error analysis, I found the following.

You can see that after about 2–3 epochs, the training accuracy is increasing steadily but the validation accuracy does not seem to improve. Similarly, with the loss, the training loss decreases with every epoch, but not the validation loss. The model was most certainly overfitting.
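If you keep the object returned by model.fit (for example, history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)), the curves described above can be plotted directly from it; this sketch assumes the 'accuracy' metric and a validation set were used, as in the compile call above.

import matplotlib.pyplot as plt

acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs_range = range(len(acc))

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label="Training accuracy")
plt.plot(epochs_range, val_acc, label="Validation accuracy")
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label="Training loss")
plt.plot(epochs_range, val_loss, label="Validation loss")
plt.legend()
plt.show()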

To overcome this, I added more data using data augmentation (the code we discussed previously) and used dropout for regularization.

model = Sequential([
    data_augmentation,                 # augmentation runs inside the model, on the fly
    layers.Rescaling(1./255),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),               # regularization to reduce overfitting
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])
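The model is then compiled and trained exactly as before; a minimal sketch, assuming train_ds and val_ds are the training and validation datasets from the case study and the epoch count is illustrative.

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_ds,                 # assumed training dataset
                    validation_data=val_ds,   # assumed validation dataset
                    epochs=15)                # illustrative number of epochs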

Adding dropout and data augmentation does make training better than it was before, though I still think there is room for improvement.

Adding features for structured data

The data augmentation techniques we discussed for unstructured data above may not apply to structured data. Say your model predicts which cafes a person may want to visit. If a certain region has only a limited number of cafes, you cannot create more cafes out of random information, since those fabricated cafes would end up being recommended to users.

In such cases, increasing the number of features in your data may help. For example, you might add features indicating which cafes allow pets or which cafes provide a good internet connection. The decision of which features to keep can be guided by your validation loss.
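A minimal sketch of what adding such features could look like with pandas; the cafe table, the allows_pets and has_wifi columns, and the values in them are hypothetical placeholders.

import pandas as pd

# Hypothetical cafe table used by the recommender.
cafes = pd.DataFrame({
    "cafe_id": [1, 2, 3],
    "avg_rating": [4.2, 3.8, 4.5],
})

# Instead of fabricating new cafes, enrich the existing rows with new features.
extra_features = pd.DataFrame({
    "cafe_id": [1, 2, 3],
    "allows_pets": [True, False, True],
    "has_wifi": [True, True, False],
})
cafes = cafes.merge(extra_features, on="cafe_id", how="left")

# Retrain and re-evaluate the recommender with and without the new columns,
# keeping them only if the validation loss improves.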

Why is labeling consistency important?

If you are collecting data on your own or with your team, maintaining labeling consistency is an important step to include in your machine learning pipeline.

Consider a speech recognition system: if you have multiple labelers, they could transcribe the same audio in different ways.

Person 1: “Ummm…Where can I look for the Lokmanya Tilak bus stop?”

Person 2: “Umm, Where can I look for the Lokmanya Tilak bus stop?”

Person 3: “[unintelligible], Where can I look for the Lokmanya Tilak bus stop?”

Do you think any of these people transcribed the audio incorrectly? Probably not, but each had a different way of describing the first part of the audio, which your model might find difficult to identify as the same thing.

You could set down rules for the labelers, such as labeling all unintelligible speech as [unintelligible]. This removes ambiguity from your input data and improves model performance.
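One lightweight way to enforce such a rule is to normalize the transcripts programmatically; the sketch below collapses common filler spellings into the agreed token (the list of filler patterns is an assumption for illustration).

import re

# Map variant spellings of hesitation sounds to a single agreed-upon token.
FILLER_PATTERN = re.compile(r"\b(?:u+m+|u+h+|er+m*)\b[.,…]*\s*", flags=re.IGNORECASE)

def normalize_transcript(text: str) -> str:
    text = FILLER_PATTERN.sub("[unintelligible] ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Ummm…Where can I look for the Lokmanya Tilak bus stop?"))
print(normalize_transcript("Umm, Where can I look for the Lokmanya Tilak bus stop?"))
# Both transcripts now start with "[unintelligible] Where", matching the agreed rule.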

Let me give you another example. Suppose your company sells refurbished products. Before buying a phone from a customer, you want to know whether there are any scratches on the screen and the estimated cost of restoring it. You first collect data for phone screens with and without scratches and ask your team of labelers to label the images.

Scratch size (mm):   0.01   0.2   0.7

Person 1's labels:   1      1     1

Person 2's labels:   0      0     1

As you can see, the two labelers disagree on what size of scratch should be considered faulty and what should not.

This would not happen if you had instructed the labelers to only consider scratches larger than 0.3 mm as faulty. You would then receive well-labeled, unambiguous data for your model to train on. This is why maintaining labeling consistency is important for your model's training.
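That rule can even be written down as code, so every labeler (or an automated check) applies exactly the same threshold; the function below is just an illustration of the 0.3 mm rule from this example.

FAULT_THRESHOLD_MM = 0.3  # agreed rule: only scratches larger than 0.3 mm count as faulty

def label_scratch(scratch_size_mm: float) -> int:
    """Return 1 (faulty) or 0 (not faulty) according to the agreed threshold."""
    return 1 if scratch_size_mm > FAULT_THRESHOLD_MM else 0

print([label_scratch(size) for size in (0.01, 0.2, 0.7)])  # -> [0, 0, 1]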

Conclusion

As part of this article,

  • We looked at the data-centric approach for building machine learning models.
  • We understood data augmentation for unstructured data and feature addition for structured data, and looked at some code for the same.
  • Lastly, we discussed the importance of maintaining labeling consistency.

If you have any feedback on the article, please feel free to comment. Thanks for reading along.
