Data Science

Creating Efficient Data Frames Using Pandas

Save time using this technique

Alifia Ghantiwala


Photo by Fredy Jacob on Unsplash

Introduction

Often, in machine learning competitions or in production settings at your company, you will have to work with large datasets. If you load or create a data frame that is not memory efficient, it will take up a lot of time during your data analysis. The time involved in querying a data frame can be reduced if the data frame takes up less space in RAM (is stored efficiently).

This will be our topic of discussion in this article.

Index Of Contents
· Introduction
· 1) Let’s begin by creating a dummy data frame
· 2) Optimizing by storing as a categorical feature
· 3) Optimizing by using a smaller data type
· 4) Storing features as boolean
· 5) Loading a CSV efficiently
· 6) Conclusion

1) Let’s begin by creating a dummy data frame

We will start with importing basic libraries.

import pandas as pd
import numpy as np

We will be using the np.random module to help us create the dummy data for this article.

Suppose we are analyzing previous performance to predict who will win the Indian Premier League (IPL) this year. We would collect data and create the following columns in a data frame.

Team: Name of the team

Age: The number of years the team has been playing for. (Some teams have been playing in the IPL since its inception, but we do have some teams that were created recently as well.)

Won_Before: This field indicates if the team has won an IPL tournament before.

Win_prob: Probability of winning this IPL (this is the target field)

The following function can be used to create a dummy data frame.

def create_df(size):
    # size: the number of rows in the generated data frame
    data = pd.DataFrame()
    data["team"] = np.random.choice(['Chennai Super Kings', 'Mumbai Indians',
                                     'Kolkata Knight Riders', 'Gujarat Titans'], size)
    data["age"] = np.random.randint(1, 10, size)
    data["won_before"] = np.random.choice(['Yes', 'No'], size)
    data["win_prob"] = np.random.uniform(0, 1, size)
    return data

The size parameter specifies the number of rows in the data frame. Let’s create a data frame with 1,000,000 rows.
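For example (passing memory_usage="deep" to info() makes pandas count the actual contents of the object columns, not just the pointers to them):

# Build the dummy data frame and inspect its memory footprint.
data = create_df(1000000)
data.info(memory_usage="deep")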

If you check data.info(), you will get an idea of the amount of memory used by the data frame.

Source: Author

The data frame occupies almost 305 MB; we will try to optimize its storage in the next sections of the article.

2) Optimizing by storing as a categorical feature

In our data frame, the team field is a categorical feature. If we change the data type of the team field from a string (its current type) to a category, we might be able to save some space.

If your feature has the same values repeated over and over again, instead of storing them multiple times as a string you can store them as a categorical variable. This is a form of lossless compression. We are not losing any data but are storing the same data efficiently.

data["team"] = data["team"].astype("category")

After making this change, a quick look at data.info() returns the following.

Source: Author

Voila! We were able to reduce the memory usage from 305 MB to 238.4 MB.

3) Optimizing by using a smaller data type

If you have numerical columns in your data, you might want to use smaller data types to store them; this helps store the data efficiently.

Every data type has a certain range of values it can hold. The uint8 data type, for example, can store values from 0 to 255, whereas an int32 can store values from -2147483648 to 2147483647. The wider the range a data type can handle, the more space it takes up in memory.
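You don’t have to memorize these ranges; NumPy can report them for you, as in this quick sketch using np.iinfo and np.finfo:

# Inspect the representable range of candidate data types.
print(np.iinfo(np.uint8))    # min = 0, max = 255
print(np.iinfo(np.int32))    # min = -2147483648, max = 2147483647
print(np.finfo(np.float32))  # precision and range of a 32-bit float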

Storing your features efficiently involves choosing a data type based on the range of values it needs to handle. For example, if you know the age feature will only have values between 1 and 20, you would want to store it as a uint8 instead of an int64. Let’s downcast our numerical features and check the memory usage.

data["age"] = data["age"].astype("uint8")
data["win_prob"] = data["win_prob"].astype("float32")

Let us now check the memory usage of the data frame; it is down to 133 MB from the original 305 MB!

Source: Author

4) Storing features as boolean

We know that our won_before variable is binary, so it can be stored as a boolean instead of an object data type.
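Since won_before currently holds the strings 'Yes' and 'No', one simple way to make the conversion is to compare against 'Yes'; the result of the comparison already has the boolean dtype:

# Map the 'Yes'/'No' strings to True/False, stored as a boolean column.
data["won_before"] = (data["won_before"] == "Yes")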

And we were able to bring down the size of the data frame to 66.8 MB!

Source: Author

Call for drumrolls, please!

5) Loading a CSV efficiently

Alright, consider a scenario where you have to load a CSV file into a data frame.

Instead of doing this:

sample = pd.read_csv("XYZ.csv")

You could load only the columns that you need to work with, using the usecols parameter of the read_csv method.

sample = pd.read_csv("XYZ.csv",usecols=["team","age","won_before","win_prob"])

6) Conclusion

The aim of this article was to describe some lossless ways of compressing data so that it is stored more efficiently. You can use lossy compression techniques as well, when you are okay with doing away with some details of your data.

We discussed downcasting data types and using boolean or categorical data types whenever possible to store your data frames more efficiently.


Thanks for reading! If you have any feedback, please feel free to comment below.
