Optimizing Memory In Python
Python is one of the most widely used programming languages for data science, data analytics, and machine learning. It keeps gaining popularity because it is easy for beginners to pick up and has powerful libraries such as Pandas, NumPy, and Matplotlib that help us manage and manipulate large amounts of data with ease.
The Pandas library lets us store tabular data in a data type called a DataFrame. A DataFrame can hold a large amount of tabular data, and its contents are easy to access through row and column indices. When we use Pandas on small datasets, performance is not an issue. With larger datasets (roughly 100 MB to several GB), operations take longer and consume more memory, which can leave the machine short of RAM.
Limiting memory usage is important. If the entire RAM is consumed, the program can crash with a MemoryError. Optimizing memory not only reduces the memory consumed but can also speed up our computation and save time. In this blog, we will discuss one of the many ways to reduce memory usage.
Inspecting the memory usage of a DataFrame
Let us first load the pandas package and import our dataset. I've used the Drinksbycountry dataset, which is in comma-separated values (CSV) format.
We can take a peek at our dataset using the drinks.head() method.
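The loading and peeking steps might look roughly like the sketch below. The real file is not included here, so the example reads a small made-up CSV from a string. Only the continent column and the drinks variable come from the post; the other column names and all values are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up stand-in for the CSV file so the example runs anywhere.
# Only the `continent` column is named in the post; every other column
# name and all values here are assumptions for illustration.
csv_text = """country,beer_servings,wine_servings,continent
Country A,0,0,Asia
Country B,89,54,Europe
Country C,25,14,Africa
Country D,245,312,Europe
Country E,217,45,Africa
Country F,102,45,Asia
"""

# With a real file, pass its path to pd.read_csv instead of a StringIO.
drinks = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
print(drinks.head())
```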
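To inspect the memory consumed by each column, we'll use the memory_usage() method of pandas with the deep parameter set to True. With deep=True, pandas also counts the memory held by the strings in object columns, not just the references to them.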
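A minimal sketch of that call, using a tiny made-up frame in place of the real dataset. The beer_servings column name and all values are assumptions. The continent column is forced to the object datatype to match the columns described in this post, since newer pandas versions may choose a dedicated string datatype by default.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset.
drinks = pd.DataFrame(
    {
        "beer_servings": [0, 89, 25, 245, 217, 102],
        # dtype=object matches the object columns discussed in the post.
        "continent": pd.Series(
            ["Asia", "Europe", "Africa", "Europe", "Africa", "Asia"],
            dtype=object,
        ),
    }
)

# Default: object columns are counted as references only.
shallow = drinks.memory_usage()

# deep=True: the Python strings behind those references are counted too.
deep = drinks.memory_usage(deep=True)

print(shallow)
print(deep)
```

Comparing the two results shows why deep=True matters: the numeric column reports the same size either way, while the string column only reveals its real footprint in the deep measurement.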
This gives us the memory used by each column in bytes.
Reducing the memory usage of a DataFrame
We can see that the columns with the object data type consume the most memory, since they store strings. Columns of integers and floating-point numbers consume significantly less.
So why not change the datatype of any column with few unique values to the category datatype?
The continent column has only a few unique values, so we can change its datatype from object to category. This can be done using the astype() method.
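As a rough sketch, the conversion could look like this. The frame is a made-up stand-in for the real dataset, and the country column and all values are assumptions. Counting unique values first is one simple way to judge whether a column is a good candidate.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
drinks = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E", "F"],
        "continent": pd.Series(
            ["Asia", "Europe", "Africa", "Europe", "Africa", "Asia"],
            dtype=object,
        ),
    }
)

# Few unique values relative to the row count suggests a good candidate.
n_unique = drinks["continent"].nunique()
print(n_unique, "unique values in", len(drinks), "rows")

# Convert the column and assign it back to the frame.
drinks["continent"] = drinks["continent"].astype("category")
print(drinks["continent"].dtype)
```

Note that astype() returns a new column rather than modifying the existing one, so the result has to be assigned back.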
If we look at our dataset again, the dataframe looks the same on the surface. Even so, the way it stores the data internally has changed. We can confirm this by checking the datatypes of the dataframe.
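One way to check this is sketched below on a made-up stand-in frame (column names other than continent, and all values, are assumptions). It keeps an unconverted copy so the two versions can be compared side by side.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
original = pd.DataFrame(
    {
        "beer_servings": [0, 89, 25, 245, 217, 102],
        "continent": pd.Series(
            ["Asia", "Europe", "Africa", "Europe", "Africa", "Asia"],
            dtype=object,
        ),
    }
)

converted = original.copy()
converted["continent"] = converted["continent"].astype("category")

# The rows still read the same...
print(converted.head())
same_values = list(original["continent"]) == list(converted["continent"])

# ...but the dtypes listing now reports `category` for continent.
print(converted.dtypes)
```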
When the datatype of the continent column is changed to category, its records are stored as integer codes instead of strings. Each integer code refers to one of the unique string values, which are stored only once. This reduces the space taken by the continent column and, in turn, the memory consumed by the dataframe.
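The codes and the strings they point to can be inspected through the .cat accessor. The sketch below uses a short made-up column rather than the real data.

```python
import pandas as pd

# Short made-up column standing in for the real continent column.
continent = pd.Series(
    ["Asia", "Europe", "Asia", "Africa", "Europe", "Asia"],
    dtype="category",
)

# The unique strings, each stored once (sorted when inferred from data).
print(continent.cat.categories)

# One small integer per row, indexing into the categories above.
print(continent.cat.codes)
```

With only a handful of categories, pandas can store each code in a single byte, which is where most of the saving comes from.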
The space taken by the continent column went down from 12332 bytes to 756 bytes, a reduction of almost 94%.
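The percentage follows directly from the two byte counts, and the same before-and-after measurement can be repeated on any column. The sketch below checks the arithmetic and then measures a made-up column of repeated values. The exact byte counts it prints depend on the data and on the Python and pandas versions, so they will not match the figures above.

```python
import pandas as pd

# Arithmetic behind the figure reported in the post.
reported_reduction = (12332 - 756) / 12332 * 100
print(round(reported_reduction, 1))  # 93.9

# Made-up column of repeated strings, stored as Python objects.
as_object = pd.Series(["Asia", "Europe", "Africa"] * 60, dtype=object)
as_category = as_object.astype("category")

# index=False measures only the column's values, not the row index.
before = as_object.memory_usage(index=False, deep=True)
after = as_category.memory_usage(index=False, deep=True)

print(before, after)
print(round((before - after) / before * 100, 1), "% reduction")
```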
In this blog, we saw how to inspect the memory usage of a dataframe using the memory_usage() method. We also saw how to reduce that usage by changing a column's datatype. This does not affect how the dataframe looks but reduces the space significantly.
Thank you for giving your precious time to read this blog post. Share your thoughts in the comments section and leave a like if you've read this far.