File formats for pandas DataFrames
File formats
I had to store many GB of data for training. The data source is csv. To avoid parsing the csv every time during training, I explored different file formats, namely:
- feather
- hdf
- numpy
- parquet
- csv
Each file format has its pros and cons.
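For context, here is a minimal sketch of how each file could have been produced. The original save script isn't shown below, so this is my reconstruction using the standard pandas/numpy writers (it assumes pyarrow and PyTables are installed; the file names follow the PATH convention used in the benchmark):

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")        # parse the csv source once

df.to_feather("data.feather")       # needs pyarrow
df.to_hdf("data.hdf", key="df")     # needs PyTables; "df" is an arbitrary key
df.to_parquet("data.pqt")           # needs pyarrow or fastparquet
np.save("data.npy", df.values)      # raw values only; column names/dtypes are lost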
Code
I used 320 MB of data to test the file format performance. Here is the script I used to evaluate them:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: PATH="/home/karthikeyan/data."
In [4]: %timeit df=pd.read_feather(PATH+"feather")
275 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df=pd.read_hdf(PATH+"hdf")
254 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit df=pd.read_parquet(PATH+"pqt")
1.61 s ± 88.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %timeit df=pd.read_csv(PATH+"csv")
20.9 s ± 1.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: df=pd.read_parquet(PATH+"pqt")
In [11]: df.info(memory_usage="deep")
#<class 'pandas.core.frame.DataFrame'>
Int64Index: 41208 entries, 0 to 41207
Columns: 4006 entries, LevelFI0 to Hinst
dtypes: object(1), uint16(4005)
memory usage: 319.9 MB
In [12]: np.save(PATH+"npy", df.values)
In [13]: %timeit df = pd.DataFrame(np.load(PATH+"npy"))
8 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: np.save(PATH+"npy2", df.transpose().values)
In [15]: %timeit df = pd.DataFrame(np.load(PATH+"npy2.npy").transpose())
5.55 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: df.shape
Out[16]: (41208, 4006)
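One caveat if you rerun the numpy cells: because the frame mixes uint16 columns with one object (string) column, df.values is an object array, and since numpy 1.16.3 np.load refuses to unpickle object arrays by default. A one-line fix:

# Object arrays are pickled on save, so loading them needs explicit opt-in.
df = pd.DataFrame(np.load(PATH + "npy", allow_pickle=True))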
Comparison
Of course, there are plenty of blogs comparing file format performance. But I needed to analyse it myself for the size of data I deal with (~10s of GB). It's important to note that my 320 MB dataset contains unsigned integers and also a column of strings.
Format | File Size | Read Time | Pros | Cons
---|---|---|---|---
feather | 318 MB | 275 ms | Very fast save/read; good for columnar data | File size similar to csv
hdf | 319 MB | 254 ms | Fairly fast save; very fast read | File size similar to csv
numpy | 319 MB | 8000 ms | Flat format | File size similar to csv
numpy transpose | 319 MB | 5500 ms | Flat format; medium read speed | File size similar to csv
parquet | 77 MB | 1610 ms | Fairly fast save (4x feather); fairly fast read (6x feather); much smaller file size | Not widely used
csv | 337 MB | 20900 ms | Human readable | Large file size; slow reading/parsing; no good for quick queries
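One more point in parquet's favour for columnar workloads: you can read a subset of columns without touching the rest of the file. A minimal sketch, using the two column names visible in the df.info() output above:

# Only the requested columns are read and decoded.
df_small = pd.read_parquet(PATH + "pqt", columns=["LevelFI0", "Hinst"])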
The overall winner is parquet, in terms of the tradeoff between speed and file size. But if you want to incrementally add or query data, hdf is recommended; see the sketch below. If you are looking for plain speed, use feather: it's very fast and well suited to columnar data processing. There is also a transpose trick I use, covered in the note at the end.
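For the incremental add/query workflow, pandas' HDF5 backend supports appending and on-disk row selection when the store is written in "table" format. A minimal sketch, with a store path and key of my own choosing:

import pandas as pd

# "table" format enables append and where-queries; the default "fixed"
# format reads faster but supports neither.
store_path = "store.h5"                  # hypothetical path
batch = pd.DataFrame({"x": range(5)})    # stand-in for a new chunk of rows
batch.to_hdf(store_path, key="data", format="table", append=True)

# Select only matching rows on disk; the index is queryable by default.
subset = pd.read_hdf(store_path, key="data", where="index < 3")

The "table" format trades some read speed for this flexibility.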
NOTE:
The transpose trick is something I often use to speed up reading data into and writing data out of a DataFrame.
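A minimal sketch of that trick as a pair of helpers (the names are mine). The transpose() on load is a zero-copy view, and the resulting Fortran-ordered array lines up with pandas' internal column-major blocks, which is presumably where the saving comes from:

import numpy as np
import pandas as pd

def save_transposed(df, path):
    # np.save appends ".npy" if the path does not already end with it.
    # Only raw values are stored; column names and per-column dtypes are lost.
    np.save(path, df.transpose().values)

def load_transposed(path):
    # transpose() returns a zero-copy view, so this step is essentially free.
    # Use np.load(path, allow_pickle=True) if the frame mixes dtypes
    # (mixed-dtype frames produce object arrays, as in the example above).
    return pd.DataFrame(np.load(path).transpose())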