File formats for pandas DataFrames
File formats
I had to store many GB of data for training. The data source is csv. To avoid parsing the csv every time during training, I explored different file formats, namely:
- feather
- hdf
- numpy
- parquet
- csv
Each file format has its pros and cons.
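For context, here is a minimal sketch of how each file could have been produced. The original save script isn't shown below, so this is my reconstruction using the standard pandas/numpy writers (it assumes pyarrow and PyTables are installed; the file names follow the PATH convention used in the benchmark):

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")        # parse the csv source once

df.to_feather("data.feather")       # needs pyarrow
df.to_hdf("data.hdf", key="df")     # needs PyTables; "df" is an arbitrary key
df.to_parquet("data.pqt")           # needs pyarrow or fastparquet
np.save("data.npy", df.values)      # raw values only; column names/dtypes are lost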
Code
I used 320 MB of data to test the file format performance. Here is the script I used to evaluate them:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: PATH="/home/karthikeyan/data."
In [4]: %timeit df=pd.read_feather(PATH+"feather")
275 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df=pd.read_hdf(PATH+"hdf")
254 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit df=pd.read_parquet(PATH+"pqt")
1.61 s ± 88.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %timeit df=pd.read_csv(PATH+"csv")
20.9 s ± 1.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: df=pd.read_parquet(PATH+"pqt")
In [11]: df.info(memory_usage="deep")
#<class 'pandas.core.frame.DataFrame'>
Int64Index: 41208 entries, 0 to 41207
Columns: 4006 entries, LevelFI0 to Hinst
dtypes: object(1), uint16(4005)
memory usage: 319.9 MB
In [12]: np.save(PATH+"npy", df.values)
In [13]: %timeit df = pd.DataFrame(np.load(PATH+"npy"))
8 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: np.save(PATH+"npy2", df.transpose().values)
In [15]: %timeit df = pd.DataFrame(np.load(PATH+"npy2.npy").transpose())
5.55 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: df.shape
Out[16]: (41208, 4006)
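One caveat if you rerun the numpy cells: because the frame mixes uint16 columns with one object (string) column, df.values is an object array, and since numpy 1.16.3 np.load refuses to unpickle object arrays by default. A one-line fix:

# Object arrays are pickled on save, so loading them needs explicit opt-in.
df = pd.DataFrame(np.load(PATH + "npy", allow_pickle=True))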
Comparison
Of course, there are plenty of blogs comparing file format performance. But I needed to analyse it myself for the size of data I deal with (~10s of GB). It's important to note that my 320 MB dataset contains unsigned integers and also a column of strings.
Format | File Size | Read Time | Pros | Cons
---|---|---|---|---
feather | 318 MB | 275 ms | Very fast save/read; good for columnar data | File size similar to csv
hdf | 319 MB | 254 ms | Fairly fast save; very fast read | File size similar to csv
numpy | 319 MB | 8000 ms | Flat format | File size similar to csv
numpy transpose | 319 MB | 5500 ms | Flat format; medium read speed | File size similar to csv
parquet | 77 MB | 1610 ms | Fairly fast save (4x feather); fairly fast read (6x feather); much smaller file size | Not widely used
csv | 337 MB | 20900 ms | Human readable | Large file size; slow reading/parsing; no good for quick queries
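One more point in parquet's favour for columnar workloads: you can read a subset of columns without touching the rest of the file. A minimal sketch, using the two column names visible in the df.info() output above:

# Only the requested columns are read and decoded.
df_small = pd.read_parquet(PATH + "pqt", columns=["LevelFI0", "Hinst"])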
The overall winner is parquet, in terms of the tradeoff between speed and file size. But if you want to incrementally add or query data, hdf is recommended; see the sketch below. If you are looking for plain speed, use feather: it's very fast and well suited to columnar data processing. There is also a transpose trick I use, covered in the note at the end.
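For the incremental add/query workflow, pandas' HDF5 backend supports appending and on-disk row selection when the store is written in "table" format. A minimal sketch, with a store path and key of my own choosing:

import pandas as pd

# "table" format enables append and where-queries; the default "fixed"
# format reads faster but supports neither.
store_path = "store.h5"                  # hypothetical path
batch = pd.DataFrame({"x": range(5)})    # stand-in for a new chunk of rows
batch.to_hdf(store_path, key="data", format="table", append=True)

# Select only matching rows on disk; the index is queryable by default.
subset = pd.read_hdf(store_path, key="data", where="index < 3")

The "table" format trades some read speed for this flexibility.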
NOTE:
The transpose trick is something I often use to speed up reading data into and writing data out of a DataFrame.
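A minimal sketch of that trick as a pair of helpers (the names are mine). The transpose() on load is a zero-copy view, and the resulting Fortran-ordered array lines up with pandas' internal column-major blocks, which is presumably where the saving comes from:

import numpy as np
import pandas as pd

def save_transposed(df, path):
    # np.save appends ".npy" if the path does not already end with it.
    # Only raw values are stored; column names and per-column dtypes are lost.
    np.save(path, df.transpose().values)

def load_transposed(path):
    # transpose() returns a zero-copy view, so this step is essentially free.
    # Use np.load(path, allow_pickle=True) if the frame mixes dtypes
    # (mixed-dtype frames produce object arrays, as in the example above).
    return pd.DataFrame(np.load(path).transpose())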