File formats for pandas DataFrame
File formats
I had to store many GB of data for training. The data source is CSV. To avoid parsing CSV every time during training, I explored different file formats, namely:
- feather
- hdf
- numpy
- parquet
- csv
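A minimal sketch of how such a comparison could be set up, assuming a small random DataFrame in place of the real training data. The helper name `time_reads` and the file names are hypothetical. Feather and parquet need pyarrow and HDF needs PyTables, so any format whose dependency is missing is recorded as `None` rather than failing.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def time_reads(df, directory):
    """Write df in each format, then time reading it back (in seconds)."""
    # Each entry maps a format name to (writer, reader); both take a file path.
    formats = {
        "csv": (lambda p: df.to_csv(p, index=False), pd.read_csv),
        "numpy": (lambda p: np.save(p, df.to_numpy()), np.load),
        "feather": (df.to_feather, pd.read_feather),
        "parquet": (df.to_parquet, pd.read_parquet),
        "hdf": (lambda p: df.to_hdf(p, key="data"),
                lambda p: pd.read_hdf(p, key="data")),
    }
    timings = {}
    for name, (write, read) in formats.items():
        # np.save keeps the name as-is when it already ends in ".npy".
        filename = "data.npy" if name == "numpy" else f"data.{name}"
        path = str(Path(directory) / filename)
        try:
            write(path)
            start = time.perf_counter()
            read(path)
            timings[name] = time.perf_counter() - start
        except ImportError:
            # Optional dependency (pyarrow or PyTables) is not installed.
            timings[name] = None
    return timings


# Stand-in data: 1000 rows, 4 float columns with string column names.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 4)), columns=list("abcd"))

with tempfile.TemporaryDirectory() as tmp:
    timings = time_reads(df, tmp)
print(timings)
```

Note that the NumPy file stores only the values, so column names and dtypes per column have to be kept elsewhere if they matter.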
Engineer from Bangalore
In my earlier posts, I shared why pandas is fast. I use pandas for data munging and Keras/TensorFlow for building DNN models. After cleaning the data in pandas, I fed it to Keras and found that training was 2x to 4x slower than with plain NumPy arrays. The reason is that pandas uses Fortran (column-major) order for its internal NumPy arrays. Training happens in batches, so the data is read row by row, and row access is slow in Fortran order because the values of a single row are not adjacent in memory.
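One way to sidestep the slow row access is to copy the cleaned data into a row-major (C-ordered) array once, before training. The sketch below assumes an all-numeric DataFrame; the helper name `to_row_major` and the choice of `float32` are illustrative assumptions, not taken from the post.

```python
import numpy as np
import pandas as pd


def to_row_major(df):
    """Return the frame's values as a C-ordered (row-major) float32 array."""
    # np.ascontiguousarray copies only if the data is not already C-ordered.
    return np.ascontiguousarray(df.to_numpy(dtype=np.float32))


# Stand-in for cleaned training data: 1000 rows, 8 numeric feature columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({f"f{i}": rng.random(1000) for i in range(8)})

# Inspect the memory layout pandas hands back (it can vary with how the
# frame was built and with the pandas version).
raw = df.to_numpy()
print("C order:", raw.flags["C_CONTIGUOUS"], "F order:", raw.flags["F_CONTIGUOUS"])

# Convert once, then slice batches row-wise from the converted array.
batch_source = to_row_major(df)
first_batch = batch_source[:32]
```

The conversion costs one extra copy of the data, which is usually paid back quickly because every batch read afterwards touches contiguous memory.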
Pandas is a very useful tool for data munging and is efficient at column operations. The secret to its speed is that it stores data internally as NumPy arrays in Fortran order. I will explain why that makes pandas faster.
I needed a logger that automatically writes print statements to a log file in Python logging format, without changing many statements in my original code.
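A minimal sketch of one way to do this with only the standard library, not necessarily the approach the post takes: redirect `sys.stdout` to a small file-like object that forwards complete lines to a logger. The class name `LoggerWriter` and the logger name `print_capture` are hypothetical, and an in-memory stream stands in for the log file.

```python
import contextlib
import io
import logging


class LoggerWriter:
    """File-like object that forwards each complete printed line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level
        self._buffer = ""

    def write(self, text):
        # print() may call write() several times per line, so buffer
        # the pieces until a newline arrives.
        self._buffer += text
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if line:
                self.logger.log(self.level, line)
        return len(text)

    def flush(self):
        # Nothing to flush; the logging handlers manage their own output.
        pass


logger = logging.getLogger("print_capture")
logger.setLevel(logging.INFO)
logger.propagate = False

# In-memory stream as a stand-in; logging.FileHandler would write to disk.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)

# Existing print calls stay unchanged; only the redirect wraps them.
with contextlib.redirect_stdout(LoggerWriter(logger)):
    print("epoch", 1, "done")

captured = stream.getvalue()
```

Because the redirect is a context manager, the original code only needs to be wrapped once at its entry point rather than edited statement by statement.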
I started using Jekyll to create my blog on GitHub, with jekyll-now as the template.