File formats for pandas DataFrame

File formats

I had to store many GB of data for training. The data source is CSV. To avoid parsing CSV every time during training, I explored different file formats, namely:

  1. feather
  2. hdf
  3. numpy
  4. parquet
  5. csv
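
As a rough illustration of the formats listed above, here is a minimal sketch that writes one small DataFrame to each format and reads it back. The toy data, the file names, and the `path` helper are made up for this example. Feather and parquet need an extra package such as pyarrow, and HDF needs PyTables, so the sketch skips any format whose dependency is not installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy data; the real data set was many GB.
df = pd.DataFrame({"a": np.linspace(0.0, 1.0, 5), "b": np.arange(5, dtype=np.float64)})

tmp_dir = tempfile.mkdtemp()


def path(name):
    """Build a file path inside the temporary directory (helper for this sketch)."""
    return os.path.join(tmp_dir, name)


loaded = {}

# csv: plain text, has to be parsed again on every read.
df.to_csv(path("data.csv"), index=False)
loaded["csv"] = pd.read_csv(path("data.csv"))

# numpy: stores only the raw array, so column names are re-attached by hand.
np.save(path("data.npy"), df.to_numpy())
loaded["numpy"] = pd.DataFrame(np.load(path("data.npy")), columns=df.columns)

# Formats that rely on optional packages.
optional = {
    "feather": (lambda p: df.to_feather(p), pd.read_feather),
    "parquet": (lambda p: df.to_parquet(p), pd.read_parquet),
    "hdf": (lambda p: df.to_hdf(p, key="df"), lambda p: pd.read_hdf(p, key="df")),
}
for name, (write, read) in optional.items():
    target = path("data." + name)
    try:
        write(target)
        loaded[name] = read(target)
    except ImportError:
        print(f"skipping {name}: optional dependency not installed")

for name, frame in loaded.items():
    print(name, frame.shape)
```

Which format is fastest depends on the data and the machine, so it is worth timing the read step on your own files.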

Pandas for training

In my earlier posts, I shared why pandas is fast. I use pandas for data munging and Keras/TensorFlow for building DNN models. After cleaning the data in pandas, I fed it to Keras and found that training was 2x to 4x slower than with plain numpy arrays. The reason is that pandas stores its internal numpy arrays in Fortran (column-major) order. Training happens in batches, so it has to read the data row-wise. Row access is slow in Fortran order, and that slows down training.
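
One way to work around this is to copy the data into a row-major (C order) array once, before training. The snippet below is a minimal sketch with made-up data and column names. The layout that `to_numpy()` returns can depend on the pandas version and on how the frame was built, so the sketch prints the flags instead of assuming them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 rows and 8 float features.
rng = np.random.default_rng(0)
df = pd.DataFrame({f"f{i}": rng.random(1000) for i in range(8)})

raw = df.to_numpy()
# Inspect the memory layout that pandas hands over.
print("F-contiguous:", raw.flags["F_CONTIGUOUS"])
print("C-contiguous:", raw.flags["C_CONTIGUOUS"])

# Copy into row-major (C) order so each row sits in adjacent memory.
# If the array is already C-contiguous float32, no copy is made.
features = np.ascontiguousarray(raw, dtype=np.float32)

# A batch is now a slice of consecutive rows in one contiguous chunk.
batch = features[0:32]
print(batch.shape, batch.flags["C_CONTIGUOUS"])
```

The `features` array can then be passed to the training loop in place of the DataFrame.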


Pandas, why is it fast?

Pandas

Pandas is a very useful tool for data munging, and it handles column operations efficiently. The secret to its speed is that it stores data internally as numpy arrays in Fortran (column-major) order. I will explain why that makes pandas fast.
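
To make the idea of Fortran order concrete, here is a small numpy-only sketch (it does not touch pandas internals) that compares the strides of a row-major and a column-major array. The shape is arbitrary. Strides are the number of bytes to step to reach the next row and the next column.

```python
import numpy as np

rows, cols = 1000, 8
c_arr = np.zeros((rows, cols), dtype=np.float64)  # C order: row-major
f_arr = np.asfortranarray(c_arr)                  # Fortran order: column-major

# (bytes to the next row, bytes to the next column)
print(c_arr.strides)  # (64, 8)
print(f_arr.strides)  # (8, 8000)

# In Fortran order a whole column is one contiguous run of memory,
# which is what makes column operations cheap.
print(f_arr[:, 0].flags["C_CONTIGUOUS"])  # True
print(c_arr[:, 0].flags["C_CONTIGUOUS"])  # False
```

In Fortran order, stepping down a column moves 8 bytes at a time, while stepping along a row jumps 8000 bytes, so column scans are cache friendly and row scans are not.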
