Writing fixed-size Parquet files
Parquet files today are written one-file-per-locale. That design was carried over from the HDF5 interface, where writing one file per locale means each locale only writes its local portion of an array and never has to communicate with other nodes to pull in remote data, which would hurt write speeds. It also works around HDF5's limitation of only being able to write one file per process.
An alternate approach was discussed where we would write fixed-size Parquet files, regardless of locality. For example, with a write threshold of 1.5 GB and a 10 GB DataFrame on 16 locales, we would write only 7 files, since we want each file to be as close to 1.5 GB as possible. We would sacrifice write speed to pull in the remote data, but read speeds would improve because the file sizes are optimal for reading.
As another example, a 100 GB DataFrame on 16 locales would be written as 67 files.
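To make the file-count math concrete, here is a minimal Python sketch (not part of the arkouda API; `plan_fixed_size_files` is a hypothetical helper) showing how the number of files and the byte range each file covers could be derived from the total size and the threshold:

```python
import math

def plan_fixed_size_files(total_bytes: int, threshold_bytes: int):
    """Split total_bytes into chunks no larger than threshold_bytes,
    returning (start, end) byte offsets for each output file."""
    num_files = math.ceil(total_bytes / threshold_bytes)
    # Spread the bytes as evenly as possible so every file is close to the threshold
    base, extra = divmod(total_bytes, num_files)
    offsets = []
    start = 0
    for i in range(num_files):
        size = base + (1 if i < extra else 0)
        offsets.append((start, start + size))
        start += size
    return offsets

GB = 1024**3
# 10 GB DataFrame with a 1.5 GB threshold -> 7 files
print(len(plan_fixed_size_files(10 * GB, int(1.5 * GB))))   # 7
# 100 GB DataFrame with a 1.5 GB threshold -> 67 files
print(len(plan_fixed_size_files(100 * GB, int(1.5 * GB))))  # 67
```

Note that the chunk boundaries here ignore locale boundaries entirely, which is the point: a file may span data owned by several locales, so the writer for that file has to pull in the remote portions.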