polars
Memory mapped compressed feather files
Problem description
I wish Polars would support memory mapping for compressed feather files in the same way it does for parquet. This would certainly improve the read performance of scan_ipc, especially for subsequent data access where the OS has already cached the memory-mapped pages. It should make this the fastest option to read or scan data with Polars, as no conversion is required from the layout on disk to the layout in memory.
In a nutshell, I wish polars.scan_ipc(file = 'data.feather.lz4', memory_map = True) would work (equally for zstd compression).
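To make the trade-off concrete, here is a minimal standard-library sketch of what memory mapping can and cannot do for compressed data. It is not Polars or Arrow code: zlib stands in for lz4/zstd, the file names are made up, and `checksum_mapped` is a hypothetical helper. The point it illustrates is that an uncompressed mapped file can be read in place through a zero-copy view, whereas a compressed mapped file still has to be decompressed into a newly allocated buffer, so the mapping mainly saves the read of the compressed bytes.

```python
import mmap
import os
import tempfile
import zlib


def checksum_mapped(path: str, compressed: bool) -> int:
    """Memory-map `path` and return the CRC32 of its logical (uncompressed) contents."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        if compressed:
            # The mapped pages hold compressed bytes, so the data must be
            # decompressed into a new heap buffer before it can be used.
            data = zlib.decompress(mapped)
            return zlib.crc32(data)
        # Uncompressed: read straight from the OS page cache through a
        # zero-copy view. The view is released before the mapping closes.
        with memoryview(mapped) as view:
            return zlib.crc32(view)


# 1 MiB of repetitive sample data (illustrative only).
payload = bytes(range(256)) * 4096

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "data.bin")       # hypothetical file names
    packed_path = os.path.join(tmp, "data.bin.z")

    with open(raw_path, "wb") as f:
        f.write(payload)
    with open(packed_path, "wb") as f:
        f.write(zlib.compress(payload))

    raw_crc = checksum_mapped(raw_path, compressed=False)
    packed_crc = checksum_mapped(packed_path, compressed=True)
    packed_size = os.path.getsize(packed_path)

print(f"same contents: {raw_crc == packed_crc}")
print(f"on-disk size: raw={len(payload)} bytes, compressed={packed_size} bytes")
```

Both paths yield the same contents, but only the uncompressed one avoids allocating a second copy of the data.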
That's very interesting. I fully agree with this feature request.
We already open the file memory mapped. I think the biggest performance gain would come from parallelizing the IPC reader, but that needs some upstream arrow2 changes.
@thobai as a workaround you can use filesystem based compression e.g. BTRFS with ZSTD-9.
@jmakov Thanks for the suggestion. We need the compression in the files themselves because we occasionally move them from S3 to disk and back, so filesystem-based compression doesn't help us.
@ritchie46 Do you know if anything happened on the arrow2 side? Is there an open request for parallelizing the IPC reader? I don't quite understand what is missing so I can't write that request in case there's none.