Memory mapped compressed feather files

Open thobai opened this issue 2 years ago • 2 comments

Problem description

I would like Polars to support memory mapping for compressed feather files in the same way it does for parquet. This should improve the read performance of scan_ipc, especially on subsequent data access, when the OS has already cached the memory-mapped pages. It should also make this the fastest way to read or scan data in Polars, since no conversion is required from the layout on disk to the layout in memory.

In a nutshell, I wish polars.scan_ipc(file = 'data.feather.lz4', memory_map = True) would work (and likewise for zstd compression).

thobai avatar Dec 22 '22 14:12 thobai

That's very interesting. I fully agree with this feature request.

arturdaraujo avatar Dec 24 '22 18:12 arturdaraujo

We already open the file memory mapped. I think the biggest performance benefit would come from parallelizing the IPC reader, but that needs some upstream arrow2 changes.
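
To illustrate what memory mapping can and cannot save here, below is a rough standard-library sketch, not Polars or arrow2 code. It assumes `zlib` as a stand-in for lz4/zstd and uses hypothetical file names. Mapped uncompressed bytes can be viewed in place, while mapped compressed bytes still have to be decompressed into a newly allocated buffer before use.

```python
import mmap
import os
import tempfile
import zlib

# Hypothetical stand-in for one column buffer (1 MiB of bytes).
payload = bytes(range(256)) * 4096


def map_readonly(path):
    """Memory-map a whole file read-only; the mapping outlives the file object."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.bin")
    packed_path = os.path.join(tmp, "packed.bin")
    with open(raw_path, "wb") as f:
        f.write(payload)
    with open(packed_path, "wb") as f:
        f.write(zlib.compress(payload))

    # Uncompressed: the memoryview points straight at the mapped pages (no copy).
    raw_map = map_readonly(raw_path)
    raw_view = memoryview(raw_map)
    first_byte = raw_view[0]
    viewed_size = raw_view.nbytes

    # Compressed: the mapping only avoids an explicit read(); the usable data
    # must still be decompressed into a new bytes object.
    packed_map = map_readonly(packed_path)
    mapped_size = len(packed_map)
    decompressed = zlib.decompress(packed_map)

    # Release the view before closing its mapping, then close both mappings.
    raw_view.release()
    raw_map.close()
    packed_map.close()
```

Under these assumptions, the compressed case is dominated by decompression work, which is why parallelizing that step matters more than the mapping itself.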

ritchie46 avatar Dec 27 '22 09:12 ritchie46

@thobai as a workaround you can use filesystem-based compression, e.g. BTRFS with ZSTD-9.

jmakov avatar Oct 21 '23 00:10 jmakov

@jmakov Thanks for the suggestion. We need the compression in the files themselves, as we occasionally move files from S3 to disk and back. So filesystem-based compression doesn't help us.

@ritchie46 Do you know if anything happened on the arrow2 side? Is there an open request for parallelizing the IPC reader? I don't quite understand what is missing, so I can't write that request myself if there isn't one.

thobai avatar Oct 23 '23 07:10 thobai