Inconsistency among write_ipc, sink_ipc, and scan_ipc

Open hleumas opened this issue 1 year ago • 2 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

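import polars as pl

# Write through the lazy API with sink_ipc's default settings
pl.DataFrame({'values': [0, 1, 2]}).lazy().sink_ipc('example.out')
# Read back with scan_ipc's default settings; this emits the warning under "Log output"
pl.scan_ipc('example.out').collect()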

Log output

Could not mmap compressed IPC file, defaulting to normal read. Toggle off 'memory_map' to silence this warning.

Issue description

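Reading documentation, one learns that:

  • write_ipc writes uncompressed files by default.

  • sink_ipc compresses its output by default (zstd).

  • scan_ipc memory-maps the file by default (memory_map=True), which only works for uncompressed files.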

This leads to very inconsistent behaviour: things suddenly break when one replaces write_ipc with its lazy counterpart, sink_ipc. Moreover, it is confusing that the default behaviour of scan_ipc is incompatible with the default behaviour of sink_ipc.

Expected behavior

At minimum, sink_ipc followed by scan_ipc should not emit warnings. This can be achieved either by disabling default memory mapping in scan_ipc or by changing default compression to uncompressed.

Ideally, the eager (write_ipc) and lazy (sink_ipc) versions should follow the same defaults.

Installed versions

--------Version info---------
Polars: 0.19.3
Index type: UInt32
Platform: macOS-13.6-arm64-arm-64bit
Python: 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy: 1.25.2
pandas: 2.1.1
pyarrow: 13.0.0
pydantic:
sqlalchemy:
xlsx2csv:
xlsxwriter:

hleumas, Oct 07 '23 11:10

sink_ipc doesn't even have the option to set compression to uncompressed. I wonder what's the reason?

howsiyu, Oct 30 '23 14:10

I hope I can sink_ipc uncompressed so that I can later scan_ipc memory-mapped.

mutecamel, Jan 11 '24 11:01

I also struggle to sink IPC uncompressed for later mmap use. I have a large amount of data that does not fit in RAM.

The only option seems to be lazy_df.collect().write_ipc(), but my data is too large for that. This undermines the whole concept of the Lazy API.

dankal444, May 07 '24 15:05