Inconsistency among write_ipc, sink_ipc, and scan_ipc
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl

# sink_ipc compresses with zstd by default
pl.DataFrame({'values': [0, 1, 2]}).lazy().sink_ipc('example.out')
# scan_ipc memory-maps by default, which fails for compressed IPC
pl.scan_ipc('example.out').collect()
Log output
Could not mmap compressed IPC file, defaulting to normal read. Toggle off 'memory_map' to silence this warning.
Issue description
Reading the documentation, one learns that:
- the default compression algorithm for write_ipc is uncompressed
- the default compression algorithm for sink_ipc is zstd
- unless explicitly specified otherwise, scan_ipc assumes it is reading an uncompressed IPC file
- according to the documentation, uncompressed isn't among the allowed compression algorithms for sink_ipc

The sketch after this list illustrates how these defaults collide.
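A minimal sketch of the collision, assuming only the defaults described above:

import polars as pl

df = pl.DataFrame({'values': [0, 1, 2]})

# Eager path: write_ipc defaults to uncompressed, so scan_ipc can
# memory-map the file and no warning is emitted.
df.write_ipc('eager.out')
pl.scan_ipc('eager.out').collect()

# Lazy path: sink_ipc defaults to zstd, so the very same scan_ipc call
# cannot mmap the file and falls back to a normal read with a warning.
df.lazy().sink_ipc('lazy.out')
pl.scan_ipc('lazy.out').collect()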
This leads to highly inconsistent behaviour: code that works suddenly breaks when write_ipc is replaced with its lazy counterpart. Moreover, the fact that the default behaviour of scan_ipc isn't compatible with the default behaviour of sink_ipc is confusing in itself.
Expected behavior
At minimum, sink_ipc followed by scan_ipc should not emit warnings. This can be achieved either by disabling memory mapping by default in scan_ipc or by changing the default compression to uncompressed. Ideally, the eager and lazy versions should follow the same defaults.
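Until the defaults are aligned, the warning itself points at a workaround: disable memory mapping on the scan side. A minimal sketch, assuming the memory_map parameter named in the warning:

import polars as pl

pl.DataFrame({'values': [0, 1, 2]}).lazy().sink_ipc('example.out')

# memory_map=False skips the mmap attempt on the zstd-compressed file,
# so the pair works silently (at the cost of a normal, non-mapped read)
pl.scan_ipc('example.out', memory_map=False).collect()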
Installed versions
sink_ipc doesn't even have an option to set compression to uncompressed. I wonder what the reason is? I hope I can sink_ipc uncompressed so that I can later scan_ipc memory-mapped.
I also struggle to sink IPC uncompressed for later mmap use. I have a large amount of data that doesn't fit in RAM. The only option seems to be lazy_df.collect().write_ipc(), but my data is too large for that. This undermines the whole concept of the lazy API.
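For illustration, a sketch of the pattern these comments are asking for. The compression=None argument is an assumption, not a documented option; the documented choices for sink_ipc are only lz4 and zstd, which is precisely the gap being raised:

import polars as pl

# Hypothetical larger-than-RAM source
lf = pl.scan_csv('big_input.csv')

# Assumption: an uncompressed sink, e.g. via compression=None; the docs
# do not list such an option, which is what this comment questions.
lf.sink_ipc('big.arrow', compression=None)

# An uncompressed IPC file could then be memory-mapped by scan_ipc,
# keeping the whole pipeline out-of-core.
lazy_mapped = pl.scan_ipc('big.arrow')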