fix(rust): Added proper handling of file.write for large remote csv files
Observation
Using the `LazyCsvReader` in combination with an object store fails when the size of the file being read exceeds 2 MB.
The error is thrown in `crates/polars-io/src/file_cache/entry.rs` when reading the files from cache and comparing their on-disk size with the remote size:
```
Err(Context { error: Context { error: ComputeError(ErrString("downloaded file size (2097152) does not match expected size (12004337)")), msg: ErrString("'csv scan' failed") }, msg: ErrString("'select' input failed to resolve") })
```
Replication
The problem can easily be replicated; all that is required is an object store with a large enough CSV file:
```rust
#[tokio::test(flavor = "multi_thread")]
async fn polars_read_single_csv() {
    // Any CSV file on the object store larger than 2 MB triggers the error.
    let path = OBJECT_STORE_BASE_URL.to_string() + "more_than_2mb_csv_file.csv";
    let path = Path::new(&path);
    let lazy_frame = LazyCsvReader::new(path).finish().unwrap();
    println!("{:?}", lazy_frame.count().collect())
}
```
Reason
The file is written to disk using the Tokio `file.write` API. On success, this API returns `Ok(n)`, with `n` being the number of bytes written to the file. `n` may be smaller than the number of bytes passed in, indicating that not all bytes have been written. This was not properly handled, so the file stored in the disk cache might contain only the first part of the actual file.
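For illustration, a minimal sketch of the problematic pattern, using a hypothetical helper rather than the actual polars-io code:

```rust
use tokio::{fs::File, io::AsyncWriteExt};

// Problematic pattern (illustrative): a single `write` call may flush only
// part of the buffer. If the returned byte count is not checked, the cached
// file silently ends up truncated to its first `n` bytes.
async fn cache_bytes(file: &mut File, bytes: &[u8]) -> std::io::Result<()> {
    let n = file.write(bytes).await?; // Ok(n), where n may be < bytes.len()
    debug_assert!(n <= bytes.len());
    Ok(()) // bug: bytes[n..] was never written to disk
}
```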
Fix
Loop over the `write` method until all bytes are written.
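A minimal sketch of that loop, again with the hypothetical helper from above (tokio's `AsyncWriteExt::write_all` implements the same loop, which is what the PR later switched to):

```rust
use tokio::{fs::File, io::AsyncWriteExt};

// Sketch of the fix: retry `write` on the remaining slice until the whole
// buffer has been written, erroring out on a zero-length write.
async fn cache_bytes(file: &mut File, mut bytes: &[u8]) -> std::io::Result<()> {
    while !bytes.is_empty() {
        let n = file.write(bytes).await?;
        if n == 0 {
            return Err(std::io::ErrorKind::WriteZero.into());
        }
        bytes = &bytes[n..];
    }
    Ok(())
}
```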
Codecov Report
Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
Project coverage is 79.87%. Comparing base (85feb33) to head (793e50a). Report is 6 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/polars-io/src/cloud/polars_object_store.rs | 66.66% | 1 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main   #18424      +/-   ##
==========================================
  Coverage   79.86%   79.87%
==========================================
  Files        1501     1501
  Lines      201813   201815     +2
  Branches     2868     2868
==========================================
+ Hits       161177   161191    +14
+ Misses      40089    40077    -12
  Partials      547      547
```
@ritchie46, @orlp, @c-peters, any comments?
Is there a confirmed Python repro of this bug?
@ritchie46 yes, the same problem occurs when accessing the file via the Python wrapper:
```python
import polars as pl

OBJECT_STORE_URL = <censored>

# Lazy scan csv file
lazy_csv = pl.scan_csv(
    OBJECT_STORE_URL + "more_than_2mb_csv_file.csv",
)
lazy_csv.count().collect()
```
Results in the following error:
File "<path>/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2027, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: downloaded file size (2097152) does not match expected size (12004337)
This error occurred with the following context stack:
[1] 'csv scan' failed
[2] 'select' input failed to resolve
@nameexhaustion can you take a look at this one?
Left a comment - there is indeed a bug - we should be using something like `write_all()` instead.
I tried to, but couldn't get a repro with a 60M CSV file from S3; I suspect there's some environmental factor (maybe low disk space?) needed to trigger it. Fortunately, the file cache also does a final check on the file size, so currently we at least raise an error instead of silently continuing with truncated data.
The written content length is exactly the length of the buffer on the file handle. I do not know which external factors determine that buffer size; for me it is set to the 2097152 bytes (2 MiB) seen in the error. Even if this is not reproducible for you, the error will occur once the file is big enough to exceed this buffer size.
Anyway, I agree that using `write_all` is more elegant; I did not know of this method. Changed the PR accordingly.
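For reference, the `write_all`-based version of the same hypothetical helper sketched earlier; `write_all` performs the retry loop internally:

```rust
use tokio::{fs::File, io::AsyncWriteExt};

// `write_all` keeps writing until the whole buffer is on disk and returns
// an error on a zero-length write, so no manual loop is needed.
async fn cache_bytes(file: &mut File, bytes: &[u8]) -> std::io::Result<()> {
    file.write_all(bytes).await
}
```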
LGTM 👍