
fix(rust): Added proper handling of file.write for large remote csv files

Open WbaN314 opened this issue 1 year ago • 1 comments

Observation

Using the LazyCsvReader in combination with an object store fails when the size of the file being read exceeds 2 MB.

The error is thrown in crates/polars-io/src/file_cache/entry.rs when the file is read from the cache and its on-disk size is compared with the remote size.

Err(Context { error: Context { error: ComputeError(ErrString("downloaded file size (2097152) does not match expected size (12004337)")), msg: ErrString("'csv scan' failed") }, msg: ErrString("'select' input failed to resolve") })

Replication

The problem is easy to replicate; all that is required is an object store with a large enough csv file:

    // Assumes the polars crate with the lazy and csv features enabled, and that
    // OBJECT_STORE_BASE_URL points to an object store containing the test file.
    use polars::prelude::*;
    use std::path::Path;

    #[tokio::test(flavor = "multi_thread")]
    async fn polars_read_single_csv() {
        let path = OBJECT_STORE_BASE_URL.to_string()
            + "more_than_2mb_csv_file.csv";
        let path = Path::new(&path);
        let lazy_frame = LazyCsvReader::new(path)
            .finish()
            .unwrap();
        println!("{:?}", lazy_frame.count().collect())
    }

Reason

The file is written to disk using the Tokio file.write API. On success, this API returns Ok(n), where n is the number of bytes actually written to the file. n may be smaller than the number of bytes passed in, indicating that not all bytes have been written. This case was not handled, so the file stored in the disk cache might contain only the first part of the actual file.
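
For illustration, a minimal sketch of the problematic pattern, assuming tokio's AsyncWriteExt; this is not the actual polars-io code and the helper name is made up:

    use tokio::io::AsyncWriteExt;

    // write returns Ok(n) with n possibly smaller than buf.len().
    // If n is never compared against buf.len(), the cached file can
    // silently end up truncated on disk.
    async fn write_once(file: &mut tokio::fs::File, buf: &[u8]) -> std::io::Result<usize> {
        let n = file.write(buf).await?; // a partial write is a legal outcome
        Ok(n) // the caller must check n against buf.len()
    }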

Fix

Loop over the write method until all bytes are written.
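
A hedged sketch of such a loop (illustrative helper, not the actual PR diff): keep writing the remaining slice until the whole buffer is on disk; tokio's write_all does essentially the same internally.

    use std::io::{Error, ErrorKind};
    use tokio::{fs::File, io::AsyncWriteExt};

    async fn write_fully(file: &mut File, bytes: &[u8]) -> std::io::Result<()> {
        let mut written = 0;
        while written < bytes.len() {
            match file.write(&bytes[written..]).await? {
                // A return value of 0 means no progress can be made; bail out
                // instead of looping forever.
                0 => return Err(Error::new(ErrorKind::WriteZero, "failed to write whole buffer")),
                n => written += n,
            }
        }
        Ok(())
    }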

WbaN314 avatar Aug 28 '24 09:08 WbaN314

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.87%. Comparing base (85feb33) to head (793e50a). Report is 6 commits behind head on main.

Files with missing lines | Patch % | Lines
crates/polars-io/src/cloud/polars_object_store.rs | 66.66% | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #18424   +/-   ##
=======================================
  Coverage   79.86%   79.87%           
=======================================
  Files        1501     1501           
  Lines      201813   201815    +2     
  Branches     2868     2868           
=======================================
+ Hits       161177   161191   +14     
+ Misses      40089    40077   -12     
  Partials      547      547           


codecov[bot] avatar Aug 28 '24 10:08 codecov[bot]

@ritchie46, @orlp, @c-peters, any comments?

WbaN314 avatar Sep 02 '24 06:09 WbaN314

Is there a confirmed Python repro of this bug?

ritchie46 avatar Sep 02 '24 08:09 ritchie46

@ritchie46 yes, the same problem occurs when accessing via the Python wrapper.

import polars as pl

OBJECT_STORE_URL = <censored>

# Lazy scan csv file
lazy_csv = pl.scan_csv(
    OBJECT_STORE_URL + "more_than_2mb_csv_file.csv",
)
lazy_csv.count().collect()

Results in the following error:

File "<path>/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: downloaded file size (2097152) does not match expected size (12004337)

This error occurred with the following context stack:
        [1] 'csv scan' failed
        [2] 'select' input failed to resolve

WbaN314 avatar Sep 02 '24 09:09 WbaN314

@nameexhaustion can you take a look at this one?

ritchie46 avatar Sep 02 '24 09:09 ritchie46

Left a comment - there is indeed a bug - we should be using something like write_all() instead.

I tried but couldn't get a repro with a 60M CSV file from S3; I suspect there's some environmental factor (maybe low disk space?) needed to trigger it. Fortunately, the file cache also checks the file size at the end, so currently we at least raise an error instead of silently continuing with truncated data.
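
For reference, a minimal sketch of what using write_all could look like here (illustrative names, not the actual polars_object_store.rs code):

    use tokio::{fs::File, io::AsyncWriteExt};

    async fn cache_to_disk(path: &str, bytes: &[u8]) -> std::io::Result<()> {
        let mut file = File::create(path).await?;
        // write_all loops internally until the entire buffer has been written,
        // returning an ErrorKind::WriteZero error if no progress can be made,
        // so partial writes can no longer truncate the cached file.
        file.write_all(bytes).await?;
        file.flush().await?;
        Ok(())
    }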

nameexhaustion avatar Sep 02 '24 10:09 nameexhaustion

The written content length is exactly the length of the buffer on the file handle. I do not know which external factors determine that buffer size; for me it is the 2097152 bytes seen in the error. If this is not reproducible for you, the error should occur once the file is big enough to exceed this buffer size.

Anyway, I agree that using write_all is more elegant; I did not know of this method. I have changed the PR accordingly.

WbaN314 avatar Sep 02 '24 11:09 WbaN314

LGTM 👍

nameexhaustion avatar Sep 02 '24 12:09 nameexhaustion