
`write_dataset` with multiple streams in parallel

chebbyChefNEQ opened this issue 1 year ago · 2 comments

Problem

It's often hard to saturate the IO throughput of object stores with a single read stream (which usually implies a single file being sequentially read from or written to).

Proposed Solution

Change Dataset::write from

pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self>

to

pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self> {
    write_streams(vec![batches], uri, params).await
}

pub async fn write_streams<R>(
    streams: Vec<R>,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self>
where
    R: RecordBatchReader + Send + 'static,

Each stream is expected to generate separate fragments. For example, say we have 2 streams and each stream contains 1001 rows. If we set max_rows_per_file = 1000, we should get 4 fragments of sizes [1000, 1, 1000, 1].
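To make the splitting rule concrete, here is a minimal sketch (not Lance code; `fragment_sizes` is a hypothetical helper) of how each stream would be chunked independently at `max_rows_per_file`, with the resulting fragments concatenated in stream order:

```rust
// Hypothetical illustration of the fragment-splitting rule described above.
// Each stream is split on its own; fragments are then listed in stream order.
fn fragment_sizes(stream_rows: &[u64], max_rows_per_file: u64) -> Vec<u64> {
    let mut sizes = Vec::new();
    for &rows in stream_rows {
        let mut remaining = rows;
        while remaining > 0 {
            let take = remaining.min(max_rows_per_file);
            sizes.push(take);
            remaining -= take;
        }
    }
    sizes
}

fn main() {
    // Two streams of 1001 rows each, max_rows_per_file = 1000,
    // as in the example above.
    assert_eq!(
        fragment_sizes(&[1001, 1001], 1000),
        vec![1000, 1, 1000, 1]
    );
    println!("{:?}", fragment_sizes(&[1001, 1001], 1000));
}
```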

This design allows us to parallelize the writes in Rust and avoids having to deal with painful Python threading.
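As a rough sketch of the fan-out this enables, the snippet below spawns one worker per stream and joins the results in stream order. The per-stream writer is a stand-in that just sums batch sizes; in a real implementation it would write that stream's batches into its own fragments (names like `write_one_stream` are illustrative, not Lance APIs):

```rust
use std::thread;

// Stand-in for per-stream fragment writing: here it only counts rows.
fn write_one_stream(batch_sizes: Vec<u64>) -> u64 {
    batch_sizes.into_iter().sum()
}

// Fan out one worker per stream; join in order so the per-stream
// results (and hence fragment ordering) stay deterministic.
fn write_streams_parallel(streams: Vec<Vec<u64>>) -> Vec<u64> {
    let handles: Vec<_> = streams
        .into_iter()
        .map(|s| thread::spawn(move || write_one_stream(s)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let totals = write_streams_parallel(vec![vec![500, 501], vec![1001]]);
    assert_eq!(totals, vec![1001, 1001]);
    println!("{:?}", totals);
}
```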

chebbyChefNEQ avatar Aug 22 '24 04:08 chebbyChefNEQ

Related: https://github.com/lancedb/lance/issues/1980

If we want to do the threading in Rust for Python users, that seems reasonable. I'm less certain we need to do this for Rust users, who likely want more control over threading.

wjones127 avatar Aug 22 '24 15:08 wjones127

> Related: https://github.com/lancedb/lance/issues/1980
>
> If we want to do the threading in Rust for Python users, that seems reasonable. I'm less certain we need to do this for Rust users, who likely want more control over threading.

Yes, the main motivation is to make threading easier from Python. Maybe we should just implement this in the Python Rust bindings instead of the Rust core.

chebbyChefNEQ avatar Aug 22 '24 15:08 chebbyChefNEQ