
`write_dataset` with multiple streams in parallel

chebbyChefNEQ opened this issue 6 months ago • 2 comments

Problem

It's often hard to saturate the IO throughput of object stores with a single stream, since a single stream usually implies that a single file is being sequentially read from or written to.

Proposed Solution

Change Dataset::write from

```rust
pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self>
```

to

```rust
pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self> {
    write_streams(vec![batches], uri, params).await
}

pub async fn write_streams<R>(
    streams: Vec<R>,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self>
where
    R: RecordBatchReader + Send + 'static,
```

Each stream is expected to generate separate fragments. For example, say we have 2 streams and each stream contains 1001 rows. If we set `max_rows_per_file = 1000`, we should get 4 fragments with sizes [1000, 1, 1000, 1].
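The fragment-splitting rule above can be sketched as follows. `fragment_sizes` is a hypothetical helper for illustration only, not part of Lance's API: each stream is chunked independently into files of at most `max_rows_per_file` rows, and the leftover rows of one stream never merge into the next stream's fragments.

```rust
/// Illustrative only: compute the fragment sizes produced when each
/// stream is split independently at `max_rows_per_file` row boundaries.
fn fragment_sizes(stream_rows: &[usize], max_rows_per_file: usize) -> Vec<usize> {
    let mut sizes = Vec::new();
    for &rows in stream_rows {
        let mut remaining = rows;
        while remaining > 0 {
            // A fragment holds at most `max_rows_per_file` rows.
            let n = remaining.min(max_rows_per_file);
            sizes.push(n);
            remaining -= n;
        }
    }
    sizes
}
```

With two streams of 1001 rows and `max_rows_per_file = 1000`, this yields `[1000, 1, 1000, 1]`, matching the example above.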

This design lets us parallelize the writes in Rust and avoids dealing with painful Python threading.
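A minimal sketch of the one-writer-per-stream idea, using plain `std::thread` rather than Lance's actual async machinery; the function name and the row-count inputs are illustrative assumptions, not the proposed implementation:

```rust
use std::thread;

/// Illustrative only: write each stream on its own thread, then stitch
/// the per-stream fragment lists back together in stream order. Here a
/// "stream" is just its row count and "writing" just computes the
/// fragment sizes it would produce.
fn parallel_fragment_sizes(streams: Vec<usize>, max_rows_per_file: usize) -> Vec<usize> {
    let handles: Vec<_> = streams
        .into_iter()
        .map(|rows| {
            thread::spawn(move || {
                // Stand-in for writing one stream's fragments.
                let mut sizes = Vec::new();
                let mut remaining = rows;
                while remaining > 0 {
                    let n = remaining.min(max_rows_per_file);
                    sizes.push(n);
                    remaining -= n;
                }
                sizes
            })
        })
        .collect();
    // Joining in spawn order keeps the fragment ordering deterministic
    // regardless of which thread finishes first.
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}
```

Joining handles in spawn order is what makes the resulting fragment list stable across runs even though the streams are written concurrently.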

chebbyChefNEQ avatar Aug 22 '24 04:08 chebbyChefNEQ