`write_dataset` with multiple streams in parallel
Problem
It's often hard to saturate the IO throughput of object stores with a single stream, which usually implies a single file being sequentially read from or written to.
Proposed Solution
Change `Dataset::write` from:
```rust
pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self>
```
to:

```rust
pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self> {
    write_streams(vec![batches], uri, params).await
}

pub async fn write_streams(
    streams: Vec<impl RecordBatchReader + Send + 'static>,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self>
```
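Internally, `write_streams` could fan the streams out to one Tokio task each and commit all resulting fragments as a single new version. Below is a minimal sketch of that shape; `write_fragments` and `commit_fragments` are hypothetical helpers (not existing Lance APIs) standing in for "drain one stream into fragment files" and "record all fragments in one manifest", and it assumes `WriteParams: Clone`:

```rust
use arrow_array::RecordBatchReader;
use futures::future::try_join_all;

pub async fn write_streams(
    streams: Vec<impl RecordBatchReader + Send + 'static>,
    uri: &str,
    params: Option<WriteParams>,
) -> Result<Self> {
    // One writer task per stream, so uploads to the object store overlap.
    let handles: Vec<_> = streams
        .into_iter()
        .map(|stream| {
            let uri = uri.to_string();
            let params = params.clone();
            // Hypothetical helper: drains `stream` into one or more fragment
            // files (splitting at max_rows_per_file) and returns their metadata.
            tokio::spawn(async move { write_fragments(stream, &uri, params).await })
        })
        .collect();

    // Await every task, keeping the fragments in stream order.
    let fragments: Vec<Fragment> = try_join_all(handles)
        .await
        .expect("writer task panicked") // JoinError from a panicked task
        .into_iter()
        .collect::<Result<Vec<_>>>()? // first per-stream write error, if any
        .into_iter()
        .flatten()
        .collect();

    // Hypothetical helper: writes a single manifest referencing all
    // fragments, producing one new dataset version.
    commit_fragments(uri, fragments).await
}
```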
Each stream is expected to generate separate fragments. E.g. say we have 2 streams and each stream contains 1001 rows: if we set `max_rows_per_file = 1000`, we should get 4 fragments of sizes `[1000, 1, 1000, 1]`.
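A call matching that scenario might look like the following (hypothetical usage; `reader_a` and `reader_b` stand in for any two 1001-row `RecordBatchReader`s, and the `WriteParams { max_rows_per_file, .. }` field mirrors the existing write path):

```rust
// Two independent 1001-row readers; each is split at the 1000-row boundary.
let params = WriteParams {
    max_rows_per_file: 1000,
    ..Default::default()
};
let dataset = Dataset::write_streams(
    vec![reader_a, reader_b],
    "s3://bucket/table.lance",
    Some(params),
)
.await?;
// Resulting fragment sizes, in stream order: [1000, 1, 1000, 1]
```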
This design lets us parallelize the writes in Rust and avoids dealing with painful Python threading.