horaedb
Use `interleave` to optimize merge iterator
Describe This Problem
In the merge iterator, we currently extract rows one at a time from the source record batches and append them one at a time to the destination record batch.
However, Arrow exposes an `interleave` function for merging multiple record batches into a single batch.
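To make the difference concrete, here is a minimal std-only sketch of the two strategies for a single string column. It is not Arrow's implementation: `StrColumn`, `merge_row_by_row`, and `merge_interleave` are hypothetical names that only model a Utf8 array as an offsets vector plus one byte buffer. The one thing taken from Arrow is the shape of the selection, a slice of `(array_index, row_index)` pairs, which is what `arrow::compute::interleave` accepts alongside the source arrays.

```rust
/// Simplified stand-in for an Arrow Utf8 array (hypothetical type):
/// all bytes live in one buffer, and `offsets[i]..offsets[i + 1]`
/// delimits value `i`.
#[derive(Debug, PartialEq)]
struct StrColumn {
    offsets: Vec<usize>,
    data: String,
}

impl StrColumn {
    fn from_strs(values: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = String::new();
        for v in values {
            data.push_str(v);
            offsets.push(data.len());
        }
        StrColumn { offsets, data }
    }

    fn value(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Row-by-row: append each selected value individually and let the
/// output buffers grow (and reallocate) on demand.
fn merge_row_by_row(cols: &[StrColumn], indices: &[(usize, usize)]) -> StrColumn {
    let mut out = StrColumn { offsets: vec![0], data: String::new() };
    for &(batch, row) in indices {
        out.data.push_str(cols[batch].value(row));
        out.offsets.push(out.data.len());
    }
    out
}

/// Interleave-style: a first pass sizes the output exactly, then a
/// second pass copies the values into preallocated buffers.
fn merge_interleave(cols: &[StrColumn], indices: &[(usize, usize)]) -> StrColumn {
    let total: usize = indices
        .iter()
        .map(|&(batch, row)| cols[batch].value(row).len())
        .sum();
    let mut offsets = Vec::with_capacity(indices.len() + 1);
    let mut data = String::with_capacity(total);
    offsets.push(0);
    for &(batch, row) in indices {
        data.push_str(cols[batch].value(row));
        offsets.push(data.len());
    }
    StrColumn { offsets, data }
}
```

Both functions produce the same column. The second one knows the full selection up front, so it can allocate once per column instead of growing buffers as rows arrive, which is one plausible reason the gap in a benchmark would widen for long strings.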
Demo design:
- mixed schema:
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("city", DataType::Utf8, false),
Field::new("weight", DataType::Float64, false),
]));
- pure string schema:
let schema = Arc::new(Schema::new(vec![
Field::new("city1", DataType::Utf8, false),
Field::new("city2", DataType::Utf8, false),
Field::new("city3", DataType::Utf8, false),
]));
- data:
let string = vec!["shanghai"; 8192];
let long_string = vec!["shanghaishanghaishanghai"; 8192];
let f64 = vec![42.0; 8192];
let i64 = vec![1234; capacity];
The demo generates 10 record batches of 8192 rows each from the schemas and data above, then randomly merges them into a single record batch. This is repeated 20000 times.
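The issue does not say how the random merge order is produced, so the following is only one possible reading, sketched with the standard library. It assumes each source batch is consumed in its own row order (as a merge iterator would do) while the choice of which batch supplies the next row is random. `XorShift64` and `random_merge_indices` are hypothetical helpers, not part of the demo or of Arrow; the resulting `(batch_idx, row_idx)` pairs have the shape that Arrow's `interleave` kernel takes.

```rust
/// Tiny deterministic xorshift generator so the sketch needs no
/// external crate (hypothetical helper, not from the demo).
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Build `(batch_idx, row_idx)` pairs that consume every row of every
/// batch. The source batch is picked at random for each output row,
/// but rows within one batch always appear in ascending order.
fn random_merge_indices(
    num_batches: usize,
    rows_per_batch: usize,
    seed: u64,
) -> Vec<(usize, usize)> {
    if rows_per_batch == 0 {
        return Vec::new();
    }
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut next_row = vec![0usize; num_batches];
    // Batches that still have unconsumed rows.
    let mut live: Vec<usize> = (0..num_batches).collect();
    let mut indices = Vec::with_capacity(num_batches * rows_per_batch);
    while !live.is_empty() {
        let slot = (rng.next() % live.len() as u64) as usize;
        let batch = live[slot];
        indices.push((batch, next_row[batch]));
        next_row[batch] += 1;
        if next_row[batch] == rows_per_batch {
            live.swap_remove(slot); // this batch is exhausted
        }
    }
    indices
}
```

Calling `random_merge_indices(10, 8192, seed)` would give one index list covering all rows of the 10 batches; the same list can then be fed to both the row-by-row path and the `interleave` path so that the two timings are measured on identical work.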
This is the test result for my simple demo:
| | row by row | interleave |
|---|---|---|
| mixed | 2.19s | 1.76s |
| mixed (long string) | 2.54s | 1.95s |
| pure string | 3.67s | 2.87s |
| pure long string | 9.66s | 5.12s |
Proposal
See title. We could also consider using DataFusion to merge record batches, since it has already done a lot of optimization work on merging. I will test that later.
Additional Context
No response