
Use `interleave` to optimize merge iterator

Open Rachelint opened this issue 1 year ago • 0 comments

Describe This Problem

In the merge iterator, we currently extract rows one by one from the source record batches and append them one by one to the destination record batch. However, arrow exposes an `interleave` function for merging multiple record batches into a single batch.
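To make the idea concrete: arrow's `interleave` kernel takes a slice of source arrays plus a list of `(array index, row index)` pairs and builds the output array in one pass per column. The snippet below is only a std-only model of that access pattern, not Arrow code and not the merge iterator's implementation. The names `interleave_column` and `interleave_batches`, and the use of `Vec<Vec<String>>` as a stand-in for a record batch, are assumptions made for illustration.

```rust
/// Std-only model of a column-wise gather (hypothetical helper, not Arrow's API).
/// `columns[b]` is the same logical column taken from source batch `b`;
/// `indices` lists `(batch_index, row_index)` pairs in output order.
pub fn interleave_column<T: Clone>(columns: &[&[T]], indices: &[(usize, usize)]) -> Vec<T> {
    // The output size is known up front, so the column is allocated once.
    let mut out = Vec::with_capacity(indices.len());
    for &(batch, row) in indices {
        out.push(columns[batch][row].clone());
    }
    out
}

/// Merge whole batches by gathering each column once, instead of
/// assembling the output one row at a time across all columns.
/// A batch is modeled as a list of string columns (an assumption for brevity).
pub fn interleave_batches(
    batches: &[Vec<Vec<String>>],
    indices: &[(usize, usize)],
) -> Vec<Vec<String>> {
    let num_columns = batches.first().map_or(0, |b| b.len());
    (0..num_columns)
        .map(|col| {
            // Collect column `col` from every source batch, then gather it.
            let columns: Vec<&[String]> =
                batches.iter().map(|b| b[col].as_slice()).collect();
            interleave_column(&columns, indices)
        })
        .collect()
}
```

With two single-column batches `["a0", "a1"]` and `["b0", "b1"]` and indices `[(0, 0), (1, 0), (1, 1), (0, 1)]`, the merged column is `["a0", "b0", "b1", "a1"]`. The merge iterator would only need to produce the index list; the per-column copying is then left to the kernel.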

Demo design:

  • mixed schema:
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("city", DataType::Utf8, false),
        Field::new("weight", DataType::Float64, false),
    ]));
  • pure string schema:
    let schema = Arc::new(Schema::new(vec![
        Field::new("city1", DataType::Utf8, false),
        Field::new("city2", DataType::Utf8, false),
        Field::new("city3", DataType::Utf8, false),
    ]));
  • data:
    let string = vec!["shanghai"; 8192];
    let long_string = vec!["shanghaishanghaishanghai"; 8192];
    let f64 = vec![42.0; 8192];
    let i64 = vec![1234; capacity];

The demo generates 10 record batches of 8192 rows each from the schemas and data above, and randomly merges them into a single record batch. This is repeated 20000 times.
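The original demo code is not shown here, so the following is only a guess at how a random merge order could be produced for such a benchmark. It assumes that rows from each batch keep their relative order (as they would in a merge iterator) and that the next source batch is picked at random. `XorShift64` and `random_merge_indices` are hypothetical helpers; the small xorshift generator is there only to avoid an external crate.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate
/// (hypothetical helper, not suitable for anything beyond test data).
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must be non-zero.
        XorShift64(seed.max(1))
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Build a random merge order over `num_batches` batches of `rows_per_batch` rows.
/// Each entry is a `(batch_index, row_index)` pair; rows of a given batch
/// appear in increasing order, only the choice of batch is random.
pub fn random_merge_indices(
    num_batches: usize,
    rows_per_batch: usize,
    rng: &mut XorShift64,
) -> Vec<(usize, usize)> {
    let mut cursors = vec![0usize; num_batches];
    // Batches that still have rows left to emit.
    let mut live: Vec<usize> = if rows_per_batch == 0 {
        Vec::new()
    } else {
        (0..num_batches).collect()
    };
    let mut indices = Vec::with_capacity(num_batches * rows_per_batch);
    while !live.is_empty() {
        let slot = (rng.next_u64() % live.len() as u64) as usize;
        let batch = live[slot];
        indices.push((batch, cursors[batch]));
        cursors[batch] += 1;
        if cursors[batch] == rows_per_batch {
            // This batch is exhausted; stop selecting it.
            live.swap_remove(slot);
        }
    }
    indices
}
```

The resulting index list can drive both variants of the benchmark: the row-by-row path walks it and appends one row at a time, while the `interleave` path passes it to the kernel once per column.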

This is the test result for my simple demo:

                      row by row    interleave
mixed                 2.19s         1.76s
mixed(long string)    2.54s         1.95s
pure string           3.67s         2.87s
pure long string      9.66s         5.12s

Proposal

See title. We may also be able to use datafusion to merge record batches, since it has already done a lot of optimization work on merging. I will test this later.

Additional Context

No response

Rachelint avatar Jul 05 '23 03:07 Rachelint