Feature: Merge Into Optimizations
Merge Into is supported by #12350, but it is not yet complete. We still have many things to do.
A basic implementation:
Optimizations
- [ ] improve merge into in distributed mode #13970
- [ ] support auto spill for merge into. https://github.com/datafuselabs/databend/issues/13930 assigned to @JackTan25
- [x] full optimizer import for merge into #13950 in merge into runtime filter branch.
- [x] merge into should output the affected row counts for insert/delete/update, details in #13855 assigned to @JackTan25 https://github.com/datafuselabs/databend/pull/13903
- [ ] reduce segment-location memory for distributed merge into (currently the `segments` field of the `MergeInto` struct in src/query/sql/src/executor/physical_plans/physical_merge_into.rs holds a hashmap of segmentId -> segment location for all segments on every node; in fact each node only needs the segments assigned to it). (this optimization can be assigned to @JackTan25 or @SkyFan2002)
- [ ] meta table (reference to BigQuery's big metadata paper) (this task is taken by @SkyFan2002)
- [x] a more powerful hash table which supports splitting data blocks as above (assigned to @JackTan25) #14066
- [ ] use a bitmap for duplicate check (assigned to @JackTan25 or @SkyFan2002)
- [x] optimizer for query source. #13744
- [ ] support NOT MATCHED BY SOURCE (full outer join).
- [x] insert-only optimization. #13680
- [ ] pre-compute columns for match_expr/not-match expr.
- [x] parallel merge into https://github.com/datafuselabs/databend/pull/13045
- [x] distributed merge into https://github.com/datafuselabs/databend/pull/13151
- [x] support streaming source, table, values #12968 (12968 is enough)
- [x] support computed expr #12905
- [x] support an input schema that differs from the table schema (filling default columns) #12747
- [x] support star "*" function to enhance merge into. #12906
- [x] fetch complete fields for update clause to simplify block split. #12622 (fetch all fields when there is update)
- [ ] push down matched and unmatched clause exprs (target-table fields only) (assigned to @JackTan25) #13082
- [x] use the if function and merge update and delete exprs to optimize the case where updates and deletes are separate (advised by @b41sh) #12622 (a better way)
- [x] use origin block to do insert and update (advised by @b41sh ) #12622
- [x] push down check duplicate and idempotent delete #12780
Hello, may I help with completing this part of the development? I'm quite interested in this aspect.
Sure, thanks.
But we are still in the early stage of the merge into design, so it's better to wait a while until the fundamental work is done; then we can split out some tasks, and I think that would be a good time.
Perhaps this question seems a bit forward, but may I also participate in the basic design and development of 'merge_into'? I plan to apply for an internship with Databend after completing one or two issues, so I would like a deeper understanding of Databend's overall design.
@ct20000901 Feel free to take any task/issue you are interested in :)
@ct20000901 you can try this one first: https://github.com/datafuselabs/databend/issues/12901. Please comment "/assign me" at https://github.com/datafuselabs/databend/issues/12901.