ArcticDB
Merge update initial implementation
Reference Issues/PRs
Monday:
What does this implement or fix?
This provides an initial implementation of merge functionality that supports only the update part. Matching is supported only on an ordered DatetimeIndex and with a static schema.
The algorithm takes advantage of the fact that both the source and the target are ordered.
- Iterate over all slices in the index key and produce a list of objects describing which slices can contain rows from the source. This is done by performing a lower_bound (binary search) in the source index for the start index value stored in the slice. If the returned value is between key_start_index and key_end_index, the data segment could be affected. The complexity is O(index_row_count * log(source_row_count)). The information is stored as a pair: the index of the affected slice in the index key and the first index value from the source that falls into that slice.
- Only the potentially affected data keys are read.
- For each data key (in parallel), iterate over all index values in the source that lie between the first and last index values of the data key, and perform a lower_bound (binary search) to check whether the index value from the source is in the segment. If it is, perform the update. The complexity is O(source_row_count * log(segment_size)).
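The steps above can be sketched in plain Python to make the two binary searches concrete. This is only an illustration, not the C++ code in this PR: the `Slice` class, the integer index values standing in for timestamps, the single-column layout, the function signatures, and storing the first matching source row position (rather than its index value) are all assumptions made for the sketch. It also assumes index values in the target are unique.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Slice:
    """Hypothetical stand-in for one row slice listed in the index key."""
    start_index: int   # first index value in the slice (inclusive)
    end_index: int     # last index value in the slice (inclusive, assumed)
    index: list        # sorted index values held by the data segment
    values: list       # one column of values, aligned with `index`


def find_affected_slices(slices, source_index):
    """Step 1: lower_bound of each slice start in the sorted source index.

    Returns (slice_position, first_source_row) pairs for slices that may
    contain at least one source row.
    """
    affected = []
    for pos, s in enumerate(slices):
        first = bisect_left(source_index, s.start_index)
        if first < len(source_index) and source_index[first] <= s.end_index:
            affected.append((pos, first))
    return affected


def update_segment_inplace(segment, source_index, source_values, first_source_row):
    """Step 3: row-wise update of one segment (signature is hypothetical)."""
    row = first_source_row
    while row < len(source_index) and source_index[row] <= segment.end_index:
        target_row = bisect_left(segment.index, source_index[row])
        found = (target_row < len(segment.index)
                 and segment.index[target_row] == source_index[row])
        if found:
            segment.values[target_row] = source_values[row]
        row += 1


def merge_update(slices, source_index, source_values):
    # Step 2 is implicit here: only the affected slices are touched. In a real
    # store this is where the corresponding data keys would be read.
    for pos, first in find_affected_slices(slices, source_index):
        update_segment_inplace(slices[pos], source_index, source_values, first)
```

Source rows whose index value is not present in the target are skipped, which matches the update-only scope described above.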
Next steps:
The iteration in step 3 above is row-wise. This will be slow for DataFrames containing UTF string values, because reading UTF strings requires holding the GIL, and row-wise iteration is in general not cache friendly. This initial implementation uses row-wise iteration because it is easier to implement. Column-wise iteration would need to either perform O(slice_column_count * source_row_count * log(segment_size)) work or use a caching mechanism that maps each source row to its segment row. Another difficulty is related to the on clause: with an on clause we need to check the entire row (across all segments) to know whether an update should be performed. The long-term plan is to add an additional step before update_segment_inplace that iterates over all slices and generates a list of (UPDATE/INSERT, row_in_target_segment, row_in_source) tuples.
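One possible shape for that planning step, again as a hedged Python sketch and not the planned implementation: the binary search runs once per source row, and the resulting plan is then replayed column by column. The names `Action`, `plan_segment`, and `apply_updates_columnwise` are hypothetical, as is the choice to record the lower_bound position as `row_in_target_segment` for INSERT entries. The on clause is not modelled; rows are matched on the index only.

```python
from bisect import bisect_left
from enum import Enum


class Action(Enum):
    UPDATE = "UPDATE"
    INSERT = "INSERT"


def plan_segment(segment_index, source_index, first_source_row):
    """Build (action, row_in_target_segment, row_in_source) tuples.

    `segment_index` and `source_index` are sorted lists of index values.
    """
    plan = []
    if not segment_index:
        return plan
    last = segment_index[-1]
    row = first_source_row
    while row < len(source_index) and source_index[row] <= last:
        # source_index[row] <= last guarantees target_row is in range.
        target_row = bisect_left(segment_index, source_index[row])
        matched = segment_index[target_row] == source_index[row]
        plan.append((Action.UPDATE if matched else Action.INSERT, target_row, row))
        row += 1
    return plan


def apply_updates_columnwise(target_columns, source_columns, plan):
    """Replay only the UPDATE entries, one column at a time."""
    for name, target_col in target_columns.items():
        source_col = source_columns[name]
        for action, target_row, source_row in plan:
            if action is Action.UPDATE:
                target_col[target_row] = source_col[source_row]
```

With a plan like this the search cost stays at O(source_row_count * log(segment_size)) per row slice, and each column is updated in a single linear pass over the plan.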
Any other comments?
Checklist
Checklist for code changes...
- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against all ArcticDB's features?
- [ ] Do all exceptions introduced raise appropriate error messages?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?