greptimedb
Optimize compaction strategy to avoid merging SST files without overlapping time ranges
What type of enhancement is this?
Performance
What does the enhancement do?
With the introduction of the new merge-tree memtable, SST files in L0 can be as large as several GiB. In this case, merging L0 files regardless of their size can be expensive and does little for read performance. We should adopt a "tiered" merging manner that skips large files and those that do not overlap with others in terms of time ranges.
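To illustrate the idea, here is a minimal sketch of a tiered candidate picker that skips oversized files and files whose time ranges overlap nothing else. This is not mito2's actual API; the `FileMeta` struct, its field names, and the size threshold are assumptions for illustration only.

```rust
// Sketch only: a simplified stand-in for the real SST file metadata.
#[derive(Debug, Clone)]
struct FileMeta {
    size: u64,
    time_range: (i64, i64), // inclusive (start, end) timestamps
}

/// Returns true if two files' time ranges overlap.
fn overlaps(a: &FileMeta, b: &FileMeta) -> bool {
    a.time_range.0 <= b.time_range.1 && b.time_range.0 <= a.time_range.1
}

/// Keep only files that are below the size threshold AND overlap at least one
/// other candidate; large or isolated files are skipped by the tiered policy.
fn pick_merge_candidates(files: &[FileMeta], max_input_size: u64) -> Vec<usize> {
    (0..files.len())
        .filter(|&i| {
            files[i].size <= max_input_size
                && files
                    .iter()
                    .enumerate()
                    .any(|(j, f)| j != i && overlaps(&files[i], f))
        })
        .collect()
}
```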
Tasks
- [ ] Only filter deleted entries when there's only one file in sorted run #3707
- [ ] Implement sorted-run based compaction to reduce write amplification #3702
Implementation challenges
No response
We compact small files together with a large file, e.g.
3.8G 4649c98-4edc-444e-a9bd-920d9b88b033.parquet
452M 8fea78f6-ad12-4344-a43d-c93092bfe068.parquet
426M d3a2ddcd-1ff4-4d82-a87d-d47d1f8490ba.parquet
423M fb727d76-2afd-487e-a286-cdeea29048dd.parquet
becomes
4.5G 5d79971b-ce56-4984-af07-5e60fa6f9436.parquet
We might stop compacting a window if the output file would be too large.
However, we need to choose files carefully: since we remove deleted keys and their
deletion markers during compaction, we should only compact files whose keys overlap.
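The size cutoff can be sketched as a greedy smallest-first selection that stops before the estimated output exceeds a cap. The `max_output_size` parameter and the greedy order are assumptions for illustration, not the actual picker logic:

```rust
/// Greedily select input files (smallest first) for one compaction, stopping
/// before the estimated output would exceed `max_output_size`.
/// Sizes are in arbitrary units (e.g. MiB).
fn select_inputs(mut sizes: Vec<u64>, max_output_size: u64) -> Vec<u64> {
    sizes.sort_unstable();
    let mut picked = Vec::new();
    let mut total: u64 = 0;
    for s in sizes {
        if total + s > max_output_size {
            break; // output would be too large; leave the remaining files alone
        }
        total += s;
        picked.push(s);
    }
    picked
}
```

With the file sizes from the example above (in MiB) and a hypothetical 2 GiB cap, the 3.8 GiB file would be left out instead of being rewritten.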
For active windows, we enforce max_sorted_runs (maybe applied to both L0 and L1) and max_files (applied to all files in the active window, or just L0 in the active window?). If the current number of runs exceeds that value, we find the cheapest way to bring it back under the limit.
For example, for an active window with max_sorted_runs=1
For inactive windows, considering that there will sometimes be out-of-order insertions, we allow no more than one run in both L0 and L1 (the typical leveled compaction strategy). If the current number of runs exceeds that value, all overlapping files in both L0 and L1 will be picked and compacted into a new file in L1.
When an active window shifts to an inactive window, we check whether there is only one sorted run in both L0 and L1 (this does not require another routine; every compaction already checks every time window in the current implementation).
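Counting sorted runs in a window can be sketched as an interval-partitioning sweep over the files' time ranges: the minimum number of runs equals the maximum number of files whose ranges overlap at any single point. This is a sketch of the underlying math, not mito2's implementation:

```rust
/// Minimum number of sorted runs needed to cover the given inclusive time
/// ranges. Equals the maximum overlap depth (interval partitioning): sort the
/// range endpoints as +1/-1 events and track the running depth.
fn sorted_run_count(ranges: &[(i64, i64)]) -> usize {
    let mut events: Vec<(i64, i32)> = Vec::new();
    for &(start, end) in ranges {
        events.push((start, 1));
        events.push((end + 1, -1)); // end is inclusive, so close one past it
    }
    // Tuples sort by timestamp first, then delta; -1 sorts before +1 at the
    // same timestamp, so adjacent (touching) ranges do not count as overlapping.
    events.sort_unstable();
    let (mut depth, mut max_depth) = (0i32, 0i32);
    for (_, d) in events {
        depth += d;
        max_depth = max_depth.max(depth);
    }
    max_depth as usize
}
```

A window whose run count exceeds max_sorted_runs would then be a compaction candidate.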
We should ensure all overlapping files in the same time window are compacted into one sorted run, since we only have two levels and always remove delete tombstones after compaction. Otherwise, we should add a flag to the merge reader to control whether to remove delete tombstones, and only filter deleted items when we are sure all overlapping files are being compacted.
Just like Cassandra's size-tiered strategy, tombstones are only dropped when all overlapping files are involved in the current compaction.
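That rule reduces to a simple predicate that could drive the proposed merge-reader flag. This is a hypothetical helper, with plain file ids standing in for real file metadata:

```rust
/// A compaction may drop delete tombstones only when its inputs include every
/// file that overlaps them in the window (the Cassandra size-tiered rule);
/// otherwise a dropped tombstone could resurrect a deleted key in a file that
/// was left out of this compaction.
fn can_drop_tombstones(input_ids: &[u64], overlapping_ids: &[u64]) -> bool {
    overlapping_ids.iter().all(|id| input_ids.contains(id))
}
```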
I also found a potential issue: compaction jobs might block flush jobs. When using object stores, a compaction job with many input files results in a large number of file purge tasks. Deleting an object in an object store is slower than deleting a file on local disk. These purge tasks use the same background scheduler as flush jobs and slow them down.
2024-03-28T06:59:10.833054Z INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(9f93f433-2928-474c-99b8-0e313aeeff81), time_range: (Timestamp { value: 1708489285951, unit: Millisecond }, Timestamp { value: 1708489285951, unit: Millisecond }), level: 0, file_size: 29669, available_indexes: [InvertedIndex], index_file_size: 20331 },
// ......
FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(c07bf121-2ef6-4e2c-8381-67b27958195f), time_range: (Timestamp { value: 1708489210951, unit: Millisecond }, Timestamp { value: 1708489285951, unit: Millisecond }), level: 0, file_size: 32240, available_indexes: [InvertedIndex], index_file_size: 20333 }, FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(be30fabe-d0b4-4e54-89c2-ae860b3cd9ae), time_range: (Timestamp { value: 1702940890950, unit: Millisecond }, Timestamp { value: 1708489555951, unit: Millisecond }), level: 1, file_size: 91674531, available_indexes: [InvertedIndex], index_file_size: 41016404 }], output: [FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(8343abbc-2d5f-4ba4-a8f8-bf8efb4d71a8), time_range: (Timestamp { value: 1702940890950, unit: Millisecond }, Timestamp { value: 1708489555951, unit: Millisecond }), level: 1, file_size: 91674772, available_indexes: [InvertedIndex], index_file_size: 41016403 }], window: Some(31536000)
2024-03-28T06:59:10.899509Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9f93f433-2928-474c-99b8-0e313aeeff81, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.903589Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9031acad-daa6-44a7-a4d4-a85593236fbb, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.906535Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 51d57191-4007-495a-8b2b-e4a6e74a77bc, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.907597Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 7b7b74a8-1062-49ea-9582-cde00f27bf5b, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.927402Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 53f8d4cb-1af1-4772-9471-a811c7ccf424, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.932686Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 438cbe35-b94c-4e1e-8671-78d7614acc46, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.937580Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 94592fe4-f407-4cdd-b875-f7dce7c916aa, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.938907Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 7d8942b5-edf4-4531-a4b1-6b007bc8d41c, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.957885Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 2f892e83-8477-46de-89df-6a1d829f6fef, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.969388Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9aef716f-6e28-45bb-ab61-b4e585df5c99, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.972234Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: dd9be29e-4ac5-4fc6-9c3a-00dac12f4995, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.973382Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 2bbbb635-b09f-4c83-93bc-141211557447, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.991235Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: b6e7eb67-967c-4b98-bf1a-8ae463708ad8, region: 5175435591680(1205, 0)
2024-03-28T06:59:11.006431Z INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: f1da6408-6f8a-45dd-94f3-067d1d444e87, region: 5175435591680(1205, 0)
......
What's more, compaction jobs and flush jobs also use the same background worker pool and job limit.
I created a PR https://github.com/GreptimeTeam/greptimedb/pull/3621 for this.
For flush and compaction jobs, I wonder if we should use dedicated schedulers.
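A minimal sketch of the dedicated-scheduler idea using plain std channels and threads: flush and purge work land on separate queues, so a backlog of slow object-store deletions cannot delay flushes. Names here are illustrative, not GreptimeDB's actual scheduler API:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch only: each job type gets its own queue and worker thread.
enum Job {
    Flush(&'static str),
    Purge(&'static str),
}

/// Spawn a dedicated worker that drains its own queue; returns the sender and
/// a handle yielding the number of jobs processed once the sender is dropped.
fn spawn_scheduler(name: &'static str) -> (mpsc::Sender<Job>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let handle = thread::spawn(move || {
        let mut done = 0;
        for job in rx {
            match job {
                Job::Flush(id) => println!("[{name}] flushed {id}"),
                Job::Purge(id) => println!("[{name}] purged {id}"),
            }
            done += 1;
        }
        done
    });
    (tx, handle)
}
```

With two schedulers, a flood of purge jobs queues up behind the purge worker only, while the flush queue stays empty and responsive.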
Great job!
@v0y4g3r Assign a doc task to you. https://github.com/GreptimeTeam/docs/issues/1009 We need a detailed document to describe the compaction, it's a critical task for GreptimeDB.
I'll add a separate doc section describing how compaction works in v0.9.