
Optimize compaction strategy to avoid merging SST files without overlapping time ranges

Open v0y4g3r opened this issue 11 months ago • 6 comments

What type of enhancement is this?

Performance

What does the enhancement do?

With the introduction of the new merge tree memtable, we may find that SST files located in L0 can be as large as several GiBs. In this case, merging L0 files regardless of their size can be expensive while doing little for read performance. We should adopt a "tiered" merging strategy that skips large files and those that do not overlap with others in terms of time ranges.
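A minimal sketch of this selection rule (not the actual mito2 picker): `FileHandle`, `pick_inputs`, and the 1 GiB threshold below are illustrative stand-ins.

```rust
// Skip oversized files and files whose time range overlaps no other candidate.
#[derive(Debug, Clone)]
struct FileHandle {
    id: u32,
    size_bytes: u64,
    time_range: (i64, i64), // inclusive (start, end) in milliseconds
}

fn overlaps(a: &FileHandle, b: &FileHandle) -> bool {
    a.time_range.0 <= b.time_range.1 && b.time_range.0 <= a.time_range.1
}

/// Pick compaction inputs: drop files above `max_input_file_size`, then
/// keep only files that overlap at least one other remaining candidate.
fn pick_inputs(files: &[FileHandle], max_input_file_size: u64) -> Vec<FileHandle> {
    let candidates: Vec<&FileHandle> = files
        .iter()
        .filter(|f| f.size_bytes <= max_input_file_size)
        .collect();
    candidates
        .iter()
        .filter(|f| candidates.iter().any(|g| g.id != f.id && overlaps(f, g)))
        .map(|f| (**f).clone())
        .collect()
}

fn main() {
    let files = vec![
        FileHandle { id: 1, size_bytes: 4 << 30, time_range: (0, 100) }, // too large: skipped
        FileHandle { id: 2, size_bytes: 452 << 20, time_range: (0, 50) },
        FileHandle { id: 3, size_bytes: 426 << 20, time_range: (40, 90) },
        FileHandle { id: 4, size_bytes: 423 << 20, time_range: (200, 300) }, // no overlap: skipped
    ];
    // Only files 2 and 3 are picked.
    println!("{:?}", pick_inputs(&files, 1 << 30));
}
```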

Tasks

  • [ ] Only filter deleted entries when there's only one file in sorted run #3707
  • [ ] Implement sorted-run based compaction to reduce write amplification #3702

Implementation challenges

No response

v0y4g3r avatar Feb 29 '24 13:02 v0y4g3r

We compact small files together with a large file, e.g.:

3.8G 4649c98-4edc-444e-a9bd-920d9b88b033.parquet
452M 8fea78f6-ad12-4344-a43d-c93092bfe068.parquet
426M d3a2ddcd-1ff4-4d82-a87d-d47d1f8490ba.parquet
423M fb727d76-2afd-487e-a286-cdeea29048dd.parquet

becomes

4.5G 5d79971b-ce56-4984-af07-5e60fa6f9436.parquet

We might stop compacting a window if the output file is too large.

However, we need to choose the input files carefully: we remove deleted keys and their deletion markers during compaction, so we should only compact files with overlapping keys together.
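A hedged sketch of the size cap, assuming a greedy smallest-first pick; `MAX_OUTPUT_SIZE` and `cap_inputs` are illustrative, not real mito2 options:

```rust
const MAX_OUTPUT_SIZE: u64 = 2 << 30; // 2 GiB, illustrative

/// Greedily take files (smallest first) until the estimated merged output
/// would exceed the cap; the remaining files wait for a later round.
fn cap_inputs(mut sizes: Vec<u64>) -> (Vec<u64>, Vec<u64>) {
    sizes.sort_unstable();
    let (mut picked, mut rest) = (Vec::new(), Vec::new());
    let mut total = 0u64;
    for size in sizes {
        if total + size <= MAX_OUTPUT_SIZE {
            total += size;
            picked.push(size);
        } else {
            rest.push(size);
        }
    }
    (picked, rest)
}

fn main() {
    // Sizes from the example above: ~3.8G, 452M, 426M, 423M.
    let sizes = vec![3_800 << 20, 452 << 20, 426 << 20, 423 << 20];
    let (picked, rest) = cap_inputs(sizes);
    // The three small files are merged; the 3.8G file is left out.
    println!("compact: {:?}, skip: {:?}", picked, rest);
}
```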

evenyag avatar Feb 29 '24 13:02 evenyag

For active windows, we enforce max_sorted_runs (maybe applied to both L0 and L1) and max_files (applied to all files in the active window, or just L0 in the active window?). If the current number of runs exceeds that value, we find the cheapest way to bring it back under the limit.

For example, for an active window with max_sorted_runs=1

[image: compaction diagram]
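To enforce max_sorted_runs, the picker first needs to count the runs in a window. A minimal sketch, assuming inclusive millisecond time ranges: the minimum number of sorted runs equals the maximum number of files covering any single instant, which a boundary sweep computes (choosing the cheapest merge to reduce the count is a separate step).

```rust
fn count_sorted_runs(ranges: &[(i64, i64)]) -> usize {
    let mut events: Vec<(i64, i32)> = Vec::new();
    for &(start, end) in ranges {
        events.push((start, 1));
        events.push((end + 1, -1)); // inclusive end closes just after `end`
    }
    // Sorting puts closes (-1) before opens (+1) at equal timestamps, so
    // ranges that merely touch end-to-start do not count as overlapping.
    events.sort_unstable();
    let (mut open, mut peak) = (0i32, 0i32);
    for (_, delta) in events {
        open += delta;
        peak = peak.max(open);
    }
    peak as usize
}

fn main() {
    // Two overlapping files plus one disjoint file form 2 sorted runs.
    assert_eq!(count_sorted_runs(&[(0, 10), (5, 15), (20, 30)]), 2);
}
```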

For inactive windows, considering that out-of-order insertions can still happen, we allow no more than one run in both L0 and L1 (the typical leveled compaction strategy). If the current number of runs exceeds that value, all overlapping files in both L0 and L1 will be picked and compacted into a new file in L1.

When an active window shifts to an inactive window, we check whether there is only one sorted run in both L0 and L1 (this does not require another routine; every compaction already checks every time window in the current implementation).

[image: compaction diagram]
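A hedged sketch of that inactive-window plan, treating L0 and L1 files uniformly; `File` and `plan_inactive_window` are illustrative names rather than the mito2 API:

```rust
#[derive(Debug, Clone)]
struct File {
    level: u8, // 0 or 1
    time_range: (i64, i64),
}

fn overlaps(a: &(i64, i64), b: &(i64, i64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Return every file (from both levels) that overlaps another file, to be
/// merged into one new L1 file; None means the window already holds a
/// single sorted run and needs no compaction.
fn plan_inactive_window(files: &[File]) -> Option<Vec<File>> {
    let picked: Vec<File> = files
        .iter()
        .enumerate()
        .filter(|(i, a)| {
            files
                .iter()
                .enumerate()
                .any(|(j, b)| *i != j && overlaps(&a.time_range, &b.time_range))
        })
        .map(|(_, f)| f.clone())
        .collect();
    if picked.is_empty() { None } else { Some(picked) }
}

fn main() {
    let files = vec![
        File { level: 0, time_range: (0, 10) },
        File { level: 1, time_range: (5, 20) },
        File { level: 1, time_range: (30, 40) }, // disjoint: left alone
    ];
    println!("{:?}", plan_inactive_window(&files));
}
```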

v0y4g3r avatar Mar 29 '24 06:03 v0y4g3r

We should ensure all overlapping files in the same time window are compacted into one sorted run, since we only have 2 levels and always remove the delete tombstone after compaction. Otherwise, we should add a flag to the merge reader to control whether to remove the delete tombstone. We should only filter deleted items when we can ensure all overlapping files are compacted.
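A minimal sketch of the proposed flag (`MergeReader`, `Row`, and `filter_deleted` are illustrative names): the reader keeps tombstones unless the caller asserts the compaction covers every overlapping file, because dropping a tombstone otherwise could resurrect an older value living in a file outside the compaction.

```rust
#[derive(Debug, Clone)]
struct Row {
    key: String,
    value: Option<String>, // None marks a delete tombstone
}

struct MergeReader {
    rows: std::vec::IntoIter<Row>, // stand-in for the merged input streams
    filter_deleted: bool,          // safe only when inputs cover all overlaps
}

impl Iterator for MergeReader {
    type Item = Row;

    fn next(&mut self) -> Option<Row> {
        loop {
            let row = self.rows.next()?;
            // Drop tombstones only when the flag says it is safe.
            if self.filter_deleted && row.value.is_none() {
                continue;
            }
            return Some(row);
        }
    }
}

fn main() {
    let rows = vec![
        Row { key: "a".into(), value: Some("1".into()) },
        Row { key: "b".into(), value: None }, // tombstone
    ];
    let reader = MergeReader { rows: rows.into_iter(), filter_deleted: true };
    // Prints only the live row for "a".
    for row in reader {
        println!("{:?}", row);
    }
}
```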

evenyag avatar Mar 29 '24 07:03 evenyag

We should ensure all overlapping files in the same time window are compacted into one sorted run, since we only have 2 levels and always remove the delete tombstone after compaction. Otherwise, we should add a flag to the merge reader to control whether to remove the delete tombstone. We should only filter deleted items when we can ensure all overlapping files are compacted.

Just like Cassandra's size-tiered strategy, tombstones are only dropped when all overlapping files are involved in the current compaction.
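Expressed as a predicate (a simplification that tests overlap on time ranges only; the real check could also consider key ranges):

```rust
type TimeRange = (i64, i64);

fn ranges_overlap(a: &TimeRange, b: &TimeRange) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// True when no file outside the compaction overlaps any input, so
/// removing tombstones cannot resurrect shadowed values.
fn can_drop_tombstones(inputs: &[TimeRange], others: &[TimeRange]) -> bool {
    others
        .iter()
        .all(|o| inputs.iter().all(|i| !ranges_overlap(i, o)))
}

fn main() {
    let inputs = [(0, 10), (5, 15)];
    let others = [(20, 30)]; // disjoint from every input
    assert!(can_drop_tombstones(&inputs, &others));
}
```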

v0y4g3r avatar Mar 29 '24 07:03 v0y4g3r

I also found a potential issue: compaction jobs might block flush jobs. When using object stores, a compaction job with lots of input files results in a large number of file purge tasks, and deleting an object in an object store is slower than on a local disk. These purge tasks use the same background scheduler as flush jobs and slow down the flushes.

2024-03-28T06:59:10.833054Z  INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(9f93f433-2928-474c-99b8-0e313aeeff81), time_range: (Timestamp { value: 1708489285951, unit: Millisecond }, Timestamp { value: 1708489285951, unit: Millisecond }), level: 0, file_size: 29669, available_indexes: [InvertedIndex], index_file_size: 20331 },
// ......
FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(c07bf121-2ef6-4e2c-8381-67b27958195f), time_range: (Timestamp { value: 1708489210951, unit: Millisecond }, Timestamp { value: 1708489285951, unit: Millisecond }), level: 0, file_size: 32240, available_indexes: [InvertedIndex], index_file_size: 20333 }, FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(be30fabe-d0b4-4e54-89c2-ae860b3cd9ae), time_range: (Timestamp { value: 1702940890950, unit: Millisecond }, Timestamp { value: 1708489555951, unit: Millisecond }), level: 1, file_size: 91674531, available_indexes: [InvertedIndex], index_file_size: 41016404 }], output: [FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(8343abbc-2d5f-4ba4-a8f8-bf8efb4d71a8), time_range: (Timestamp { value: 1702940890950, unit: Millisecond }, Timestamp { value: 1708489555951, unit: Millisecond }), level: 1, file_size: 91674772, available_indexes: [InvertedIndex], index_file_size: 41016403 }], window: Some(31536000)

2024-03-28T06:59:10.899509Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9f93f433-2928-474c-99b8-0e313aeeff81, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.903589Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9031acad-daa6-44a7-a4d4-a85593236fbb, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.906535Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 51d57191-4007-495a-8b2b-e4a6e74a77bc, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.907597Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 7b7b74a8-1062-49ea-9582-cde00f27bf5b, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.927402Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 53f8d4cb-1af1-4772-9471-a811c7ccf424, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.932686Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 438cbe35-b94c-4e1e-8671-78d7614acc46, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.937580Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 94592fe4-f407-4cdd-b875-f7dce7c916aa, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.938907Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 7d8942b5-edf4-4531-a4b1-6b007bc8d41c, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.957885Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 2f892e83-8477-46de-89df-6a1d829f6fef, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.969388Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9aef716f-6e28-45bb-ab61-b4e585df5c99, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.972234Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: dd9be29e-4ac5-4fc6-9c3a-00dac12f4995, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.973382Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 2bbbb635-b09f-4c83-93bc-141211557447, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.991235Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: b6e7eb67-967c-4b98-bf1a-8ae463708ad8, region: 5175435591680(1205, 0)
2024-03-28T06:59:11.006431Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: f1da6408-6f8a-45dd-94f3-067d1d444e87, region: 5175435591680(1205, 0)
......

What's more, compaction jobs and flush jobs also use the same background worker pool and job limit.

evenyag avatar Mar 29 '24 07:03 evenyag

I also found a potential issue: compaction jobs might block flush jobs. When using object stores, a compaction job with lots of input files results in a large number of file purge tasks, and deleting an object in an object store is slower than on a local disk. These purge tasks use the same background scheduler as flush jobs and slow down the flushes.

I created a PR https://github.com/GreptimeTeam/greptimedb/pull/3621 for this.

For flush and compaction jobs, I wonder if we should use dedicated schedulers.
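A minimal sketch of the dedicated-scheduler idea, using tokio semaphores as stand-ins for the job limits; `Schedulers` and the permit counts are illustrative, not mito2's actual scheduler:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[derive(Clone)]
struct Schedulers {
    flush: Arc<Semaphore>,
    purge: Arc<Semaphore>,
}

impl Schedulers {
    fn new(flush_jobs: usize, purge_jobs: usize) -> Self {
        Self {
            flush: Arc::new(Semaphore::new(flush_jobs)),
            purge: Arc::new(Semaphore::new(purge_jobs)),
        }
    }
}

#[tokio::main]
async fn main() {
    let sched = Schedulers::new(4, 16);

    // Purge tasks wait on their own permit pool; even if all 16 permits
    // are held by slow object-store deletes, flush permits stay free.
    let purge_sched = sched.clone();
    let purge = tokio::spawn(async move {
        let _permit = purge_sched.purge.acquire().await.unwrap();
        // delete the object from the object store here
    });

    let flush_sched = sched.clone();
    let flush = tokio::spawn(async move {
        let _permit = flush_sched.flush.acquire().await.unwrap();
        // write the memtable out as a new SST here
    });

    let _ = tokio::join!(purge, flush);
}
```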

evenyag avatar Mar 31 '24 06:03 evenyag

Great job!

killme2008 avatar Jun 28 '24 23:06 killme2008

@v0y4g3r Assigned a doc task to you: https://github.com/GreptimeTeam/docs/issues/1009. We need a detailed document describing compaction; it's a critical task for GreptimeDB.

killme2008 avatar Jun 28 '24 23:06 killme2008

@v0y4g3r Assigned a doc task to you: GreptimeTeam/docs#1009. We need a detailed document describing compaction; it's a critical task for GreptimeDB.

I'll add a separate doc section describing how compaction works in v0.9.

v0y4g3r avatar Jul 01 '24 07:07 v0y4g3r