Support empty segments for realtime tables
Addresses #12703
This patch allows persistence of empty segments in case of no messages in realtime ingestion. Right now, it can happen that a segment is stuck in CONSUMING state for months / years if no offset advances for it. We also add support for retention manager finding empty segments aggressively and purging them.
Codecov Report
Attention: Patch coverage is 93.33333% with 2 lines in your changes missing coverage. Please review.
Project coverage is 61.98%. Comparing base (
59551e4) to head (12a7068). Report is 677 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| .../pinot/core/data/manager/BaseTableDataManager.java | 50.00% | 1 Missing :warning: |
| .../starter/helix/HelixInstanceDataManagerConfig.java | 0.00% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #13362 +/- ##
============================================
+ Coverage 61.75% 61.98% +0.23%
+ Complexity 207 198 -9
============================================
Files 2436 2559 +123
Lines 133233 141215 +7982
Branches 20636 21916 +1280
============================================
+ Hits 82274 87533 +5259
- Misses 44911 47025 +2114
- Partials 6048 6657 +609
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) |
:arrow_down: |
| integration | <0.01% <0.00%> (-0.01%) |
:arrow_down: |
| integration1 | <0.01% <0.00%> (-0.01%) |
:arrow_down: |
| integration2 | 0.00% <0.00%> (ø) |
|
| java-11 | 61.94% <93.33%> (+0.23%) |
:arrow_up: |
| java-21 | 61.86% <93.33%> (+0.24%) |
:arrow_up: |
| skip-bytebuffers-false | 61.97% <93.33%> (+0.23%) |
:arrow_up: |
| skip-bytebuffers-true | 61.83% <93.33%> (+34.10%) |
:arrow_up: |
| temurin | 61.98% <93.33%> (+0.23%) |
:arrow_up: |
| unittests | 61.98% <93.33%> (+0.23%) |
:arrow_up: |
| unittests1 | 46.67% <66.66%> (-0.22%) |
:arrow_down: |
| unittests2 | 27.51% <86.66%> (-0.22%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Sajjad and I will review this and get back to you. Please hold merging for a bit.
I believe we already allow empty segments during force commit, so that by itself should not be a problem. @sajjad-moradi ?
I believe we already allow empty segments during force commit, so that by itself should not be a problem.
+1! And we are ensuring here that we don't remove the last committed empty segment for keeping the metadata info so should be pretty safe imo.
Committing an empty segment is fine. I'm concerned that deleting the empty ones might make debugging a bit harder. If something goes wrong with segments of a partition, it would be hard by looking at IdealState, ExternalView, or SEGMENTS in PropertyStore to distinguish if a segment was purged due no-data case or something has indeed gone wrong.
I'm concerned that deleting the empty ones might make debugging a bit harder. If something goes wrong with segments of a partition, it would be hard by looking at IdealState, ExternalView, or SEGMENTS in PropertyStore to distinguish if a segment was purged due no-data case or something has indeed gone wrong.
In that case, we can enable the empty segment retention manager behind a config: deleteEmptySegments by default we can keep it to false. Ideally keeping a lot of empty segments in a table doesn't make sense if the commit threshold time is few hours and retention is high. Wdyt?
What is the current mechanism of removing empty segments?
What is the current mechanism of removing empty segments?
I believe the time-based retention manager should handle this since we get segment.end.time information in the metadata for empty segments as well. However, a specific scenario for upserts is that if we set indefinite retention and rely on UpsertCompaction to purge segments, empty segments won't be included. This is because we don't add EmptyIndexSegment to _trackedSegments, which ultimately leads to them being skipped in the snapshot flow as well.
Good point on the infinite retention case. I'm okay always removing empty segment as long as it is not the last committed one (used to resume consumption or mark the end of a partition). I think it is okay because removed segments are either expired or empty. I don't see much value keeping all empty segments around even for debugging purpose. @sajjad-moradi wdyt?
We won't have a lot of segments if we don't commit empty segments. A good thing about the existing behavior is that when we see a consuming segment that's started a long time ago, we can immediately tell that there hasn't been any new events on the stream, without needing for further investigation!
To get around the issue that this PR is trying to fix, can't we just issue a force commit, and then deleted those segments? I don't like the idea of committing the empty consuming segments and then removing them in another task, retention manager. Retention manager, as its name suggests, should only delete segments that are older than the retention of the table.
If you guys still think it's better to go with the approach suggested in this PR, let's have a config for committing empty consuming segment, and keep the existing behavior behind the default value of that config.
@sajjad-moradi One problem with force commit is that it won't commit the empty segment for the same reason. I'm okay adding a config to allow committing empty segment, then allow retention manager to remove it if it is not the last committed segment.
when we see a consuming segment that's started a long time ago, we can immediately tell that there hasn't been any new events on the stream, without needing for further investigation!
We can monitor various metrics to determine if this scenario occurs after allowing this empty segments behaviour.
I will move this empty segment behaviour behind a config but somehow I am not convinced to not make this default behaviour just because of better debuggability. It's just that overtime we will have to manage a lot of configs for getting the correct functionality out of Pinot.
There are at least a couple of more considerations I can think of. @tibrewalpratik17 can you please follow up on these?
- The server keeps a rolling statistics buffer on one of the segments of the table that helps in deciding how much memory to allocate for consuming segments. A commit of a zero-sized segment may end up skewing these statistics unfavorably. See ReatimeSegmentStatsHistory.
- When one segment completes, the controller computes the number of rows to be consumed for the next segment, based on the size of the current segment. Will this computation also get skewed with zero-sized segments? See FlushThresholdUpdater and implementations.
Thanks @mcvsubbu for the pointers!
Briefly going through the code, I observe the following:
The server keeps a rolling statistics buffer on one of the segments of the table that helps in deciding how much memory to allocate for consuming segments. A commit of a zero-sized segment may end up skewing these statistics unfavorably. See ReatimeSegmentStatsHistory.
I noticed that we only update the stats when _numRowsIndexed is strictly greater than 0 (Ref). Therefore, the behavior remains the same: if we do not commit an empty segment, the stats history is not updated, and this applies to the current scenario as well.
When one segment completes, the controller computes the number of rows to be consumed for the next segment, based on the size of the current segment. Will this computation also get skewed with zero-sized segments? See FlushThresholdUpdater and implementations.
Reviewing the implementation of SegmentSizeBasedFlushThresholdUpdater, it appears that empty segments are already handled (Ref), as the new threshold is set to the same value as the previous committed segment threshold. In the case of committing consecutive empty segments, if a table's partition has never received a message, it defaults to the autotuneInitialRows configuration value (Ref).
After a second thought, the root cause of the problem is that: retention manager always keeps the last completed segment even if it is already expired. I think we need to fix this behavior because it is violating the retention agreement. @sajjad-moradi Do you think consumption pause/resume can work if we cleanup the last committed segment? Basically we should resume from the earliest offset @KKcorps Does Kinesis work if we clean up the last committed segment for a closed partition?
retention manager always keeps the last completed segment even if it is already expired. I think we need to fix this behavior because it is violating the retention agreement.
but for very high / indefinite retention we would still end up with the consuming segment open for quite long.
@tibrewalpratik17 Consuming segment will be opened for long time only if there is no event consumed. As long as we can clean up all the expired segments it should be fine