
Add a configurable bufferPeriod between when a segment is marked unused and deleted by KillUnusedSegments duty

Open capistrant opened this issue 3 years ago • 7 comments

This PR addresses part of the proposal in #12526

Description

Made an incompatible change to druid_segments table

Added a new column, ~last_used VARCHAR(255)~ used_flag_last_updated VARCHAR(255), to the druid_segments table. When creating the druid_segments table from scratch, this column is not nullable. However, to more easily facilitate an upgrade to a version of Druid with this code, the migration steps to ALTER druid_segments allow the column to have null values. ~Perhaps we can document an optional post upgrade step to ALTER the table to not allow nulls that is contingent upon completing the existing optional post-upgrade step to populate last_used for already unused rows.~ The coordinator will incrementally update these nulls to valid date strings for all already-unused segments post-upgrade. This automatically brings the metastore into compliance with the new version of Druid.

This column is a UTC date string corresponding to the last time that the used column was modified.

Added a configuration to the druid.coordinator.kill. family, druid.coordinator.kill.bufferPeriod

druid.coordinator.kill.bufferPeriod is a Duration that defines how long a segment must have been unused before KillUnusedSegments will potentially kill it. For example, with the default of PT24H, if I mark a segment X unused at 2022-06-01T00:05:00.000Z, then by rule that segment cannot be killed until at or after 2022-06-02T00:05:00.000Z.
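The rule above can be sketched as a simple predicate. This is a minimal illustration of the described behavior, not Druid's actual implementation; the class and method names here are hypothetical:

```java
import java.time.Duration;
import java.time.Instant;

public class BufferPeriodCheck {
    // Sketch of the rule described above: a segment may be killed only once
    // bufferPeriod has elapsed since its used flag was last flipped to unused.
    static boolean isKillable(Instant usedFlagLastUpdated, Duration bufferPeriod, Instant now) {
        return !now.isBefore(usedFlagLastUpdated.plus(bufferPeriod));
    }

    public static void main(String[] args) {
        Duration bufferPeriod = Duration.parse("PT24H"); // the default from the text
        Instant markedUnused = Instant.parse("2022-06-01T00:05:00Z");
        System.out.println(isKillable(markedUnused, bufferPeriod, Instant.parse("2022-06-01T12:00:00Z"))); // false
        System.out.println(isKillable(markedUnused, bufferPeriod, Instant.parse("2022-06-02T00:05:00Z"))); // true
    }
}
```

Note that the deadline is inclusive: a segment marked unused at 00:05 becomes killable at exactly 00:05 the next day under the default PT24H.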

The most prominent use of this new configuration is to prevent accidental deletion of data from Druid. It can be thought of like the trash folder in HDFS: marking segments unused is the rm. Without trash, the data is just gone; with trash, we can recover it. The buffer period is the amount of time before pending trash is sent away for good.

Added a single threaded Executor on the coordinator for updating null values of new column

Segments that are already unused at the time of the upgrade will have a null value for used_flag_last_updated, and in this state they will never be killed. The Coordinator will start a background thread to update all of these null values for segments that are already marked unused (segments with null values that are still used don't need a valid date string for this column; they will get one if/when they are marked unused).
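The backfill logic described above can be sketched against a toy in-memory table. This is an illustrative model only, under the assumption that the thread works in bounded batches until no null-valued unused rows remain; the class, batch size, and update mechanics are hypothetical, not the PR's code:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class NullBackfillSketch {
    // Toy model of a druid_segments row; the real rows live in the metastore.
    static class SegmentRow {
        boolean used;
        String usedFlagLastUpdated; // null for rows that predate the upgrade
        SegmentRow(boolean used, String ts) { this.used = used; this.usedFlagLastUpdated = ts; }
    }

    // Walk the table in small batches, stamping unused rows whose column is
    // still null. Returns the number of rows updated.
    static int backfill(List<SegmentRow> table, int batchSize, Instant now) {
        int updated = 0;
        while (true) {
            List<SegmentRow> batch = new ArrayList<>();
            for (SegmentRow row : table) {
                if (!row.used && row.usedFlagLastUpdated == null) {
                    batch.add(row);
                    if (batch.size() == batchSize) break;
                }
            }
            if (batch.isEmpty()) return updated; // nothing left to do: exit
            for (SegmentRow row : batch) {
                row.usedFlagLastUpdated = now.toString();
                updated++;
            }
        }
    }

    public static void main(String[] args) {
        List<SegmentRow> table = new ArrayList<>();
        table.add(new SegmentRow(false, null));                   // legacy unused: gets stamped
        table.add(new SegmentRow(true, null));                    // still used: left alone
        table.add(new SegmentRow(false, "2022-06-01T00:05:00Z")); // already stamped
        System.out.println(backfill(table, 100, Instant.parse("2022-08-01T00:00:00Z")));
    }
}
```

The key property is that still-used rows with null values are deliberately skipped; they only receive a timestamp when their used flag is later flipped.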

Alternative Design

Alternatively, I could have embedded last_used within the payload. The functionality at the end of the day would be the same, but I have listed some pros and cons below.

Pros

  • No table schema change needed to upgrade Druid, making life easier for an operator.

Cons

  • If you were adding this feature in the initial pre-release development of Druid, you'd probably steer away from accumulating too much in the payload column and instead create native columns in the database table
    • embedding it in this column is almost hiding it. Or at least making it harder for others to discover and understand.
  • You need to pull the payload back in the query for unused segments, increasing database I/O, network traffic, and Druid memory and compute demands
  • You need to process the payloads in Druid code instead of simply filtering on the column value if it were native to the metastore table

Upgrade Plan

~Requiring a database table alteration is a breaking change. It introduces an extra requirement for an operator to upgrade from an older version of Druid. I have done my best to limit the complexity of upgrading and provide as much assistance as possible to the Druid operators out there.~

Requiring a database table alteration is a breaking change for any cluster whose coordinator doesn't have DDL permission on startup. If there are no such privileges, the operator of the cluster will need to handle adding the column in order to update to 0.24+. I have done my best to limit the complexity of upgrading and provide as much assistance as possible to the Druid operators out there.

No work necessary case

There is a case where there is no extra work necessary for the Druid operator to upgrade. If the operator has used the default of true for druid.metadata.storage.connector.createTables, and their metastore user has DDL privs, the table will automatically be altered at startup if the ~last_used~ used_flag_last_updated column is missing from the druid_segments table.

The coordinator and overlord will start up just fine, and all future changes to the used column (plus new segment creation) will populate ~last_used~ used_flag_last_updated. For existing segments that match used = true, the value of ~last_used~ used_flag_last_updated will be null. ~I have coded up the logic for finding segments to kill in a way that ignores the bufferPeriod for these segments with null column values. This means that KillUnusedSegments will use the same logic as pre-upgrade when looking for killable segments in the metadata store.~ The segment killing mechanism will ignore these rows with null used_flag_last_updated. The Coordinator will eventually populate the column with a real date string, thus starting the bufferPeriod clock.

Some work necessary case

If the operator has either:

  • set druid.metadata.storage.connector.createTables to false
  • not given their metastore user DDL privs

They will need to execute an ALTER statement themselves before the upgrade. To make things easy for them, I have created a new Druid CLI tool called UpdateTables that can perform the alteration by executing the same alter-table code path as the no-work-necessary case described above. The tool, as well as the actual ALTER command (in case the operator wants to run it themselves), is documented in a new upgrade-prep.md resource.
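For orientation, the manual statement would look roughly like the following. This is an illustrative sketch only, not the documented command from upgrade-prep.md; the exact DDL depends on the metastore dialect (MySQL-style syntax assumed here), and the column is left nullable to match the migration behavior described above:

```sql
-- Illustrative only: the authoritative statement is in upgrade-prep.md.
ALTER TABLE druid_segments
  ADD COLUMN used_flag_last_updated VARCHAR(255);
```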

Optional post upgrade action

~If the operator is interested in populating all of their legacy unused segments with a last_used date string following the upgrade, I have included another action in the new UpdateTables cli tool that will do so. Again, I have documented the tool and provided the actual UPDATE statement that they could use manually. This step is completely optional and only needed if the operator desires that their legacy unused segments come under the purview of the new bufferPeriod config post-upgrade.~

This is no longer required now that the coordinator will handle populating used_flag_last_updated with already unused segments at the time of the upgrade.


Key changed/added classes in this PR
  • MetadataStorageConnector - added 2 methods to interface
  • IndexerSQLMetadataStorageCoordinator - create new segments with used_flag_last_updated column populated
  • SQLMetadataConnector - code for altering and validating the druid_segments table
  • SQLMetadataSegmentPublisher - publish segments with used_flag_last_updated column populated
  • SegmentsMetadataManager - update signature of interface method
  • SqlSegmentsMetadataManager - manage and use used_flag_last_updated column properly
  • SqlSegmentsMetadataQuery - manage used_flag_last_updated properly
  • DruidCoordinatorConfig - new config
  • Main - new cli tool
  • UpdateTables - new cli tool

This PR has:

  • [x] been self-reviewed.
  • [x] added documentation for new or modified features or behaviors.
  • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [x] been tested in a test Druid cluster.

capistrant avatar Jun 02 '22 21:06 capistrant

#12404 (currently-open PR) also contemplates doing schema changes in the metadata store. @capistrant I see you've commented in there too. Will the feature in this PR be able to use that other PR's mechanism? (Once it's merged?)

gianm avatar Jun 14 '22 00:06 gianm

Alternatively, should #12404 use the schema migration mechanism from this patch? I haven't reviewed both closely. I'm not sure which mechanism is preferable. Curious what @AmatyaAvadhanula and @imply-cheddar and @kfaraz think too.

gianm avatar Jun 14 '22 00:06 gianm

As to the main change: what do you think of calling the new column last_updated? IMO, that's clearer, since last_used sounds like it might be the last time the segment was loaded or queried or something like that.

Generally, this change would be very welcome. Measuring time-to-kill from the last update time makes more sense than measuring it from the data interval.

gianm avatar Jun 14 '22 00:06 gianm

As to the main change: what do you think of calling the new column last_updated? IMO, that's clearer, since last_used sounds like it might be the last time the segment was loaded or queried or something like that.

Generally, this change would be very welcome. Measuring time-to-kill from the last update time makes more sense than measuring it from the data interval.

I'm certainly open to different naming. last_changed spooks me because there could be a future where we change a row for some other reason than flipping the used flag. Perhaps used_last_updated, [marked_]unused_since (although this one might require us to re-think what the timestamp is when used==true).

capistrant avatar Jun 15 '22 12:06 capistrant

#12404 (currently-open PR) also contemplates doing schema changes in the metadata store. @capistrant I see you've commented in there too. Will the feature in this PR be able to use that other PR's mechanism? (Once it's merged?)

My plan is to use as much of this PR's process as is feasible. Our approaches are nearly identical for the actual alteration of the table on startup of the controlling service (overlord for that one, coordinator for mine). The difference really emerges in the migration process. The linked PR is more complicated than mine because it replaces existing always-on functionality, which requires them to be deliberate about how they transition from querying SQL the old way to the new way. They background a thread on startup that incrementally brings the table up to spec by populating rows. I could certainly implement a similar approach instead of allowing the operator to optionally make their existing unused segments compatible with the buffer period via the Druid CLI tool or an ad hoc SQL statement.

Long story short, I don't think there is anything 12404 would/should take from my implementation. However, there is certainly shared code that I can take from them once they are merged. I will contemplate the idea of a background thread that brings the table up to spec; it would be pretty straightforward: if there are no unused rows with a null last_used, exit immediately; otherwise, use 12404's incremental batching process to update them with the current date until there are no null values in the column. That would eliminate overhead/confusion for our operators.

capistrant avatar Jun 15 '22 15:06 capistrant

Todo reminder for myself: evaluate feasibility of integration test of UpdateTable.java cli tool

capistrant avatar Jun 23 '22 14:06 capistrant

Thanks for the review and thoughts on changes @gianm ... I decided to go forward with the suggestion of background updates for unused segments with a NULL value for the new column. I wasn't completely sure how the implementation should look, so feedback on that code in SegmentsMetadataManager would be much appreciated. As of now things "work" in my local cluster for testing, but I'm not sure it is the cleanest solution out there.

My Maven profile on my personal computer seems a bit messed up right now, so I pushed despite not being able to run all the unit tests locally at this time. I'm working on fixing that tomorrow so I can identify any issues I may have introduced.

capistrant avatar Aug 23 '22 21:08 capistrant

@capistrant - are you still working on this PR? Asking since this is a really useful feature. And after this, we can enable the auto-killing of unused segments.

abhishekagarwal87 avatar Nov 30 '22 08:11 abhishekagarwal87

@capistrant - are you still working on this PR? Asking since this is a really useful feature. And after this, we can enable the auto-killing of unused segments.

@abhishekagarwal87 yes, I am still very much interested in getting this merged upstream. I have continued to try to keep it up to date with master. I see that I have some recent conflicts to resolve; I will try to fit that in this week.

capistrant avatar Dec 13 '22 19:12 capistrant

@capistrant - Just checking to see if you are still planning to take this through. If not, are you open to someone else picking it up?

abhishekagarwal87 avatar Feb 14 '23 05:02 abhishekagarwal87

@capistrant - Checking again to see if you're interested in picking this up again, if not, would you mind if I picked it up?

jon-wei avatar Jul 25 '23 01:07 jon-wei

@capistrant Thanks for the update, taking a look at this today

jon-wei avatar Aug 14 '23 18:08 jon-wei

@jon-wei @capistrant Is this PR good to merge?

@jon-wei , I tried clicking on https://github.com/apache/druid/pull/12599/files#r940734962 in your previous comment, but I'm not seeing any comment when I open the link

maytasm avatar Aug 17 '23 06:08 maytasm

Should used_flag_last_updated column be part of the index in the segment table?

maytasm avatar Jan 23 '24 23:01 maytasm