spark
spark copied to clipboard
[SPARK-40492][SS] Do maintenance before streaming StateStore unload
What changes were proposed in this pull request?
Before unload of a StateStore, perform a cleanup.
Why are the changes needed?
Current the maintenance of StateStore is performed by a periodic task in the management thread. If a streaming query become inactive before the next maintenance task fire, its StateStore will be unloaded before cleanup. There are 2 cases when a StateStore is unloaded.
- StateStoreProvider is not longer active in the system, for example, when a query ends or the spark context terminates.
- There is other active StateStoreProvider in the system, for example, when a partition is reassigned.
In case 1, we should do one last maintenance before unloading the instance,
Does this PR introduce any user-facing change?
Shutdown delay of a query may increase because a maintenance task is scheduled.
How was this patch tested?
Add an integration test that verify that redundant delta file is deleted when StateStore instances is deactivated and unloaded.
@HeartSaVioR
Can one of the admins verify this patch?
~~I gave a feedback offline but also duplicate here for the history.~~
~~getStateStoreProvider also report active provider instance and get “inactive provider instances” which could be unloaded without maintenance. We can defer these instances to be unloaded till next maintenance interval, but this may not be something we want, since we are effectively reverting SPARK-33827.~~
~~Back to the rationalization of this ticket, technically, what we really want to do is just to ensure all state store provider instances run maintenance "at least once". That said, ideally we should not perform additional maintenance on unloading of provider if it has run the maintenance at least once.~~
~~In getStateStoreProvider, we should only defer some of inactive provider instances which didn't run any maintenance yet. In maintenance task, we need to let provider be unloaded without maintenance if it ran maintenance at least once.~~
EDIT: I realized I missed the details. ReportActiveInstance only reports provider ids which are assigned to different executors. I thought it reports all inactive provider ids including terminated queries...
The change is making sense now. I'll take a second look.
Thanks! Merging to master.
Thanks @chaoqin-li1123 for the contribution! I merged this to master.