ksql
ksql copied to clipboard
fix: closing a rebalancing transient query may leak the state store directory
Description
Fixes https://github.com/confluentinc/ksql/issues/6360
If a small timeout is set during the transient query close and the query is still rebalancing after the timeout expires, then the stream thread re-creates the internal topics and state stores causing the leak.
This PR derives the query timeout from the internal stream threads timeouts instead of using a configured timeout.
Testing done
Unit tests
Reviewer checklist
- [ ] Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
- [ ] Ensure relevant issues are linked (description should include text like "Fixes #
")
@agavra The user does not have a way to set those internal timeouts. All internal closing methods have hard-coded values. The implication with a high value is the user may wait too much time until the query is closed. But setting the timeout too low also have the implication of leaking the state store if the stream was rebalancing.
@rodesai What do you think about this value?
The implication with a high value is the user may wait too much time until the query is closed.
Is this on the CLI path? e.g. if I ctrl-c a query does it hang for 5 min? Where does the user "see" this, or is it in the background?
Is this on the CLI path? e.g. if I ctrl-c a query does it hang for 5 min? Where does the user "see" this, or is it in the background?
This should only come into play if the query is stuck for some reason (e.g. in a rebalance). On a normal close it should be quick. That said, the default timeout here is 5 minutes. So this patch would actually make us wait less. @spena I think based on your analysis in the ticket, we can't actually bound the wait time here, since the producer and admin clients wait indefinitely. Since this is the case, perhaps the right approach is to instead do a relatively short wait inline, and then if that times out, move the work to a background task that keeps trying to close the query. wdyt?
@rodesai Isn't the current close() called in a background thread already?
I looked at the QueryStreamWriter
. If the client terminates the connection (stopped the transient query), then the writer will close the query. This writer, If I understand, runs in the background, isn't it?
just curious if this PR is still in progress?
@guozhangwang It needs more discussion I think. The PR does what it is expected, but it can take long to close if the timeout is large. Currently, ksql closes a transient query in the background if I understand, so I'm not sure about this:
... Since this is the case, perhaps the right approach is to instead do a relatively short wait inline,
and then if that times out, move the work to a background task that keeps trying to close the query. wdyt?
I was wondering if by background approach means: all query metrics and metadata should be cleaned after a short period of time, then put the close() in the background to wait until it stream closes it, thus avoiding re-creating the state store directory? @rodesai is that a good understanding?
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.