ksql icon indicating copy to clipboard operation
ksql copied to clipboard

fix: closing a rebalancing transient query may leak the state store directory

Open spena opened this issue 4 years ago • 7 comments

Description

Fixes https://github.com/confluentinc/ksql/issues/6360

If a small timeout is set during the transient query close and the query is still rebalancing after the timeout expires, then the stream thread re-creates the internal topics and state stores causing the leak.

This PR derives the query timeout from the internal stream threads timeouts instead of using a configured timeout.

Testing done

Unit tests

Reviewer checklist

  • [ ] Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • [ ] Ensure relevant issues are linked (description should include text like "Fixes #")

spena avatar Dec 14 '20 17:12 spena

@agavra The user does not have a way to set those internal timeouts. All internal closing methods have hard-coded values. The implication with a high value is the user may wait too much time until the query is closed. But setting the timeout too low also have the implication of leaking the state store if the stream was rebalancing.

@rodesai What do you think about this value?

spena avatar Dec 16 '20 15:12 spena

The implication with a high value is the user may wait too much time until the query is closed.

Is this on the CLI path? e.g. if I ctrl-c a query does it hang for 5 min? Where does the user "see" this, or is it in the background?

agavra avatar Dec 16 '20 16:12 agavra

Is this on the CLI path? e.g. if I ctrl-c a query does it hang for 5 min? Where does the user "see" this, or is it in the background?

This should only come into play if the query is stuck for some reason (e.g. in a rebalance). On a normal close it should be quick. That said, the default timeout here is 5 minutes. So this patch would actually make us wait less. @spena I think based on your analysis in the ticket, we can't actually bound the wait time here, since the producer and admin clients wait indefinitely. Since this is the case, perhaps the right approach is to instead do a relatively short wait inline, and then if that times out, move the work to a background task that keeps trying to close the query. wdyt?

rodesai avatar Dec 22 '20 19:12 rodesai

@rodesai Isn't the current close() called in a background thread already?

I looked at the QueryStreamWriter. If the client terminates the connection (stopped the transient query), then the writer will close the query. This writer, If I understand, runs in the background, isn't it?

spena avatar Jan 04 '21 18:01 spena

just curious if this PR is still in progress?

guozhangwang avatar Jan 22 '21 23:01 guozhangwang

@guozhangwang It needs more discussion I think. The PR does what it is expected, but it can take long to close if the timeout is large. Currently, ksql closes a transient query in the background if I understand, so I'm not sure about this:

... Since this is the case, perhaps the right approach is to instead do a relatively short wait inline, 
and then if that times out, move the work to a background task that keeps trying to close the query. wdyt?

I was wondering if by background approach means: all query metrics and metadata should be cleaned after a short period of time, then put the close() in the background to wait until it stream closes it, thus avoiding re-creating the state store directory? @rodesai is that a good understanding?

spena avatar Jan 25 '21 22:01 spena

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

cla-assistant[bot] avatar Nov 15 '23 20:11 cla-assistant[bot]