ksql icon indicating copy to clipboard operation
ksql copied to clipboard

KSQL `run script` times out waiting for previous statement to be applied.

Open rspurgeon opened this issue 5 years ago • 10 comments

Description KSQL run script fails to run a script which contains CREATE statements which depend on previously created objects in the same script. Possibly related, the issue is observed when Confluent Platform is confiured using pkcs12 key store types instead of the legacy (default) jks store type, as observed in the cp-demo confluent demonstration project.

To Reproduce Steps to reproduce the behavior, include:

  1. Pull the latest 5.4.x CP docker images (as of 12/3/2019) and retag them as 5.4.0-SNAPSHOT locally
  2. The DEVX-1032 branch in cp-demo contains the necessary bits. Running the demo, as instructed in that repo on that branch will reliably yield the error. https://github.com/confluentinc/cp-demo/tree/DEVX-1032

Expected behavior The demo's ksql section sucessfully deploys the script as defined in the script/ksql/run-ksql.sh which utilizes the run script KSQL command.

Actual behaviour KSQL reports the following error in the KSQL CLI:

ksql> Error: command not executed since the server timed out while waiting for prior commands to finish executing.
If you wish to execute new commands without waiting for prior commands to finish, run the command 'request-pipelining ON'.
Timed out while waiting for a previous command to execute. command sequence number: 14
ksql> Exiting KSQL.

Additional context Additional context in this, possibly, related KSQL Issue: https://github.com/confluentinc/ksql/issues/3363

rspurgeon avatar Dec 03 '19 17:12 rspurgeon

@big-andy-coates this issue is still present in 5.4.1. Per your comment in https://github.com/confluentinc/ksql/issues/3363#issuecomment-557499226 , can you please confirm whether this issue should be fixed in 5.5.0?

ybyzek avatar Mar 07 '20 16:03 ybyzek

Sorry for the delay in looking into this!

This error:

Error: command not executed since the server timed out while waiting for prior commands to finish executing.

Tells me that this issue is not related to #3363, which reports an error where submitting multiple inter-dependent commands together fails, because the server fails to process later commands that depend on earlier commands already having been applied to the engine state.

The error is telling me that the server failed while processing a particular statement because it timed out waiting for a previous statement to be applied to the engine state.

So the title of this issue:

KSQL `run script` fails to run a script which contains CREATE statements which depend on previously created objects in the same script

Is actually misleading. This issue is NOT about inter-related statements. This error would occur even if all statements where not related. Hence I've updated the title.

So what can be causing this error? (In this instances I'm assuming you're submitting a multi-line script to the server, either directly or via the CLI's RUN SCRIPT. )

When the server processes a multi-statement request it will submit each CREATE/TERMINATE/DROP statement to command topic in turn, first waiting for any previous statements to be consumed by the main command processing thread. This ensures each statement is validated against the correct engine state. For example, this stops two servers both adding statements that create a source with the same name to the topic.

So your multiple statements are processed in a loop that:

  1. Writes a statement to the command topic
  2. Waits for the processing thread to process that statement
  3. Goto step 1 and process the next statement.

In your case, step 2 is timing out. This is... unusual! The processing thread is generally very quick at picking up any pending statements from the command topic. Only under very high load of new statements would we expect any issues.

I'm assuming we can rule out high load. So what's going on? Well, as you can see, the processing of your script involves both writing to, and reading from, the command topic. Is the reading side was slow, or not working for some reason, then this too could trigger this error.

I notice that the branch you're referring to is changing SSL settings. Unfortunately, the branch no longer exists, so I can't try anything out or see what's being changed. However, my educated guess would be that either:

  • somethings is stopping the processing thread from reading from the command topic.
  • or, something is making reading from the command topic very slow.

To debug, someone would probably need to look at the server logs, ensuring io.confluent.ksql.rest.server.computation.CommandRunner was logging at, at least, INFO level.

The CommandRunner is the class picking up new statements from the command topic. On start up it will log Restoring previous state from N commands where N is the number of historic commands in command topic. If you don't see that, then it can't read the command topic! Any new commands will cause a Executing statement: blah log line.

Debugging this issue will require someone to watch the above log lines come in to see if they are occurring and how quickly the server is reading commands, and to check for any errors.

I'm noted all this down to hopefully explain some background and what needs to be done to diagnose something like this in the future. Without any way to recreate this, I'm unfortunately going to have to close this issue.

big-andy-coates avatar May 15 '20 14:05 big-andy-coates

Good news and not good news:

Good news: @big-andy-coates I have created a special branch in cp-demo that reverts to the run script syntax we originally had when this issue was reported: https://github.com/confluentinc/cp-demo/tree/GH-KSQL-4029 . This branch can be used for debugging

Not good news (actually this is good news for the product, but not good news because we haven't properly chased down and understood the issue): With this branch I reran cp-demo twice and could not reproduce the issue reported in this GitHub issue. @big-andy-coates @rspurgeon Can either of you reproduce it on this branch?

ybyzek avatar May 23 '20 01:05 ybyzek

Given we can't recreate.... do we have any option but to close this issue?

big-andy-coates avatar Jun 19 '20 16:06 big-andy-coates

@rspurgeon can you please try to recreate this? If you run and can't recreate it either, then we should

  1. Close this issue
  2. Revert changes in cp-demo for faster execution

ybyzek avatar Jun 19 '20 17:06 ybyzek

@rspurgeon @ybyzek I cherry-picked the GH-KSQL-4029 commit onto current cp-demo branch 6.0.0-post head and could not reproduce this with one attempt - queries seemed to execute fine.

Also - ksqlDB CLI startup seems to be very slow in my environment - so this cuts a fair bit of time off cp-demo deployment time.

javabrett avatar Oct 25 '20 10:10 javabrett

Now I've got this constantly reproduced on Confluent 6.0 using KSQL Rest API. Getting the response

{ "@type": "generic_error", "error_code": 50301, "message": "Timed out while waiting for a previous command to execute. command sequence number: 28008" }

ealeykin avatar Mar 10 '21 17:03 ealeykin

So basically the flow as follows:

KSQL Rest API -> writes statement to kafka command topic - at this point OK KSQL -> reads statement from command topic - and uses a configured consumer timeout to get a statement for execution - this timeout is the reason for subsequent statement to fail.

ksql.server.command.response.timeout.ms, default 5s - unfortunately this seems undocumented and had to dig into sources

  • below timeout - statement execution status is SUCCESS
  • otherwise - QUEUED - subsequent statement that depends on previous one will fail, because the first one is still queued and not executed yet.

Basically increasing this timeout makes scripts processor to wait longer to get SUCCESS status before subsequent statement execution - and 15sec fixed that for me.

However I'm not sure whether this is a correct way of dealing with such timeouts. Perhaps a Kafka consumer lag causes the issue.

Any further advices on this please?

ealeykin avatar Mar 10 '21 21:03 ealeykin

Hi @ybyzek @javabrett we're currently run into this error with ksqlDB v. 0.28.2 when we do an initial deployment of our DWH. For testing / reproducing this you can simply execute a statement again and again. The execution become slower and slower until we timeout at ~ the 150th iteration.

We already tested various ways (e.g. migration tool but since under the hood all of them use the command topic to executed the statements in a synchronous manner each of them timeout eventually.

Also we cannot find anything in the docs about the mentioned setting request-pipelining from the error message:

Error: command not executed since the server timed out while waiting for prior commands to finish executing.
If you wish to execute new commands without waiting for prior commands to finish, run the command 'request-pipelining ON'

Is there a way to asynchronously deploy statements? Or how do you recommend to deploy a larger ammount of statements?

BR, Philipp

eeepmb avatar Feb 02 '23 12:02 eeepmb

Hey, is there any update about this? I am facing the same problem. There is nothing in the documents about "request-pipelining" as well. Its same with both migration tool and CLI.

Edit:

Adding this in the config works:

KSQL_KSQL_SERVER_COMMAND_RESPONSE_TIMEOUT_MS: 120000

sh3b avatar Dec 03 '23 02:12 sh3b