Retry connection on 'Remote end closed connection without response'
I initially reported this issue against dbt-databricks but was asked by a collaborator to file it here.
Basically, when running dbt requests against a serverless SQL warehouse, we get intermittent errors like the one below, and the DAG execution is aborted:

Runtime Error ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I believe it should be safe to retry the connection and re-execute the query.
databricks-sql-connector==2.9.5
Hi @septponf! Thank you for reporting this issue. Yes, the retry logic is currently not perfect, and each report like this helps to make it better. I need to explore your issue more, but in general, retrying query execution is not safe, even on errors like a terminated connection. If you submitted a query but didn't get a response from the server (for any reason), you don't know at which stage query execution failed. Maybe the server didn't even start processing the request, but it's also possible that the query was executed successfully and the server failed while sending the response back to the library. It's relatively safe to re-execute read queries (e.g. SELECT or SHOW TABLES), but it's definitely not safe to re-execute queries that update data.
We do have a list of errors that are safe to retry, and this library still doesn't fully implement it. Right now I'm doing another iteration to improve the retry logic of this library, and I will check what I can do in your case. But keep in mind what I explained above: in some cases, only the user can decide what is safe to retry and what is not.
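In the meantime, if you need a workaround on your side, something like the sketch below may help: it re-opens the connection and re-runs only statements that the calling code has decided are idempotent. This is only a sketch, not part of the connector: the whitelist, helper names, retry counts, and placeholder credentials are all my own assumptions, and the decision about what is safe to retry stays with you.

```python
import time

from databricks import sql  # databricks-sql-connector

# Statement prefixes the *caller* considers safe to re-execute.
# This whitelist is an illustration only; adjust it to your own workloads.
SAFE_PREFIXES = ("SELECT", "SHOW", "DESCRIBE")


def execute_with_retry(connection_factory, query, max_attempts=3, delay_s=5):
    """Open a fresh connection and run `query`, retrying only if the
    statement looks idempotent according to SAFE_PREFIXES."""
    retryable = query.lstrip().upper().startswith(SAFE_PREFIXES)
    attempt = 0
    while True:
        attempt += 1
        try:
            with connection_factory() as conn, conn.cursor() as cursor:
                cursor.execute(query)
                if cursor.description is None:  # statement returned no result set
                    return None
                return cursor.fetchall()
        except Exception:  # e.g. the 'Remote end closed connection' error;
            # narrowing this to the connector's exception types is left to the caller
            if not retryable or attempt >= max_attempts:
                raise
            time.sleep(delay_s)


# Usage sketch; hostname, HTTP path, and token are placeholders.
def make_connection():
    return sql.connect(
        server_hostname="<workspace-host>",
        http_path="<warehouse-http-path>",
        access_token="<token>",
    )


rows = execute_with_retry(make_connection, "SELECT 1")
```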
Thank you @kravets-levko for the swift response. I understand the general predicament, but general retries for e.g. select x, show x, alter x, create or replace x would be nice.
I appreciate you looking into this
@kravets-levko Found this issue when looking into the same problem with dbt runs against serverless warehouses. I did want to add a comment: on the issue in the dbt-databricks repo, @benc-db said the following:
Would you mind filing against https://github.com/databricks/databricks-sql-python? Basically explain that we don't retry when we get 'Remote end closed connection without response', but that it should be safe to do so? In that package we aim to retry safe commands, i.e. ones that either are idempotent or that we know the server didn't receive, but in this case we have evidence that getting this response means the server didn't receive or otherwise that no action was taken. I will also take into consideration some version of model retry, but do not have capacity to explore right now.
You stated in your response to @septponf that we shouldn't retry these because it is not safe; does what @benc-db stated change that? It seems the two of you have differing opinions on whether it is truly safe to retry here. If @benc-db is correct and we can be confident that this error means we were never able to run the query, then we should be able to retry.
On the other hand, if you are correct and we cannot guarantee that it is safe to retry based on this error message then we can likely just close the issue on this repo, as it will by necessity need to be handled downstream (or so I would think).
@NodeJSmith since I commented that, I have subsequently seen issues where the connection gets broken but the thrift server does schedule the command for execution ;(
Damn, that's unfortunate. Would there be any way to query the Databricks API for the status of the query using the statement ID, and attempt to retry based on that, like with the get_status call?
If we have a statement ID, does that mean it was scheduled? I think the core idea makes sense if we have the ID available in cases where we get disconnected. Don't reissue, but just check to see if the server knows about it. That might also fail, because in the scenarios I'm thinking of the server is so overloaded that we stop getting responses, but it's something to try. @kravets-levko thoughts?
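To make the idea concrete, here is a rough sketch of that status check using the SQL Statement Execution API's GET /api/2.0/sql/statements/{statement_id} endpoint. Big caveat: I'm assuming the ID the connector holds can be resolved by that endpoint, which may not hold for statements submitted over the Thrift path, so treat this as an illustration of the idea rather than a verified approach.

```python
import requests


def get_statement_state(host, token, statement_id):
    """Ask the workspace whether it knows about `statement_id`.

    Assumption: the ID is resolvable via the SQL Statement Execution API,
    which may not be true for statements issued over the Thrift path.
    """
    resp = requests.get(
        f"https://{host}/api/2.0/sql/statements/{statement_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    if resp.status_code == 404:
        return None  # server has no record of it -> likely safe to resubmit
    resp.raise_for_status()
    # e.g. PENDING / RUNNING / SUCCEEDED / FAILED / CANCELED / CLOSED
    return resp.json().get("status", {}).get("state")


# Sketch of how a caller might use it after a dropped connection
# (host, token, and statement ID are placeholders):
# state = get_statement_state("<workspace-host>", "<token>", "<statement-id>")
# if state is None:
#     ...  # not known to the server -> resubmit
# elif state in ("PENDING", "RUNNING"):
#     ...  # poll for completion instead of resubmitting
```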
Hi all, do you guys have any workaround for this? We are using databricks-sql-connector 2.9.5 and running into this issue quite frequently.