Retry connection on 'Remote end closed connection without response'
I initially reported this issue against dbt-databricks but was asked by a collaborator to file it here.
Basically, when running dbt requests against a serverless SQL warehouse, we get intermittent errors like the one below, and the DAG execution is aborted:

Runtime Error ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I believe it should be safe to retry the connection and re-execute the query.
databricks-sql-connector==2.9.5
Hi @septponf! Thank you for reporting this issue. Yes, the retry logic is currently not perfect, and each report like this helps to make it better. I need to explore your issue more, but in general, retrying query execution is not safe, even on errors like a terminated connection. If you submitted a query but didn't get a response from the server (for any reason), you don't know at which stage query execution failed. Maybe the server didn't even start processing the request, but it's also possible that the query was executed successfully and the server failed while sending the response back to the library. It's relatively safe to re-execute read queries (e.g. SELECT or SHOW TABLES), but it's definitely not safe to re-execute queries that update data.
We do have a list of errors that are safe to retry, and this library still doesn't fully implement it. Right now I'm doing another iteration to improve the retry logic of this library, and I will check what I can do in your case. But keep in mind what I explained above: in some cases, only the user can decide what is safe to retry and what is not.
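In the meantime, if you need a workaround on your side, something like the sketch below may help: it re-opens the connection and re-runs only statements that the calling code has decided are idempotent. This is only a sketch, not part of the connector: the whitelist, helper names, retry counts, and placeholder credentials are all my own assumptions, and the decision about what is safe to retry stays with you.

```python
import time

from databricks import sql  # databricks-sql-connector

# Statement prefixes the *caller* considers safe to re-execute.
# This whitelist is an illustration only; adjust it to your own workloads.
SAFE_PREFIXES = ("SELECT", "SHOW", "DESCRIBE")


def execute_with_retry(connection_factory, query, max_attempts=3, delay_s=5):
    """Open a fresh connection and run `query`, retrying only if the
    statement looks idempotent according to SAFE_PREFIXES."""
    retryable = query.lstrip().upper().startswith(SAFE_PREFIXES)
    attempt = 0
    while True:
        attempt += 1
        try:
            with connection_factory() as conn, conn.cursor() as cursor:
                cursor.execute(query)
                if cursor.description is None:  # statement returned no result set
                    return None
                return cursor.fetchall()
        except Exception:  # e.g. the 'Remote end closed connection' error;
            # narrowing this to the connector's exception types is left to the caller
            if not retryable or attempt >= max_attempts:
                raise
            time.sleep(delay_s)


# Usage sketch; hostname, HTTP path, and token are placeholders.
def make_connection():
    return sql.connect(
        server_hostname="<workspace-host>",
        http_path="<warehouse-http-path>",
        access_token="<token>",
    )


rows = execute_with_retry(make_connection, "SELECT 1")
```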
Thank you @kravets-levko for the swift response. I understand the general predicament, but general retries for e.g. select x, show x, alter x, create or replace x would be nice.
I appreciate you looking into this
@kravets-levko Found this issue when looking into the same problem with dbt runs against serverless warehouses. I did want to add a comment: on the issue in the dbt-databricks repo, @benc-db said the following:
Would you mind filing against https://github.com/databricks/databricks-sql-python? Basically explain that we don't retry when we get 'Remote end closed connection without response', but that it should be safe to do so? In that package we aim to retry safe commands, i.e. ones that either are idempotent or that we know the server didn't receive, but in this case we have evidence that getting this response means the server didn't receive or otherwise that no action was taken. I will also take into consideration some version of model retry, but do not have capacity to explore right now.
You stated in your response to @septponf that we shouldn't retry these because it is not safe; does what @benc-db stated change that? It seems the two of you have differing opinions on whether it is truly safe to retry here. If @benc-db is correct and we can be confident that this error means we were never able to run the query, then we should be able to retry.
On the other hand, if you are correct and we cannot guarantee that it is safe to retry based on this error message then we can likely just close the issue on this repo, as it will by necessity need to be handled downstream (or so I would think).
@NodeJSmith since I commented that, I have subsequently seen issues where the connection gets broken but the thrift server does schedule the command for execution ;(
Damn, that's unfortunate. Would there be any way to query the Databricks API for the status of the query using the statement ID, and attempt to retry based on that, like with the get_status call?
If we have a statement ID, does that mean it was scheduled? I think the core idea makes sense if we have the ID available in cases where we get disconnected. Don't reissue, but just check to see if the server knows about it. That might also fail, because in the scenarios I'm thinking of the server is so overloaded that we stop getting responses, but it's something to try. @kravets-levko thoughts?
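To make the idea concrete, here is a rough sketch of that status check using the SQL Statement Execution API's GET /api/2.0/sql/statements/{statement_id} endpoint. Big caveat: I'm assuming the ID the connector holds can be resolved by that endpoint, which may not hold for statements submitted over the Thrift path, so treat this as an illustration of the idea rather than a verified approach.

```python
import requests


def get_statement_state(host, token, statement_id):
    """Ask the workspace whether it knows about `statement_id`.

    Assumption: the ID is resolvable via the SQL Statement Execution API,
    which may not be true for statements issued over the Thrift path.
    """
    resp = requests.get(
        f"https://{host}/api/2.0/sql/statements/{statement_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    if resp.status_code == 404:
        return None  # server has no record of it -> likely safe to resubmit
    resp.raise_for_status()
    # e.g. PENDING / RUNNING / SUCCEEDED / FAILED / CANCELED / CLOSED
    return resp.json().get("status", {}).get("state")


# Sketch of how a caller might use it after a dropped connection
# (host, token, and statement ID are placeholders):
# state = get_statement_state("<workspace-host>", "<token>", "<statement-id>")
# if state is None:
#     ...  # not known to the server -> resubmit
# elif state in ("PENDING", "RUNNING"):
#     ...  # poll for completion instead of resubmitting
```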
Hi all, do you guys have any workaround for this? We are using databricks-sql-connector 2.9.5 and running into this issue quite frequently.