dbt-sqlserver icon indicating copy to clipboard operation
dbt-sqlserver copied to clipboard

Retry model on connection errors

Open infused-kim opened this issue 4 years ago • 15 comments

Hey Guys,

Is it possible to implement re-trying of queries if the query fails due to non-model related errors?

For example, sometimes Azure SQL fails with errors like this:

[2021-02-24 05:50:02,777] {dbt_rpc.py:83} INFO - Failed model `base_signups`: Database Error in model base_thrive_leads_signups (models/02_base/base_signups.sql)
  ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x68 (104) (SQLExecDirectW)')
  compiled SQL at target/run/my_project/models/02_base/base_signups.sql

This is particularly frustrating in daily, automated runs through airlfow. I have solved it for now by adding retry logic through airflow that re-runs only failed and skipped models, but perhaps it would be worthwhile solving this at the dbt-sqlserver level for everyone?

Thanks, kim

infused-kim avatar Feb 24 '21 06:02 infused-kim

Interesting. I have never had issues with that. Sounds like something to put in dbt-core? @jtcohen6 ?

mikaelene avatar Feb 24 '21 06:02 mikaelene

@mikaelene There is retry logic within the connection object for some adapters, notably dbt-bigquery. Here we define which errors are intermittent, unlikely to recur, and thereby "retryable". Here is where those errors are caught, handled, and retried, up to the number of retries (docs) the user has defined:

I don't know what TCP Provider: Error code 0x68 is, what would cause it, or why it comes up intermittently, but all of that would be good to sort out before trying to handle it.

jtcohen6 avatar Feb 24 '21 08:02 jtcohen6

i was looking for a 'timeout' to wait until a connection is established.

majidaldo avatar Sep 22 '21 15:09 majidaldo

i was looking for a 'timeout' to wait until a connection is established.

@majidaldo can you tell me more about what you're looking for with a timeout?

dataders avatar Sep 22 '21 16:09 dataders

i was looking for a 'timeout' to wait until a connection is established.

@majidaldo can you tell me more about what you're looking for with a timeout?

On Azure SQL, you can have a db 'sleep'. It can take a few minutes to resume. In the meantime, connection attempts usually fail.

majidaldo avatar Sep 22 '21 16:09 majidaldo

i was looking for a 'timeout' to wait until a connection is established.

@majidaldo can you tell me more about what you're looking for with a timeout?

On Azure SQL, you can have a db 'sleep'. It can take a few minutes to resume. In the meantime, connection attempts usually fail.

great point! this is something we've been thinking about w/ azure sql serverless as well. do you mind opening a separate issue for that scenario? I believe your ask is different than @infused-kim's.

dataders avatar Sep 22 '21 16:09 dataders

i was looking for a 'timeout' to wait until a connection is established.

@majidaldo can you tell me more about what you're looking for with a timeout?

On Azure SQL, you can have a db 'sleep'. It can take a few minutes to resume. In the meantime, connection attempts usually fail.

great point! this is something we've been thinking about w/ azure sql serverless as well. do you mind opening a separate issue for that scenario? I believe your ask is different than @infused-kim's.

https://github.com/dbt-msft/dbt-sqlserver/issues/162

majidaldo avatar Dec 12 '21 21:12 majidaldo

We've seen some intermittent database connection error with Redshift, something like

connection to server at "redshift cluster domain name", port 5439 failed: timeout expired

If there is way to configure retry for a specified times, that would be great!

yduan-polo avatar Jun 29 '22 13:06 yduan-polo

@yduan-polo I think you want to bring up your error as an issue on the dbt-redshift adapter repo!

dataders avatar Jun 29 '22 16:06 dataders

I think we've got one for just the thing: https://github.com/dbt-labs/dbt-redshift/issues/96

(though the fix would more likely come in https://github.com/dbt-labs/dbt-core/issues/5022)

jtcohen6 avatar Jun 29 '22 16:06 jtcohen6

We've been encountering regular transient network issues as well which would be solved by a retry. Are there any plans to support this as a profile parameter (like in Postgres / redshift) or not?

nysthee avatar Aug 22 '22 07:08 nysthee

We've been encountering regular transient network issues as well which would be solved by a retry. Are there any plans to support this as a profile parameter (like in Postgres / redshift) or not?

we could! can you share the error messages you're seeing, as well as the SQL Server product that you're using (on-prem, Azure SQL, etc)

dataders avatar Aug 23 '22 14:08 dataders

Hi This is azure sql.

Encountered an error while running operation: Database Error 123 ('28000', "[28000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user ''. (18456) (SQLDriverConnect)")

nysthee avatar Aug 23 '22 14:08 nysthee

Hi This is azure sql.

are you using serverless Azure SQL that auto-pauses? while frustrating, this is behavior & message we expect to see. Normally, ~30-60s later the database is resumed and the error message goes away.

Would you like for dbt to retry a connection every few seconds until the error is resumed. Also, if you do for example have the wrong password, would you like dbt to assume that the error is transient and keep trying every few seconds to re-connect?

not trying to be critical, i'm just narrowing down the use case

dataders avatar Aug 25 '22 18:08 dataders

Hi

No we are not using the auto pause variant.

I think more something try x amount of times and then fail anyway in case of a non transient error. I believe the error code coming back from sql server should specify the kind of error that gets returned maybe that could narrow the cases were a retry would be even useful?

On 25 August 2022 at 20:14:22, Anders @.***) wrote:

Hi This is azure sql.

are you using serverless Azure SQL that auto-pauses? while frustrating, this is behavior & message we expect to see. Normally, ~30-60s later the database is resumed and the error message goes away.

Would you like for dbt to retry a connection every few seconds until the error is resumed. Also, if you do for example have the wrong password, would you like dbt to assume that the error is transient and keep trying every few seconds to re-connect?

not trying to be critical, i'm just narrowing down the use case

— Reply to this email directly, view it on GitHub https://github.com/dbt-msft/dbt-sqlserver/issues/119#issuecomment-1227607439, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACYIREBEF2DV6RK33JNE4RLV26ZX5ANCNFSM4YD34OPQ . You are receiving this because you commented.Message ID: @.***>


Thomas +32 475 269 452 <+32%20475%20269%20452>

nysthee avatar Aug 25 '22 20:08 nysthee

The dbt-sqlserver 1.2.0 release includes an option retries in the target configuration which should mitigate these kinds of issues. Let me know if that doesn't work for you so that I can reopen this issue.

sdebruyn avatar Oct 07 '22 10:10 sdebruyn

For reference for others finding this issue.

Code: https://github.com/dbt-msft/dbt-sqlserver/blob/master/dbt/adapters/sqlserver/sql_server_connection_manager.py

Inherited by synapse: https://github.com/microsoft/dbt-synapse/blob/master/dbt/adapters/synapse/synapse_connection_manager.py

Documented here: https://docs.getdbt.com/docs/core/connect-data-platform/mssql-setup

Does anyone have guidance on recommended number of retries? I was going to default to 3, but is there a tradeoff here?

alittlesliceoftom avatar Oct 04 '23 09:10 alittlesliceoftom

Does it actually retry after the initial connection for you guys? https://github.com/dbt-msft/dbt-sqlserver/issues/507

ka-weihe avatar Jul 02 '24 07:07 ka-weihe