[ADAP-498] [Bug] BQ does not retry on 503

Open barberscott opened this issue 1 year ago • 14 comments

Is this a new bug in dbt-bigquery?

  • [X] I believe this is a new bug in dbt-bigquery
  • [X] I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Currently, if BigQuery returns a 503 error, we do not retry, even though BigQuery recommends retrying as the course of action.

Expected Behavior

This is not a regression but rather an oversight -- 503 errors should be both retryable and reopenable, since they indicate a transient unavailable condition in BigQuery.

Steps To Reproduce

Transient -- requires intermittent error from BQ.

Relevant log output

No response

Environment

- dbt-core: all 
- dbt-bigquery: all

Additional Context

No response

barberscott avatar Apr 26 '23 17:04 barberscott

Thanks for reaching out @barberscott !

We'll put this in our queue.

The solution might be as simple as adding google.cloud.exceptions.ServiceUnavailable to the list here:

https://github.com/dbt-labs/dbt-bigquery/blob/7c216445f8009baa9cec4d61dd56693be1dd79fa/dbt/adapters/bigquery/connections.py#L53-L59

dbeatty10 avatar Apr 27 '23 13:04 dbeatty10

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] avatar Oct 25 '23 01:10 github-actions[bot]

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions[bot] avatar Nov 01 '23 01:11 github-actions[bot]

@dbeatty10 I created a ServiceUnavailable instance and ran the test code (test_is_retryable).

Current state: ServiceUnavailable has not been added to RETRYABLE_ERRORS. Result: the test passed.

def test_is_retryable(self):
    _is_retryable = dbt.adapters.bigquery.connections._is_retryable
    exceptions = dbt.adapters.bigquery.impl.google.cloud.exceptions
    internal_server_error = exceptions.InternalServerError("Code Abort")
    bad_request_error = exceptions.BadRequest("Code is broken")
    # built-in ConnectionError, not a google.cloud exception
    connection_error = ConnectionError("Code broke")
    client_error = exceptions.ClientError("Invalid code")
    rate_limit_error = exceptions.Forbidden(
        "Code is broken", errors=[{"reason": "rateLimitExceeded"}]
    )
    # added: a ServiceUnavailable (503) instance
    service_unavailable_error = exceptions.ServiceUnavailable("Code is broken")

    self.assertTrue(_is_retryable(internal_server_error))
    self.assertTrue(_is_retryable(bad_request_error))
    self.assertTrue(_is_retryable(connection_error))
    self.assertFalse(_is_retryable(client_error))
    self.assertTrue(_is_retryable(rate_limit_error))
    # the added assertion below passes
    self.assertTrue(_is_retryable(service_unavailable_error))

https://github.com/dbt-labs/dbt-bigquery/blob/06851679f75d18ece98c95d4eb2a0ddd16544f4d/dbt/adapters/bigquery/connections.py#L57-L63

The ServiceUnavailable class inherits from the ServerError class, which seems to be why it passes the test above. I'd like to fix this, but is there anything else I should look at? 🙏
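
The inheritance argument can be illustrated with a minimal, self-contained sketch. The classes below are stand-ins that only mirror the relevant part of the `google.api_core.exceptions` hierarchy (`ServiceUnavailable` inherits from `ServerError`). `RETRYABLE_ERRORS` and `is_retryable` here are simplified assumptions for illustration, not the adapter's actual code:

```python
# Stand-in classes (assumption): only the inheritance relationship matters here.
class GoogleAPICallError(Exception):
    """Stand-in base class for API call errors."""


class ServerError(GoogleAPICallError):
    """Stand-in base class for 5xx errors."""


class ClientError(GoogleAPICallError):
    """Stand-in base class for 4xx errors."""


class ServiceUnavailable(ServerError):
    """Stand-in for the 503 error class."""

    code = 503


# Simplified, hypothetical tuple of retryable error types.
RETRYABLE_ERRORS = (ServerError, ConnectionError)


def is_retryable(error: Exception) -> bool:
    """isinstance() also matches subclasses, so every ServerError subclass qualifies."""
    return isinstance(error, RETRYABLE_ERRORS)


service_unavailable_retryable = is_retryable(ServiceUnavailable("backend unavailable"))
client_error_retryable = is_retryable(ClientError("invalid code"))
```

Under these assumptions, listing the base class is enough for a 503 to be treated as retryable, which would be consistent with the test passing without any change to the list.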

jx2lee avatar Dec 18 '23 12:12 jx2lee

Adding it to the test_is_retryable test like that makes sense 👍

But ... the thing that is surprising to me: if ServiceUnavailable inherits from ServerError and your modified test passes, then why is this not being retried?

Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?

@jx2lee Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?

dbeatty10 avatar Dec 18 '23 15:12 dbeatty10

@dbeatty10

Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?

No, I believe that is impossible. Error classes can be created with the from_http_status and from_grpc_status functions (google.api_core.exceptions), and the error class these functions generate for a 503 is always "ServiceUnavailable".



Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?

I have never run into this issue myself... 🙃 I would need more detailed logs from when it happened.

IMO, if the issue reporter can't provide more error logs, I think it is okay to close the issue.

  • A 503 code does not produce any error class other than ServiceUnavailable
  • The functions that raise errors in the googleapis package only generate ServiceUnavailable for it

jx2lee avatar Dec 24 '23 08:12 jx2lee

@dbeatty10 Is there anything else I should check?

jx2lee avatar Apr 22 '24 14:04 jx2lee

We did hit this recently. We use external-tables in an on-run-start macro. We also use service account impersonation in the dbt profile. While running dbt docs generate in a CI environment we got:

('Unable to acquire impersonated credentials', '{\n  "error": {\n    "code": 503,\n    "message": "Authentication backend unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n')

Because this happens intermittently on an isolated system, I don't have more logs.
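
For anyone trying to triage similar reports, here is a minimal sketch of how the payload above could be inspected for a 503. It works only on the two-element tuple shown in the log; the thread does not say which exception type carries it, and `looks_transient` is a hypothetical helper, not part of dbt-bigquery or any Google library:

```python
import json

# The two-element payload as it appears in the report above.
error_args = (
    "Unable to acquire impersonated credentials",
    '{\n  "error": {\n    "code": 503,\n    "message": "Authentication backend unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n',
)


def looks_transient(args: tuple) -> bool:
    """Hypothetical helper: True if any argument is a JSON body whose error code is 503."""
    for arg in args:
        if not isinstance(arg, str):
            continue
        try:
            body = json.loads(arg)
        except ValueError:
            # Not JSON (for example the human-readable first element); skip it.
            continue
        error = body.get("error") if isinstance(body, dict) else None
        if isinstance(error, dict) and error.get("code") == 503:
            return True
    return False


is_transient = looks_transient(error_args)
```

A check like this only classifies the error; whether and where a retry should happen is a separate decision.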

rrbarbosa avatar Apr 25 '24 08:04 rrbarbosa

Thanks for this report @rrbarbosa !

Since this is intermittent (and maybe relatively rare also), it has been hard to nail down.

If anyone can provide information to suggest that dbt is not retrying at least once, that would be very helpful 🙏

dbeatty10 avatar Apr 30 '24 22:04 dbeatty10

@jx2lee -- would you be willing to raise a PR with the addition you made to this test case?

I think that would be sufficient for us to establish that the ServiceUnavailable is retryable (which would allow us to close this issue).

dbeatty10 avatar Apr 30 '24 22:04 dbeatty10

@dbeatty10 Okay, I will create a PR that includes the test code above soon!

jx2lee avatar May 02 '24 02:05 jx2lee

@dbeatty10 I created the PR! Could you edit the PR body or add a comment to make it easier for reviewers to understand?

jx2lee avatar May 04 '24 16:05 jx2lee

I'm not sure if this is the same code path, but we are seeing a problem with Dataproc jobs (Python models) that dbt submits. dbt successfully submits the batch job; then, during the polling in dbt-labs/dbt-bigquery/dbt/adapters/bigquery/dataproc/batch.py#poll_batch_job, one of the polling calls returns a 503 that is presumably not retried. dbt marks the model as errored, even though the Dataproc job is still running in the background and eventually completes successfully.

00:25:50  BigQuery adapter: Submitting batch job with id: 5f6d87c9-4045-4208-8941-03fbb8facf30
00:29:58  Unhandled error while executing target/run/core/models/working_tables/WT_rfm_status.py
503 502:Bad Gateway
00:29:58  58 of 63 ERROR creating python table model working_tables.WT_rfm_status ........ ERROR in 248.55s

We have seen the issue twice in a week, and we are running dbt-bigquery 1.8.1.
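
As a rough illustration of what tolerating an isolated 503 during polling could look like, here is a self-contained sketch. Everything in it is an assumption for illustration: `TransientPollError`, `poll_until_done`, the state names, and the simulated responses are hypothetical and do not reflect how `poll_batch_job` is actually implemented:

```python
import time
from typing import Callable


class TransientPollError(Exception):
    """Hypothetical stand-in for a 5xx response to a status poll."""


def poll_until_done(
    get_state: Callable[[], str],
    max_consecutive_failures: int = 3,
    interval_seconds: float = 0.0,
) -> str:
    """Poll get_state() until it returns a terminal state.

    A transient error is tolerated unless it happens
    max_consecutive_failures times in a row.
    """
    failures = 0
    while True:
        try:
            state = get_state()
        except TransientPollError:
            failures += 1
            if failures >= max_consecutive_failures:
                raise
        else:
            failures = 0  # a successful poll resets the failure counter
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                return state
        time.sleep(interval_seconds)


# Simulated poll results: one transient failure while the job is still running.
responses = iter(["RUNNING", TransientPollError("503"), "RUNNING", "SUCCEEDED"])


def fake_get_state() -> str:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


final_state = poll_until_done(fake_get_state)
```

The interval defaults to zero only so the simulation runs instantly; a real polling loop would wait between calls and would likely also enforce an overall timeout.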

OSalama avatar Jun 05 '24 12:06 OSalama

Got hit by this issue today, while generating "seed" tables with DBT running in CloudBuild:

"Step #7 - "dbt-seed": ('Unable to acquire impersonated credentials', '{\n  "error": {\n    "code": 503,\n    "message": "The service is currently unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n')"

We're using impersonation with dbt-bigquery and it seems IAM was unavailable for a moment. We have no explicit retry configured, so - by the docs - it should retry once, but I see no such thing in the logs.

mkielar avatar Jul 10 '24 05:07 mkielar