[testing] Reduce flaky tests by retrying git failures
Housekeeping
- [X] I am a maintainer of dbt-core
Short description
We have a lot of tests that are failing because of Git connection issues. Sometimes tox fails to install all dependencies and that causes the entire test run to fail without actually running any tests. This makes our monitoring noisy.
Suggested approach: use something like the nick-fields/retry@v3 action (example, but in the tox invocation here)
Acceptance criteria
Any time tests use git, retry logic is in place
Suggested Tests
This task is specifically for tests
- We can force a test to fail in a commit and observe that the retry works as expected at the integration group level
Impact to Other Teams
Adapters team won't be impacted but may be interested if we come up with a solution
Will backports be required?
backport as far as we can to reduce this noise
Context
log output from test failing on tox
Run tox -- --ddtrace
integration: install_deps> python -I -m pip install -r dev-requirements.txt -r editable-requirements.txt
Running command git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu
error: RPC failed; curl 16 Error in the HTTP2 framing layer
fatal: expected 'packfile'
fatal: could not fetch 22b2ad3f683cca452f28320c0aba8bb95933ca6e from promisor remote
Collecting git+https://github.com/dbt-labs/dbt-adapters.git@main (from -r dev-requirements.txt (line 1))
Cloning https://github.com/dbt-labs/dbt-adapters.git (to revision main) to /tmp/pip-req-build-g9zkv3vu
integration: exit 1 (2.55 seconds) /home/runner/work/dbt-core/dbt-core> python -I -m pip install -r dev-requirements.txt -r editable-requirements.txt pid=1980
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
integration: FAIL code 1 (5.44 seconds)
evaluation failed :( (5.50 seconds)
Error: Process completed with exit code 1.
Sample of tests marked as flaky but are likely just connection issues. There may not be a solution when there's a longer GitHub outage. Look through #7808 for other possible failures.
#9906
max retries exceeded #9905 #9903
timeout #9902 #9900
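To separate likely connection failures from real test failures, a log could be scanned for the error strings seen above. The following is only a rough sketch: `TRANSIENT_PATTERNS` and `looks_transient` are hypothetical names, and the patterns are taken from the tox log and the issue labels in this thread, so they are not an exhaustive or authoritative list.

```python
import re

# Hypothetical pattern list, assembled from the failures quoted in this issue.
# Extend it as new connection-related failures show up.
TRANSIENT_PATTERNS = [
    r"RPC failed",
    r"expected 'packfile'",
    r"could not fetch .* from promisor remote",
    r"max retries exceeded",
    r"timeout|timed out",
]

_TRANSIENT_RE = re.compile("|".join(TRANSIENT_PATTERNS), re.IGNORECASE)


def looks_transient(log_text: str) -> bool:
    """Return True if the log contains a known connection-style error string."""
    return _TRANSIENT_RE.search(log_text) is not None
```

A check like this could decide whether a failed step is worth retrying, instead of retrying every failure blindly.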
Note: integration tests are run with the workflow_dispatch trigger in scheduled testing here. Typically they would be run with the workflow_call trigger, but they aren't because this workflow is special (comment)
From refinement:
- At what level should the retry logic live? Options: GH workflow (all we can really do if we fail at the tox step), using existing retry/fallback code
- Could consider marking all gh-sensitive tests into a group and running them on their own test worker
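If the retry ends up living around the tox step, one option is a small wrapper that reruns a command with backoff. This is a minimal sketch, not an existing dbt-core helper: `run_with_retries`, its parameters, and the default delays are all assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_subprocess(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_retries(
    cmd: Sequence[str],
    attempts: int = 3,
    base_delay: float = 5.0,
    run: Callable[[Sequence[str]], int] = _run_subprocess,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0 or attempts run out; return the last exit code."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 1
    for attempt in range(1, attempts + 1):
        code = run(cmd)
        if code == 0:
            return 0
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} exited with {code}; retrying in {delay}s")
            sleep(delay)
    return code
```

For example, `run_with_retries(["tox", "--", "--ddtrace"])` would retry the whole invocation, which also reruns genuinely failing tests. Retrying only the environment setup (for instance `tox --notest`) and then running the tests once may be a better fit, assuming the failures are limited to dependency installation.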
Hit this again on 1.3 and 1.4 today
@emmyoop It looks like pip already retries network connections up to 5 times: https://pip.pypa.io/en/stable/cli/pip/#cmdoption-retries
Given this information, I'm not sure if adding retries to our test runner (tox in this case) would improve the situation.
Similar issue in a GCP repo: https://github.com/GoogleCloudPlatform/python-docs-samples/issues/3485#issuecomment-624417589
Thoughts?
Opened a new issue in dbt-labs/docs.getdbt.com: https://github.com/dbt-labs/docs.getdbt.com/issues/5504
hey @aranke , it looks like this opened a docs issue -- can I double check what customer-facing changes are needed? from skimming this issue, it looks like this is more internal testing?