
[testing] Reduce flaky tests by retrying git failures

Open emmyoop opened this issue 1 year ago • 2 comments

Housekeeping

  • [X] I am a maintainer of dbt-core

Short description

We have a lot of tests that are failing because of Git connection issues. Sometimes tox fails to install all dependencies and that causes the entire test run to fail without actually running any tests. This makes our monitoring noisy.

Suggested approach: leverage something like the nick-fields/retry@v3 action (example, but applied to the tox invocation here). A rough sketch follows below.
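A minimal sketch of what that could look like in the integration workflow, assuming the job already checks out the repo and sets up Python and tox; the step name, timeout, and attempt count are illustrative, not the actual dbt-core configuration:

```yaml
      # Illustrative only: wrap the existing tox invocation in nick-fields/retry
      # so transient git/network failures during dependency install get retried.
      - name: Run integration tests with retries
        uses: nick-fields/retry@v3
        with:
          timeout_minutes: 30   # per-attempt timeout
          max_attempts: 3       # re-run the command on failure
          retry_on: error       # retry on non-zero exit codes
          command: tox -- --ddtrace
```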

Acceptance criteria

Anytime we use git during testing, we should have retry logic in place.

Suggested Tests

This task is specifically about tests.

We can force a test to fail in a commit and observe that the retry works as expected at the integration-group level.

Impact to Other Teams

The adapters team won't be impacted, but they may be interested if we come up with a solution.

Will backports be required?

Backport as far back as we can to reduce this noise.

Context

Log output from a test run failing at the tox install step:

Run tox -- --ddtrace
integration: install_deps> python -I -m pip install -r dev-requirements.txt -r editable-requirements.txt
  Running command git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu
  error: RPC failed; curl 16 Error in the HTTP2 framing layer
  fatal: expected 'packfile'
  fatal: could not fetch 22b2ad3f683cca452f28320c0aba8bb95933ca6e from promisor remote
Collecting git+https://github.com/dbt-labs/dbt-adapters.git@main (from -r dev-requirements.txt (line 1))
  Cloning https://github.com/dbt-labs/dbt-adapters.git (to revision main) to /tmp/pip-req-build-g9zkv3vu
integration: exit 1 (2.55 seconds) /home/runner/work/dbt-core/dbt-core> python -I -m pip install -r dev-requirements.txt -r editable-requirements.txt pid=1980
  warning: Clone succeeded, but checkout failed.
  You can inspect what was checked out with 'git status'
  and retry with 'git restore --source=HEAD :/'

  error: subprocess-exited-with-error
  
  × git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu did not run successfully.
  │ exit code: 128
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
  integration: FAIL code 1 (5.44 seconds)
  evaluation failed :( (5.50 seconds)

Error: Process completed with exit code 1.

A sample of tests marked as flaky that are likely just connection issues. There may not be a solution when there's a longer GitHub outage. Look through #7808 for other possible failures:

  • #9906
  • max retries exceeded: #9905, #9903
  • timeout: #9902, #9900

Note: integration tests are run with the workflow_dispatch trigger in scheduled testing here. Typically they would be run with the workflow_call trigger, but they aren't because this workflow is special (comment).

emmyoop avatar Apr 12 '24 14:04 emmyoop

From refinement:

  • At what level should the retry logic live? Options: the GH workflow (all we can really do if we fail at the tox step), or using existing retry/fallback code
  • Could consider marking all gh-sensitive tests as a group and running them on their own test worker (see the sketch below)
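If we go the grouping route, a sketch of a dedicated job for that group, assuming a hypothetical `gh_sensitive` pytest marker; the job layout is illustrative and omits the Python/tox setup the real workflow would need:

```yaml
  # Hypothetical job that runs only gh-sensitive tests, wrapped in retries.
  # The "gh_sensitive" marker does not exist today; it is only an assumption here.
  integration-gh-sensitive:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run gh-sensitive tests with retries
        uses: nick-fields/retry@v3
        with:
          timeout_minutes: 30
          max_attempts: 3
          command: tox -- --ddtrace -m gh_sensitive
```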

MichelleArk avatar Apr 16 '24 18:04 MichelleArk

Hit this again on 1.3 and 1.4 today

emmyoop avatar Apr 16 '24 19:04 emmyoop

@emmyoop It looks like pip already retries network connections up to 5 times: https://pip.pypa.io/en/stable/cli/pip/#cmdoption-retries

Given this information, I'm not sure if adding retries to our test runner (tox in this case) would improve the situation.
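If we did want to lean on pip's built-in behavior, one small thing to try would be raising that retry count for the install step via pip's `PIP_RETRIES` environment variable (pip reads options from `PIP_<OPTION>` variables). A sketch, with an illustrative value; note this governs pip's own HTTP requests and likely wouldn't cover the failing `git clone` subprocess in the log above, and tox may need `pass_env` configuration to forward the variable:

```yaml
      # Sketch only: bump pip's HTTP retry count for the install performed by tox.
      # PIP_RETRIES maps to pip's --retries option; the value 10 is illustrative.
      - name: Run integration tests
        env:
          PIP_RETRIES: "10"
        run: tox -- --ddtrace
```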

Similar issue in a GCP repo: https://github.com/GoogleCloudPlatform/python-docs-samples/issues/3485#issuecomment-624417589

Thoughts?

aranke avatar May 08 '24 15:05 aranke

Opened a new issue in dbt-labs/docs.getdbt.com: https://github.com/dbt-labs/docs.getdbt.com/issues/5504

FishtownBuildBot avatar May 14 '24 11:05 FishtownBuildBot

Hey @aranke, it looks like this opened a docs issue -- can I double-check what customer-facing changes are needed? From skimming this issue, it looks like this is more internal testing?

mirnawong1 avatar Jul 17 '24 13:07 mirnawong1