
Release test air_benchmark_xgboost_cpu_10.aws failed


Release test air_benchmark_xgboost_cpu_10.aws failed. See https://buildkite.com/ray-project/release-tests-branch/builds/1774#01889f07-2fb3-4bea-a864-9a783bc864a4 for more details. cc @ml

 -- created by ray-test-bot

can-anyscale opened this issue Jun 09 '23 07:06

Blamed commit: 30e6b292d87013b290f37399c5b04ffaaff583fb found by bisect job https://buildkite.com/ray-project/release-tests-bisect/builds/216

can-anyscale commented Jun 09 '23 11:06

Test has been failing for far too long. Jailing.

can-anyscale commented Jun 14 '23 06:06

The cluster env issue should be fixed along with this test: https://github.com/ray-project/ray/issues/36299

But this test is still failing for other reasons -- we need to keep this open and investigate.

Results: {'training_time': 948.5745256880001, 'prediction_time': 472.03880572799994}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 179, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 158, in main
    raise RuntimeError(
RuntimeError: Batch prediction on XGBoost is taking 472.03880572799994 seconds, which is longer than expected (450 seconds).
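For reference, a minimal sketch of the kind of timing check that raises this error, assuming the 450-second threshold and the structure implied by the traceback (the actual workloads/xgboost_benchmark.py may differ):

import time

# Hypothetical threshold matching the error message; the real constant in
# workloads/xgboost_benchmark.py may be named and wired up differently.
PREDICTION_TIME_LIMIT_S = 450

def check_prediction_time(predict_fn) -> float:
    """Time the batch-prediction step and fail the benchmark if it is too slow."""
    start = time.monotonic()
    predict_fn()
    prediction_time = time.monotonic() - start
    if prediction_time > PREDICTION_TIME_LIMIT_S:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({PREDICTION_TIME_LIMIT_S} seconds)."
        )
    return prediction_time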


justinvyu commented Jun 14 '23 07:06

Oh, since this test has been jailed: after you commit the fix, feel free to close the issue.

Closing the issue will automatically unjail it.

can-anyscale commented Jun 14 '23 17:06

@justinvyu is this issue fixed? thanks

can-anyscale commented Jun 21 '23 00:06

@can-anyscale I believe this test has been flaky for a long time, and the previous issue causing the cluster env to fail building was a separate issue that has been fixed.

The test running over the timeout is the reason for the flakiness (see below):

Results: {'training_time': 948.5745256880001, 'prediction_time': 472.03880572799994}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 179, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 158, in main
    raise RuntimeError(
RuntimeError: Batch prediction on XGBoost is taking 472.03880572799994 seconds, which is longer than expected (450 seconds).

So, the test is still flaky and needs to be investigated. Once the test is jailed, I'll need to manually kick off the test to debug it on a PR?

justinvyu commented Jun 21 '23 17:06

@justinvyu got you; yes, you'll need to manually kick off the test to debug it on a PR.

can-anyscale commented Jun 21 '23 17:06

Actually, I think this can be closed. The timeouts seem to affect only the GCE variants -- probably because the machine specs are slightly different from the AWS versions. I've noticed this in other tests as well: https://github.com/ray-project/ray/pull/36276#issuecomment-1586607870

We should unjail the test and then debug the GCE version separately.

GCE results, which time out (2 runs):

Run 1: Results: {'training_time': 976.5222113799999, 'prediction_time': 571.5934856480001}
Run 2: Results: {'training_time': 948.5745256880001, 'prediction_time': 472.03880572799994}

AWS results, which pass:

(screenshot of passing AWS results, 2023-06-21)

justinvyu commented Jun 22 '23 01:06

Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release-tests-branch/builds/1817#0188e1ca-58ca-4046-9eb1-06af28a5f4f5

can-anyscale commented Jun 22 '23 07:06

It's very interesting to see the difference between failing and non-failing runs.

In failing runs, we often see very long times before training actually starts.

E.g. here (passing)

Training finished iteration 1 at 2023-06-09 03:40:54. Total running time: 3min 38s

vs here (failing)

Training finished iteration 1 at 2023-06-21 23:43:58. Total running time: 8min 31s

A training iteration usually takes ~60 seconds in passing trials but ~120 seconds in failing trials.
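One rough way to quantify this across runs is to diff the timestamps of consecutive "Training finished iteration ..." log lines; a hypothetical helper (not part of the release tooling), assuming exactly the log format quoted above:

import re
from datetime import datetime

# Matches lines like:
#   "Training finished iteration 1 at 2023-06-21 23:43:58. Total running time: 8min 31s"
LINE_RE = re.compile(
    r"Training finished iteration \d+ at ([\d-]+ [\d:]+)\. "
    r"Total running time: (?:\d+min )?\d+s"
)

def iteration_durations(log_text: str) -> list[float]:
    """Return seconds elapsed between consecutive 'finished iteration' lines.

    Note: this does not capture the time before the first iteration starts,
    which also looks inflated in the failing runs.
    """
    timestamps = [
        datetime.fromisoformat(match.group(1))
        for match in LINE_RE.finditer(log_text)
    ]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]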

I can't examine the installed packages in the failing runs because they use BYOD. My best guess at the moment is that the BYOD image either has different packages installed or generally behaves differently from the cluster envs we usually ship.

I'm kicking off a non-BYOD run here: https://buildkite.com/ray-project/release-tests-pr/builds/43157

krfricke commented Jun 22 '23 07:06

The non-BYOD run passes, so the failure is due to the BYOD image: https://buildkite.com/ray-project/release-tests-pr/builds/43157#_

@can-anyscale what is the best way to inspect the BYOD docker image?

krfricke commented Jun 22 '23 08:06

@krfricke let me spend some time investigating myself and loop back to you; for now I think we should merge the PR to remove it from BYOD ;)

can-anyscale commented Jun 22 '23 17:06

Also, to your question, Kai, about how to inspect what's inside the image: all BYOD images come with this list of pinned dependency versions: https://github.com/ray-project/ray/blob/master/release/ray_release/byod/requirements_ml_byod.txt
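One low-tech way to check the image against that pin list is to compare installed versions from inside a running cluster; a rough sketch, assuming simple name==version lines and that the requirements file is available at the repo path below (both are assumptions for illustration):

import importlib.metadata

# Path inside a repo checkout; adjust to wherever the pinned list actually
# lives in the image.
REQUIREMENTS_FILE = "release/ray_release/byod/requirements_ml_byod.txt"

def report_version_mismatches(requirements_path: str = REQUIREMENTS_FILE) -> None:
    """Print packages whose installed version differs from the pinned one.

    Only handles plain 'name==version' lines; comments, extras, and
    environment markers are skipped for brevity.
    """
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, _, pinned = line.partition("==")
            name, pinned = name.strip(), pinned.strip()
            try:
                installed = importlib.metadata.version(name)
            except importlib.metadata.PackageNotFoundError:
                print(f"{name}: pinned {pinned}, not installed")
                continue
            if installed != pinned:
                print(f"{name}: pinned {pinned}, installed {installed}")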

can-anyscale commented Jun 22 '23 17:06

Closing this to unjail the test.

can-anyscale commented Jun 26 '23 23:06

Test passed on latest run: https://buildkite.com/ray-project/release-tests-branch/builds/1832#0188fba3-5c7b-40c8-b79d-0d1910ea4e71

can-anyscale commented Jun 27 '23 07:06