Release test air_benchmark_xgboost_cpu_10.aws failed
Release test air_benchmark_xgboost_cpu_10.aws failed. See https://buildkite.com/ray-project/release-tests-branch/builds/1774#01889f07-2fb3-4bea-a864-9a783bc864a4 for more details. cc @ml
-- created by ray-test-bot
Blamed commit: 30e6b292d87013b290f37399c5b04ffaaff583fb found by bisect job https://buildkite.com/ray-project/release-tests-bisect/builds/216
Test has been failing for far too long. Jailing.
The cluster env issue will be fixed along with this test: https://github.com/ray-project/ray/issues/36299
But this test is still failing for other reasons -- we need to keep this open and investigate.
Results: {'training_time': 948.5745256880001, 'prediction_time': 472.03880572799994}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 179, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 158, in main
    raise RuntimeError(
RuntimeError: Batch prediction on XGBoost is taking 472.03880572799994 seconds, which is longer than expected (450 seconds).
Oh, since this test has been jailed, feel free to close the issue after you commit the fix.
Closing the issue will automatically unjail it.
@justinvyu is this issue fixed? thanks
@can-anyscale I believe this test has been flaky for a long time, and the previous issue causing the cluster env to fail building was a separate issue that has been fixed.
The test running over the timeout is the reason for the flakiness (see below):
Results: {'training_time': 948.5745256880001, 'prediction_time': 472.03880572799994}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 179, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 158, in main
    raise RuntimeError(
RuntimeError: Batch prediction on XGBoost is taking 472.03880572799994 seconds, which is longer than expected (450 seconds).
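For context on where that error comes from: the workload times the training and batch prediction phases separately and fails the release test by raising once prediction exceeds a hard 450-second budget. Here's a minimal sketch of that pattern -- hypothetical names and stand-in work, not the actual contents of workloads/xgboost_benchmark.py:
```python
import time

# Hypothetical threshold matching the error message in the traceback above.
PREDICTION_TIME_THRESHOLD_S = 450.0


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


def check_prediction_time(prediction_time: float) -> None:
    # This raise is what surfaces as the RuntimeError in the quoted traceback.
    if prediction_time > PREDICTION_TIME_THRESHOLD_S:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({PREDICTION_TIME_THRESHOLD_S} seconds)."
        )


if __name__ == "__main__":
    # Stand-ins for the real benchmark phases; the actual script runs
    # Ray Train XGBoost training and batch prediction here.
    _, training_time = timed(time.sleep, 0.1)
    _, prediction_time = timed(time.sleep, 0.1)
    print({"training_time": training_time, "prediction_time": prediction_time})
    check_prediction_time(prediction_time)
```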
So, the test is still flaky and needs to be investigated. Once the test is jailed, I'll need to manually kick off the test to debug it on a PR?
@justinvyu got you, yes, you'll need to manually kick off the test to debug on PR
Actually, I think this can be closed. The timeouts seem to only affect the GCE
variants -- probably because the machine specs are slightly different from the AWS versions. I've noticed this in other tests as well: https://github.com/ray-project/ray/pull/36276#issuecomment-1586607870
We should unjail the test and then debug the GCE version separately.
GCE results, which time out (2 runs):
Run 1: Results: {'training_time': 976.5222113799999, 'prediction_time': 571.5934856480001}
Run 2: Results: {'training_time': 948.5745256880001, 'prediction_time': 472.03880572799994}
AWS results, which pass:
Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release-tests-branch/builds/1817#0188e1ca-58ca-4046-9eb1-06af28a5f4f5
It's very interesting to see the difference between failing and non-failing runs.
In failing runs, we often see very long times before training actually starts.
E.g. here (passing):
Training finished iteration 1 at 2023-06-09 03:40:54. Total running time: 3min 38s
vs. here (failing):
Training finished iteration 1 at 2023-06-21 23:43:58. Total running time: 8min 31s
A training iteration in passing trials is usually ~60 seconds but in failing trials ~120 seconds.
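If it helps to quantify that from the Buildkite logs, here's a throwaway parser sketch -- it assumes the "Training finished iteration N ... Total running time: Xmin Ys" line format quoted above and is not part of the benchmark itself:
```python
import re
import sys

# Matches Ray Train progress lines like the ones quoted above, e.g.
# "Training finished iteration 1 at 2023-06-09 03:40:54. Total running time: 3min 38s"
LINE_RE = re.compile(
    r"Training finished iteration (\d+) .*Total running time: (?:(\d+)min )?(\d+)s"
)


def iteration_durations(log_lines):
    """Yield (iteration, seconds since the previous iteration) from a log dump."""
    prev_total = 0
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        iteration = int(m.group(1))
        total = int(m.group(2) or 0) * 60 + int(m.group(3))
        yield iteration, total - prev_total
        prev_total = total


if __name__ == "__main__":
    # Usage: python iter_times.py < downloaded_buildkite_log.txt
    for iteration, seconds in iteration_durations(sys.stdin):
        print(f"iteration {iteration}: {seconds}s")
```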
I can't examine the installed packages in the failing runs because they use BYOD. My best guess at the moment is that the BYOD image either has different packages installed or generally works differently from the cluster envs we usually ship.
I'm kicking off a non-BYOD run here: https://buildkite.com/ray-project/release-tests-pr/builds/43157
The test is passing, so it is due to the BYOD image: https://buildkite.com/ray-project/release-tests-pr/builds/43157#_
@can-anyscale what is the best way to inspect the BYOD docker image?
@krfricke let me spend some time investigating myself and loop back to you; for now I think we should merge the PR to remove it from BYOD ;)
Also, to your question, Kai, about how to inspect what's inside the image: all BYOD images come with this list of pinned dependency versions: https://github.com/ray-project/ray/blob/master/release/ray_release/byod/requirements_ml_byod.txt
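And for actually looking inside the image, one low-tech option (just a sketch; it assumes you can run a small script in a container started from the image, or as a task on the cluster) is to dump the installed distributions and diff the output against that requirements file:
```python
# Dump installed packages as "name==version" lines, ready to diff against
# requirements_ml_byod.txt. Uses only the standard library (Python 3.8+).
from importlib.metadata import distributions

installed = sorted(
    (dist.metadata["Name"], dist.version)
    for dist in distributions()
    if dist.metadata["Name"]
)
for name, version in installed:
    print(f"{name}=={version}")
```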
Closing this to unjail the test.
Test passed on latest run: https://buildkite.com/ray-project/release-tests-branch/builds/1832#0188fba3-5c7b-40c8-b79d-0d1910ea4e71