ray icon indicating copy to clipboard operation
ray copied to clipboard

[release][no_ci] Make sure that test code matches the installed wheel.

Open xwjiang2010 opened this issue 2 years ago • 3 comments

Signed-off-by: xwjiang2010 [email protected]

Why are these changes needed?

When a daily release test is launched, the following may happen:

  1. Latest wheel on master is wheel_n
  2. a PR (that modifies both the test code and the source code) is landed.
  3. a wheel is being built (let's call it wheel_n+1)
  4. Now nightly release test starts to run.
  5. in run_release_test.sh, we fetch most recent wheel and install it. This is wheel_n, since wheel_n+1 is still being built.
  6. in run_release_tst.sh, we check out test code, which is current master (includes PR_n+1).
  7. Now there is a discrepancy between source code (wheel version) and test code.

This PR tries to address this inconsistency issue.

Related issue number

Closes #30055

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

xwjiang2010 avatar Nov 09 '22 21:11 xwjiang2010

I'm not sure this is the best way to go about this. Instead we should probably grab the HEAD commit in build_pipeline.py (we do a git clone there) and propagate it to the script via RAY_TEST_COMMIT. Then checkout that commit if the variable is set.

I am not sure I understand the comment around build_pipeline.py, see above comment.

It would be good to test this if possible. There are existing unit tests for the step building process that can be used as a starting point.

Yes I will add tests after we settle on the implementations.

Also, can you please add a bit of context to the PR (what is the problem, when does it come up, what do we do to solve it). This is just for future reference.

Sure, updated the description to have more details. PTAL

Thanks!

xwjiang2010 avatar Nov 10 '22 20:11 xwjiang2010

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Dec 11 '22 08:12 stale[bot]

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar Dec 26 '22 02:12 stale[bot]

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar Jan 20 '23 23:01 stale[bot]

Actually I have seen this issue during my last oncall again twice. I will try to revive this PR and land it

xwjiang2010 avatar Apr 17 '23 16:04 xwjiang2010

@krfricke sorry for the delay on this one. I saw two occasions of this my recent oncall. So let's see how to land this. Happy to address new comments.

also cc @can-anyscale

xwjiang2010 avatar Apr 18 '23 00:04 xwjiang2010

Also no idea how to add proper test... open to suggestions folks :)

xwjiang2010 avatar Apr 18 '23 00:04 xwjiang2010

Oh wow, this has been from last year, nice work though!

can-anyscale avatar Apr 18 '23 00:04 can-anyscale

Might be a dummy idea, but have we thought about just unzip the wheel to get the test code instead of git clone?

can-anyscale avatar Apr 18 '23 03:04 can-anyscale

Might be a dummy idea, but have we thought about just unzip the wheel to get the test code instead of git clone?

Not sure if release code is part of our wheel tho, it shouldn't right?

xwjiang2010 avatar Apr 18 '23 13:04 xwjiang2010

Ah, you're absolutely right, it does not

can-anyscale avatar Apr 18 '23 14:04 can-anyscale

@krfricke do you know how to run a manual test for this? The problem with "buildkite/release-tests-pr" is that it will be on my own repo and branch and hence the logic won't kick in :(

xwjiang2010 avatar Apr 18 '23 22:04 xwjiang2010

Does it work if you create a new build from here https://buildkite.com/ray-project/release-tests-pr/builds?branch=xwjiang2010%3Afix_release_test_racing, then unfold the Option to specify the environment variables

can-anyscale avatar Apr 18 '23 22:04 can-anyscale

Does it work if you create a new build from here https://buildkite.com/ray-project/release-tests-pr/builds?branch=xwjiang2010%3Afix_release_test_racing, then unfold the Option to specify the environment variables

It wont work right? This is the case that my test_repo and test_branch are not ray master. So the gating logic will kick in and not resetting hard the test commit.

xwjiang2010 avatar Apr 18 '23 23:04 xwjiang2010

Can you override the test_repo and test_branch by passing the environment variables when creating a buildkite build?

can-anyscale avatar Apr 18 '23 23:04 can-anyscale

release-tests-pr shows

Screen Shot 2023-04-19 at 8 03 21 AM

but doesn't really enter the added logic blob.

xwjiang2010 avatar Apr 19 '23 15:04 xwjiang2010

Well, let's test in production then ;)

can-anyscale avatar Apr 19 '23 15:04 can-anyscale

checked bash portion manually with

if [ "${RAY_TEST_REPO}" == "https://github.com/ray-project/ray.git" ] && \
[[ "${PARSED_RAY_WHEELS}" == *"master"*  ]] && \
[ "${RAY_TEST_BRANCH-}" == "master" ] && [ -n "${RAY_COMMIT_OF_WHEEL-}" ] && \
[ "${HEAD_COMMIT}" != "${RAY_COMMIT_OF_WHEEL}" ]; then
    echo "The checked out test code doesn't match with the installed wheel. \
This is likely due to a racing condition when a PR is landed between \
a wheel is installed and test code is checked out."
    echo "Hard resetting from ${HEAD_COMMIT} to ${RAY_COMMIT_OF_WHEEL}."
    echo "git reset --hard ${RAY_COMMIT_OF_WHEEL}"
fi

and with vars

export RAY_TEST_REPO=https://github.com/ray-project/ray.git
export PARSED_RAY_WHEELS=https://s3-us-west-2.amazonaws.com/ray-wheels/master/0e0c15065507f01e8bfe78e49b0d0de063f81164/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
export RAY_TEST_BRANCH=master
export RAY_COMMIT_OF_WHEEL=ef5bff7fd8b00bfdc6e5d2b20ae74f299339dfbe
export HEAD_COMMIT=ec2c80dde9104d1fc355c4a0f3a2f0982cdf1056

the right content is printed. I will monitor the production when I land it.

xwjiang2010 avatar Apr 19 '23 16:04 xwjiang2010

Lint and documentation error are not related. merging now.

xwjiang2010 avatar Apr 19 '23 18:04 xwjiang2010

Testing in production

Screen Shot 2023-04-19 at 11 20 28 AM

https://buildkite.com/ray-project/release-tests-branch/builds/1583

xwjiang2010 avatar Apr 19 '23 18:04 xwjiang2010

Screen Shot 2023-04-19 at 11 35 41 AM

Verified that it works in production.

xwjiang2010 avatar Apr 19 '23 18:04 xwjiang2010