ray icon indicating copy to clipboard operation
ray copied to clipboard

Relax check_version_info to check for bytecode compatibility

Open geofft opened this issue 2 years ago • 6 comments

As discussed in #3988, the intention of check_version_info is to ensure that cloudpickle can transfer objects across the two Python interpreters. Because cloudpickle serializes Python bytecode, it is known to be incompatible across minor versions (e.g., 3.11 to 3.12) but should be compatible across patch versions (e.g., 3.12.0 to 3.12.1).

The intention of the CPython team is that bytecode should be compatible within patch releases to the same minor version. This was last broken in 3.5.3, which was regarded as a mistake: see the commit message of python/cpython@93602e3af70d3b9f98ae2da654b16b3382b68d50 (in 3.5.10), which implies that it is reasonable to "rel[y] on pre-cached bytecode remaining valid across maintenance releases."

The constant importlib.util.MAGIC_NUMBER is used by Python to identify bytecode compatibility (it is stored in .pyc files and checked when they are loaded). The unwanted change in 3.5.3 bumped MAGIC_NUMBER, but since then, it has only changed between minor versions. See the long comment in CPython's Lib/importlib/_bootstrap_external.py for a history of the changes.

So, the practical effect of checking MAGIC_NUMBER will generally be to enforce that nodes run the same minor version of Python, but if some future patch release unexpectedly does break bytecode compatibility, checking MAGIC_NUMBER will account for that.

Why are these changes needed?

This allows using interpreters at the same minor version but a different patch version, as requested in #3988.

Related issue number

Closes #3988.

Checks

  • [X] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
    • but I ran black by hand
  • [X] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [X] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(
    • I did some manual tests and will let CI get the rest of it.

geofft avatar Nov 24 '23 18:11 geofft

Your explanation makes sense to me. But I think we need to get some confirmation from cloudpickle devs saying that checking MAGIC_NUMBER is enough? cc @pcmoritz

jjyao avatar Nov 27 '23 18:11 jjyao

@jjyao This Pr would help a lot with collaboration between workers with different patch versions of Python. Are there any plans to merge it in anytime soon?

SebastianMorawiec avatar Oct 02 '24 15:10 SebastianMorawiec

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 01 '25 00:06 github-actions[bot]

I have been waiting for a response for about eighteen months. I think we can keep this open a little longer.

geofft avatar Jun 01 '25 00:06 geofft

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 16 '25 00:06 github-actions[bot]

🗿

geofft avatar Jun 16 '25 13:06 geofft

, the intention of check_version_info is to ensure that cloudpickle can transfer objects across the two Python interpreters.

@pcmoritz is there any other reason why we want to check patch version?

jjyao avatar Jun 19 '25 18:06 jjyao

, the intention of check_version_info is to ensure that cloudpickle can transfer objects across the two Python interpreters.

@pcmoritz is there any other reason why we want to check patch version?

No, I think the check proposed in this PR is good and we should adapt it :)

pcmoritz avatar Jun 19 '25 18:06 pcmoritz

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jul 04 '25 00:07 github-actions[bot]

@geofft sorry for the late review. After discussion we want to incorporate this PR. Do you mind addressing the comments I had before? Thanks for the contribution!

jjyao avatar Jul 08 '25 01:07 jjyao

I'm rebasing this to resolve conflicts and it seems the big conflict is with PR #42760 (issue #42356) from early 2024, a little bit after this PR. Here is my understanding of things, please correct me if I'm wrong! (It's been a while since I've thought about Ray, I've moved roles, sorry :) )

  • Originally (as of 2023 before I wrote this PR), there were two compatibility checks in Ray. Between Ray nodes, the compatibility check is for (Ray version, Python major+minor+patch version) equality, with no opt out. Between Ray Client and a node, there was also a compatibility check for (Ray protocol version, Python major+minor version) equality, with an opt out (ray.connect(ignore_version=True), or the RAY_IGNORE_VERSION_MISMATCH environment variable being set to anything).
  • This PR (#41373) changes the check between Ray nodes to (Ray version, Python bytecode version) equality.
  • #42356/#42760 changed the Ray Client compatibility check to (Ray protocol version, Python major-minor version) equality, with a logger.warning if the Python patch version is different. The Ray protocol version is no longer used and no longer exists. (The opt out remains the same.)
  • The intention of the Ray node compatibility check is that we expect to cloudpickle Python code and transfer it between nodes.
  • However, Ray Client also cloudpickles Python code and sends it to a node, if I'm understanding the example right.

So, I think the two checks should actually do the same thing. As far as I can tell, #42760 originally tried to make the Ray Client check do major+minor+patch, but that changed during the PR to keep doing the major+minor check only.

Because of all that, I think that the right thing to do is to make both checks compare (Ray version, Python bytecode version). I'm going to push a rebased version of this PR that does that, let me know if that doesn't seem right.

geofft avatar Jul 08 '25 14:07 geofft

I realized a side benefit of this change: alternative Python implementations might unexpectedly pass the version compatibility check, but they will fail the bytecode compatibility check. PyPy uses magic numbers below 3000, currently in the 400s, and GraalPy uses magic numbers in the 21000s, to avoid the range that CPython is currently using. So if you somehow try to connect to a Ray node running CPython from a client running PyPy, this PR means you will now get an error when I think it might have previously worked.

geofft avatar Jul 08 '25 16:07 geofft

I just realized the usage stats are sent to the Ray team, not just used within the cluster. This PR adds python_magic_number to the collected stats because that's what is done with ray_version and python_version, and it seems harmless and maybe a little helpful. I hope that the stat collection endpoint can handle the new key, but if not I can rework this to only use python_magic_number within the cluster for the compatibility check and not submit it as part of the stats.

geofft avatar Jul 09 '25 19:07 geofft

@geofft let me know if you need any help on this.

jjyao avatar Jul 18 '25 16:07 jjyao

Hi @geofft

Just checking to see if you still bandwidth to work on this PR and get it merged :)

jjyao avatar Jul 30 '25 22:07 jjyao

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Aug 14 '25 00:08 github-actions[bot]

This pull request has been automatically closed because there has been no more activity in the 14 days since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

github-actions[bot] avatar Aug 28 '25 12:08 github-actions[bot]