ray icon indicating copy to clipboard operation
ray copied to clipboard

[Core] ray.remote raises ValueError when used on torch IterableDataset

Open kevin85421 opened this issue 1 year ago • 5 comments

Why are these changes needed?

  • If a class has __class_getitems__ which should return a GenericAlias object, the inspect.signature(...) will throw an exception said that ValueError: no signature found for builtin type <class 'types.GenericAlias'>.

    • https://docs.python.org/3/reference/datamodel.html#class-getitem-versus-getitem

      class_getitem() should return a GenericAlias object if it is properly defined.

    • Note that __class_getitems__ is not necessary to return a GenericAlias.
  • https://docs.python.org/3.10/library/inspect.html#inspect.Signature

    Note Some callables may not be introspectable in certain implementations of Python. For example, in CPython, some built-in functions defined in C provide no metadata about their arguments.

Why can this PR avoid the ValueError?

from types import GenericAlias

@ray.remote
class MyClass:
    __class_getitem__ = classmethod(GenericAlias)

# This would allow MyClass[int] to behave similarly to specifying a generic type
myClass = MyClass.remote()
obj_ref = myClass.__class_getitem__.remote(int)
print(ray.get(obj_ref))
  • Without this PR, the type of the __class_getitem__ is method.

    • _signature_from_callable
      • isinstance(obj, types.MethodType) -> True (code)
        • _get_signature_of (i.e. partial _signature_from_callable)
          • isinstance(obj, types.MethodType) -> False (code) (type(obj) -> types.GenericAlias)
          • unwrap (code)
          • No obj.__signature__ or obj._partialmethod (code, code)
          • isfunction(obj) or _signature_is_functionlike(obj) => False (code)
          • _signature_is_builtin(obj) => False (code)
          • isinstance(obj, functools.partial) => False (code)
          • isinstance(obj, type) => True (code)
            • There is no __call__ or __new__ or __init__.
            • Throw an ValueError. (code)
  • With this PR, the type of the __class_getitem__ is method-wrapper.

    • _signature_from_callable
      • isinstance(obj, types.MethodType) -> False (code)
      • unwrap (code) => type(obj) is still method-wrapper.
      • No obj.__signature__ or obj._partialmethod (code, code)
      • isfunction(obj) or _signature_is_functionlike(obj) => False (code)
      • _signature_is_builtin(obj) => True (code) because method-wrapper is one of _NonUserDefinedCallables.

Why method = method.__init__?

Honestly, I am not 100% sure whether it will cause some potential issues or not, but

  • https://github.com/ets-labs/python-dependency-injector/commit/8cc2c1188bc92f0eeaaa212641ae659723009f26 has been merged for 3 years and has never been reverted, indicating that this solution works well.
  • The manual test below works well.
  • If any issues arise, users can define __getitem__ themselves as a workaround.

Useful links

  • https://github.com/ets-labs/python-dependency-injector/issues/398
  • https://github.com/ets-labs/python-dependency-injector/commit/8cc2c1188bc92f0eeaaa212641ae659723009f26
  • https://github.com/ets-labs/python-dependency-injector/issues/362
  • https://github.com/ets-labs/python-dependency-injector/commit/4991c5d4b0d6230a395ac90ba675502296c50619

Related issue number

Closes #42914

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(
from torch.utils.data import IterableDataset
import ray

@ray.remote
class SyncDataCollector(IterableDataset):
    def __iter__(self):
        return iter(range(3, 10))

ray.init()
sdc = SyncDataCollector.remote()
obj_ref = sdc.__iter__.remote()
it = ray.get(obj_ref)
for i in it:
    print(i)
# 3 4 5 6 7 8 9
  • IterableDataset inherits from Iterable, which defines __class_getitem__ = classmethod(GenericAlias). (code)

kevin85421 avatar Feb 12 '24 22:02 kevin85421

IIUC, this is a python 3.9 bug. Do you know if it's fixed in 3.10+?

No, Python 3.10 still has the issue. I am not sure whether to call it a bug or a limitation. The official documentation for inspect.signature already acknowledges that "Note Some callables may not be introspectable in certain implementations of Python. For example, in CPython, some built-in functions defined in C provide no metadata about their arguments".

kevin85421 avatar Feb 15 '24 22:02 kevin85421

The premerge fails on test_classmethod_genericalias and test_placement_group_4.py. I tried to reproduce locally with Python 3.8 and Python 3.10, but I can't.

  • Python 3.10 Screen Shot 2024-02-15 at 7 53 29 PM

  • Python 3.8 Screen Shot 2024-02-15 at 5 49 13 PM

kevin85421 avatar Feb 16 '24 04:02 kevin85421

Python 3.9.0 still can't reproduce the issue.

Screen Shot 2024-02-15 at 9 41 48 PM

kevin85421 avatar Feb 16 '24 05:02 kevin85421

I installed the Python 3.9 artifact on the Buildkite CI, and then replaced the Ray Python scripts with my local ones. I still can't reproduce the CI error.

Screen Shot 2024-02-16 at 4 45 47 PM

kevin85421 avatar Feb 17 '24 00:02 kevin85421

I used https://github.com/ray-project/ray/pull/43117/commits/5186b34951fb317beea59db55997e408d1184e76 to verify that the CI has actually entered the following if statement as expected. The repro CI currently doesn't support pre-merge. I have already spent a lot of time trying to reproduce this strange issue. I will reach out to the CI team to discuss how to reproduce it.

if GenericAlias and any(
    (
        method is GenericAlias,
        getattr(method, "__func__", None) is GenericAlias,
    )
):
    method = method.__init__

kevin85421 avatar Feb 18 '24 00:02 kevin85421

Is this basically a bug from inspect module?

Yes. See https://github.com/ray-project/ray/pull/43117#issuecomment-1947459842 for more details.

Do you know what's the timeline for the fix in Python level? (or is it fixed in the latest versions)?

No, I don't think there is a timeline for that. It has lasted for more than 3 years.

Also this method = method.init is only applied to class_getitme right? Other methods are not affected

I believe so because I can still get 3 4 5 6 7 8 9 in the following example.

from torch.utils.data import IterableDataset
import ray

@ray.remote
class SyncDataCollector(IterableDataset):
    def __iter__(self):
        return iter(range(3, 10))

ray.init()
sdc = SyncDataCollector.remote()
obj_ref = sdc.__iter__.remote()
it = ray.get(obj_ref)
for i in it:
    print(i)
# 3 4 5 6 7 8 9

kevin85421 avatar Feb 20 '24 23:02 kevin85421

Btw, I am very comfortable merging this PR. I feel a bit hacky that we are monkey patching the method. The CI failure also makes me feel like this is not a good direction.

Is this reasonable we just ask for the workaround (creating an instance inside an actor instead of directly adding .remote)? Wdyt @kevin85421 ? The issue was also originally p2

rkooo567 avatar Mar 01 '24 22:03 rkooo567

Is this reasonable we just ask for the workaround (creating an instance inside an actor instead of directly adding .remote)?

Do you mean something like below? If so, it makes sense to me.

@ray.remote
class SyncDataCollector:
    def __init__(self):
        self.xxxx = xxxxx # something inherited from IterableDataset.
    def __iter__(self):
        # call self.xxxx

kevin85421 avatar Mar 01 '24 22:03 kevin85421

yes. I prefer to ask users to workaround than adding dogpatches unless the issue is widely exists (and critical). We've actually suffered from some dogpatches we made in the past multiple times and had to revert at the end. And the fact that Python doesn't try fix this issue also makes me a little concern (there could be a reason why it is not fixed in their layer like there's an anti pattern) though for this part, we can do a bit more sesearch.

rkooo567 avatar Mar 03 '24 02:03 rkooo567