[AIR][Numpy] Add numpy narrow waist to `Preprocessor` and `BatchMapper`

Open jiaodong opened this issue 2 years ago • 1 comments

Why are these changes needed?

We need to add a numpy path in AIR to facilitate deep learning.

Internally we support Arrow and pandas as dataset formats, but the user-facing formats should only be pandas and numpy.

Therefore this PR also updates the internal dispatch logic for interop among the different combinations of data format and transform format.

Changes

  • Added _transform_numpy() to BatchMapper
  • Added _transform_numpy() to Preprocessor base class
  • Added batch_format field in BatchMapper to match map_batches behavior
  • Default Preprocessor and BatchMapper to batch_format="pandas"
  • Removed all _transform_arrow() related code so that only pandas and numpy are valid transformation types
  • In the numpy path, a multi-column Arrow / pandas table is converted to Dict[str, ndarray]
  • In the numpy path, a single-column Arrow / pandas table is converted to ndarray
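
A minimal sketch of the last two conversion rules above, assuming a pandas batch as input. The helper name `to_numpy_batch` is hypothetical and is not the function this PR adds; it only illustrates the single-column versus multi-column behavior.

```python
from typing import Dict, Union

import numpy as np
import pandas as pd


def to_numpy_batch(df: pd.DataFrame) -> Union[np.ndarray, Dict[str, np.ndarray]]:
    """Hypothetical helper: convert a pandas batch to a numpy batch."""
    if len(df.columns) == 1:
        # Single column: hand back the bare ndarray.
        return df.iloc[:, 0].to_numpy()
    # Multiple columns: one ndarray per column, keyed by column name.
    return {name: df[name].to_numpy() for name in df.columns}


single = to_numpy_batch(pd.DataFrame({"x": [1, 2, 3]}))
multi = to_numpy_batch(pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]}))
```

Here `single` is a plain ndarray, while `multi` is a dict with one ndarray per column.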

Related issue number

#28346, #28522, #28524

Closes #28523

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

jiaodong • Sep 10 '22 00:09

Chatted with @amogkam a bit offline; a few changes are needed:

  1. batch_format should be a BatchMapper-only field, and the base class Preprocessor should still have fallback paths that decide the transformation format based on the data type.

  2. We only need to add the numpy path to DL-related preprocessors; there is no strong need for most of the others yet. In the future we should expect some numpy-only preprocessors, some pandas-only preprocessors, and a few that implement both interfaces.
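
The fallback described in item 1 could look roughly like the following. This is a sketch under assumptions, not the PR's code: `SketchPreprocessor`, `transform_batch`, `_overrides`, and `ScaleByTwo` are hypothetical names, `_transform_pandas` is assumed to be the pandas counterpart of `_transform_numpy`, and only dict-of-ndarray numpy batches are handled.

```python
from typing import Dict, Union

import numpy as np
import pandas as pd

Batch = Union[pd.DataFrame, Dict[str, np.ndarray]]


class SketchPreprocessor:
    """Hypothetical stand-in for the base class (not Ray's actual code)."""

    def _transform_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def _transform_numpy(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        raise NotImplementedError

    def _overrides(self, name: str) -> bool:
        # True when a subclass provides its own version of the method.
        return getattr(type(self), name) is not getattr(SketchPreprocessor, name)

    def transform_batch(self, batch: Batch) -> Batch:
        has_pandas = self._overrides("_transform_pandas")
        has_numpy = self._overrides("_transform_numpy")
        if not (has_pandas or has_numpy):
            raise NotImplementedError("no transform method implemented")
        if isinstance(batch, pd.DataFrame):
            if has_pandas:
                return self._transform_pandas(batch)
            # Fallback: numpy-only preprocessor given a pandas batch.
            return self._transform_numpy(
                {col: batch[col].to_numpy() for col in batch.columns}
            )
        if has_numpy:
            return self._transform_numpy(batch)
        # Fallback: pandas-only preprocessor given a numpy batch.
        return self._transform_pandas(pd.DataFrame(batch))


class ScaleByTwo(SketchPreprocessor):
    """Example of a numpy-only preprocessor."""

    def _transform_numpy(self, batch):
        return {key: value * 2 for key, value in batch.items()}


out = ScaleByTwo().transform_batch(pd.DataFrame({"x": [1, 2]}))
```

In this sketch the input data type picks the preferred path, and the preprocessor only converts formats when it does not implement that path.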

jiaodong • Sep 20 '22 19:09

The failed tests are RLlib docs tests that are irrelevant to this PR.

gym.error.NameNotFound: The environment `Pong` has been moved out of Gym to the package `ale-py`. Please install the package via `pip install ale-py`. You can instantiate the new namespaced environment as `ALE/Pong`.

jiaodong • Sep 22 '22 15:09

The failed tests are due to a transient error on GitHub (the dask and horovod links on GitHub returned 404).

(ray-overview/ray-libraries: line   25) broken    https://github.com/dask/dask) - 404 Client Error: Not Found for url: https://github.com/dask/dask)
(ray-overview/ray-libraries: line  345) broken    https://github.com/explosion/spacy-ray) - 404 Client Error: Not Found for url: https://github.com/explosion/spacy-ray)
(ray-overview/ray-libraries: line    2) broken    https://github.com/facebookresearch/ClassyVision) - 404 Client Error: Not Found for url: https://github.com/facebookresearch/ClassyVision)
(ray-overview/ray-libraries: line   71) broken    https://github.com/horovod/horovod) - 404 Client Error: Not Found for url: https://github.com/horovod/horovod)

jiaodong • Sep 22 '22 22:09

The test failures are irrelevant; they are due to the gRPC upgrade in Ray Client, which this PR did not touch.

jiaodong • Sep 27 '22 16:09