[AIR][Numpy] Add numpy narrow waist to `Preprocessor` and `BatchMapper`
Why are these changes needed?
We need to add a numpy path in AIR to facilitate deep learning.
Internally we support arrow / pandas as dataset formats, but the user-facing formats should only be pandas / numpy.
Therefore this PR also updates the internal dispatch logic for interop among the different combinations of data format and transform format.
Changes
- Added _transform_numpy() to BatchMapper
- Added _transform_numpy() to Preprocessor base class
- Added `batch_format` field in `BatchMapper` to match `map_batches` behavior
- Default `Preprocessor` and `BatchMapper` to `batch_format="pandas"`
- Removed all _transform_arrow() related code such that only pandas & numpy are valid transformation types
- For a multi-column arrow / pandas table, the numpy path transforms it into Dict[str, ndarray]
- For a single-column arrow / pandas table, the numpy path transforms it into an ndarray
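The last two bullets can be illustrated with a minimal sketch. `to_numpy_batch` is a hypothetical helper written for this example, not the function this PR adds, and it only covers the pandas case; it assumes the conversion rule is exactly as described above.

```python
from typing import Dict, Union

import numpy as np
import pandas as pd


def to_numpy_batch(df: pd.DataFrame) -> Union[np.ndarray, Dict[str, np.ndarray]]:
    """Hypothetical helper: convert a pandas batch to the numpy batch format."""
    if len(df.columns) == 1:
        # Single column: return the bare ndarray for that column.
        return df.iloc[:, 0].to_numpy()
    # Multiple columns: one ndarray per column, keyed by column name.
    return {str(name): df[name].to_numpy() for name in df.columns}


single = to_numpy_batch(pd.DataFrame({"x": [1, 2, 3]}))
multi = to_numpy_batch(pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]}))
```

Here `single` is a plain ndarray, while `multi` is a dict with the keys `"x"` and `"y"`, each mapping to an ndarray.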
Related issue number
#28346, #28522, #28524
Closes #28523
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
So I chatted with @amogkam a bit offline; there are a few changes we need:
- `batch_format` should be a `BatchMapper`-only thing, and the base class `Preprocessor` should still have fallback paths that decide the transformation format based on the data type.
- We only need to add the numpy path to DL-related preprocessors; there is no strong need for the majority of the other ones yet. In the future we should expect to see some numpy-only preprocessors, some pandas-only preprocessors, and a few that implement both interfaces.
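A rough sketch of what such a fallback could look like, under stated assumptions: `SketchPreprocessor`, `_implements`, `transform_batch`, and the `"value"` column name are all hypothetical and are not Ray's actual `Preprocessor` code. The idea is that the base class picks the transform that matches the incoming data type and converts the batch only when the subclass implements just the other one.

```python
from typing import Dict, Union

import numpy as np
import pandas as pd

NumpyBatch = Union[np.ndarray, Dict[str, np.ndarray]]


class SketchPreprocessor:
    """Hypothetical base class; not Ray's actual Preprocessor."""

    def _transform_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def _transform_numpy(self, batch: NumpyBatch) -> NumpyBatch:
        raise NotImplementedError

    def _implements(self, name: str) -> bool:
        # True when a subclass overrides the named method.
        return getattr(type(self), name) is not getattr(SketchPreprocessor, name)

    def transform_batch(self, batch):
        has_pandas = self._implements("_transform_pandas")
        has_numpy = self._implements("_transform_numpy")
        if isinstance(batch, pd.DataFrame):
            if has_pandas:
                return self._transform_pandas(batch)
            if has_numpy:
                # Fallback: convert the frame to a dict of ndarrays.
                as_dict = {c: batch[c].to_numpy() for c in batch.columns}
                return self._transform_numpy(as_dict)
        else:
            if has_numpy:
                return self._transform_numpy(batch)
            if has_pandas:
                # Fallback: wrap a bare ndarray in an assumed "value" column.
                data = batch if isinstance(batch, dict) else {"value": batch}
                return self._transform_pandas(pd.DataFrame(data))
        raise NotImplementedError("Subclass must implement a transform method.")


class Doubler(SketchPreprocessor):
    """Numpy-only example preprocessor."""

    def _transform_numpy(self, batch):
        if isinstance(batch, dict):
            return {k: v * 2 for k, v in batch.items()}
        return batch * 2


from_array = Doubler().transform_batch(np.array([1, 2]))
from_frame = Doubler().transform_batch(pd.DataFrame({"a": [1, 2]}))
```

In this sketch a numpy-only preprocessor still accepts a pandas batch: the base class converts it, so `from_frame` comes back as a dict of ndarrays rather than a DataFrame.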
The failed tests are RLlib docs tests that are irrelevant to this PR.
gym.error.NameNotFound: The environment `Pong` has been moved out of Gym to the package `ale-py`. Please install the package via `pip install ale-py`. You can instantiate the new namespaced environment as `ALE/Pong`.
The failed tests are due to a transient error on GitHub (dask and horovod on GitHub returned 404):
(ray-overview/ray-libraries: line 25) broken https://github.com/dask/dask) - 404 Client Error: Not Found for url: https://github.com/dask/dask)
(ray-overview/ray-libraries: line 345) broken https://github.com/explosion/spacy-ray) - 404 Client Error: Not Found for url: https://github.com/explosion/spacy-ray)
(ray-overview/ray-libraries: line 2) broken https://github.com/facebookresearch/ClassyVision) - 404 Client Error: Not Found for url: https://github.com/facebookresearch/ClassyVision)
(ray-overview/ray-libraries: line 71) broken https://github.com/horovod/horovod) - 404 Client Error: Not Found for url: https://github.com/horovod/horovod)
The test failures are irrelevant; they are due to the gRPC upgrade on Ray Client, which this PR did not touch.