ray
ray copied to clipboard
Offline dataset refactor
Why are these changes needed?
I ran into problems when using the Offline API
with the older JsonWriter
setup when trying to use the output_config
with a custom output directory in path
. I saw the recommended DatasetReader
and ran with this and immediately into an error as pyarrow
was not installed. So I added pyarrow
to the requirements_rllib.txt
to get automatically installed for a user when installing "ray[rllib]".
I further saw that DataWriter
could simply use pathlib
and work more safely with objects instead of strings.
There is no test for the DatasetWriter
- we should write one. So I ran the test_dataset_reader.py
. That had problems with _unzip_if_needed()
using relative paths. I fixed this again with using pathlib
and refactored some further lines in test_dataset_reader.py
.
IMO we should use for the complete code in DatasetReader
the pathlib
module. Take a string path from a user and convert it to a Path
object and only when returning a path it should be converted back to a string path again.
Related issue number
None
Checks
- [x] I've signed off every commit(by using the -s flag, i.e.,
git commit -s
) in this PR. - [x] I've run
scripts/format.sh
to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(