ray icon indicating copy to clipboard operation
ray copied to clipboard

Offline dataset refactor

Open simonsays1980 opened this issue 2 years ago • 0 comments

Why are these changes needed?

I ran into problems when using the Offline API with the older JsonWriter setup when trying to use the output_config with a custom output directory in path. I saw the recommended DatasetReader and ran with this and immediately into an error as pyarrow was not installed. So I added pyarrow to the requirements_rllib.txt to get automatically installed for a user when installing "ray[rllib]".

I further saw that DataWriter could simply use pathlib and work more safely with objects instead of strings.

There is no test for the DatasetWriter - we should write one. So I ran the test_dataset_reader.py. That had problems with _unzip_if_needed() using relative paths. I fixed this again with using pathlib and refactored some further lines in test_dataset_reader.py.

IMO we should use for the complete code in DatasetReader the pathlib module. Take a string path from a user and convert it to a Path object and only when returning a path it should be converted back to a string path again.

Related issue number

None

Checks

  • [x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

simonsays1980 avatar Nov 27 '22 20:11 simonsays1980