Justin Yu issues

Results: 23 issues of Justin Yu

`run_example_debug` on Mac OS doesn't save videos to the right directory

In debug mode, videos are saved to `softlearning/videos` instead of under the user's home directory (`~/ray_results/...`). Saving works correctly with `run_example_local`.

[Tune] [PBT] Maintain consistent `Trial`/`TrialRunner` state when pausing and resuming trial

## Why are these changes needed? ### The problem - When running *synchronous PBT* while checkpointing every time a perturbation happens, the experiment can reach a state where trial A...

bug

tune

[Tune] [PBT] Add a flag to force an in-memory checkpoint to be used via `Trial.on_checkpoint`

## Why are these changes needed? This change is needed for the PBT algorithm to run correctly in the case where persistent checkpoints and in-memory checkpoints are both being saved....

tune

[Tune][CI] Skip zoopt invalid values test

Skips a zoopt searcher test that's causing the `test_searchers` suite in CI to be flaky. Skipping as this is not a Tune issue and needs to be fixed in the...

[CI] `linux://python/ray/tune:test_commands` is failing/flaky on master.

- ddfeae3c86b997e7a0bf1391f37d2831a2da3542 FAILED [Buildkite :octopus: Tune tests and examples (small)](https://buildkite.com/ray-project/oss-ci-build-branch/builds/1195#01849a2c-4c71-4d3e-ae06-33f0c90a61e8)
- 073e7bc04d989607848552537f9f5ac91fa07d85 FAILED [Buildkite :octopus: Tune tests and examples (small)](https://buildkite.com/ray-project/oss-ci-build-branch/builds/1194#01849a26-f367-434d-8c81-32cbe16ad16d)
- df76ac7975334a5fec7affcc910076ca435fb772 FAILED [Buildkite :octopus: Tune tests and examples (small)](https://buildkite.com/ray-project/oss-ci-build-branch/builds/1152#01848b8b-8cf0-4522-a079-fb14d0b365ec)...

tune

stale

flaky-tracker

[Templates] Unify the batch inference template with an existing Data example

This PR de-duplicates the batch inference template by making it the same as the existing pytorch gpu batch inference example. There still needs to be a copy due to relative...

tests-ok

[release/air] Upgrade release tests that depend on upgraded modin to py38

## Why are these changes needed? https://github.com/ray-project/ray/pull/30895 upgraded the pinned version of `modin`, removing support for python Closes https://github.com/ray-project/ray/issues/36299 ## Checks - [ ] I've signed off every commit(by using...

release-test

[release/air] `lightning_gpu_tune_3x16_3x1.aws` is flaky due to `LightningTrainer` not working with `PBT`

[Latest job link](https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_xstq7e5eqi59lzdx3u68xbprwu) This test is flaky because we only run 2 trials: most of the time the network architectures of both trials may be the same, so checkpoints can...

tune

train

release-test

[Tune] `bohb_example` patch fix

## Why are these changes needed? The `bohb_example` is super flaky after https://github.com/ray-project/ray/pull/35338 modified the example to run more trials. This is just a patch fix to deflake the example....

[release/air] Fix `air_example_gptj_deepspeed_fine_tuning.gce` failing to pull model from a public s3 bucket

## Why are these changes needed? This PR fixes the `air_example_gptj_deepspeed_fine_tuning.gce` release test. It was failing due to our GCE nodes not having an AWS credentials file. This is not...

release-test