[release/air] Fix `air_example_gptj_deepspeed_fine_tuning.gce` failing to pull the model from a public S3 bucket
Why are these changes needed?
This PR fixes the `air_example_gptj_deepspeed_fine_tuning.gce` release test, which was failing because our GCE nodes do not have an AWS credentials file. Credentials are not actually needed since the S3 bucket is public, so we pass the `--no-sign-request` flag to run the AWS CLI as an anonymous user. This PR also removes the `--quiet` flag: the cell is not shown to users anyway, and the extra output will help us catch AWS CLI errors in the future.
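For context, the download cell in the example invokes the AWS CLI. Below is a minimal sketch of the anonymous-download approach; the bucket prefix and local destination are illustrative placeholders, not the actual paths used in the example notebook:

```python
import subprocess

# Sync the model weights from a public S3 bucket without AWS credentials.
# --no-sign-request makes the AWS CLI issue unsigned (anonymous) requests,
# so no credentials file is required on the GCE nodes.
# The bucket prefix and local path below are illustrative placeholders.
subprocess.run(
    [
        "aws", "s3", "sync",
        "--no-sign-request",
        "s3://some-public-bucket/gptj-model/",
        "/mnt/local_storage/gptj-model/",
    ],
    check=True,  # surface AWS CLI failures instead of silencing them with --quiet
)
```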
Related issue number
Closes https://github.com/ray-project/ray/issues/36274
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Running release test: https://buildkite.com/ray-project/release-tests-pr/builds/41810
GCE: 55 minutes (training time) + 6 minutes (model download time) = just over 1 hour in total
Training finished iteration 85 at 2023-06-09 18:01:20. Total running time: 55min 9s
╭────────────────────────────────────╮
│ Training result │
├────────────────────────────────────┤
│ time_this_iter_s 104.014 │
│ time_total_s 3301.77 │
│ training_iteration 85 │
│ epoch 1 │
│ learning_rate 0 │
│ loss 0.0715 │
│ step 85 │
│ train_loss 0.32492 │
│ train_runtime 3058.86 │
│ train_samples_per_second 0.441 │
│ train_steps_per_second 0.028 │
╰────────────────────────────────────╯
AWS: ~45 minutes (training time) + 5 minutes (model download time) = 50 minutes (~10 minutes faster in total)
+---------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------+
| Trial name | status | loc | iter | total time (s) | loss | learning_rate | epoch |
|---------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------|
| TransformersTrainer_85e6a_00000 | TERMINATED | 10.0.59.140:4477 | 85 | 2708.04 | 0.0715 | 4.70588e-07 | 1 |
+---------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------+
@krfricke The GCE version is taking about 10 minutes longer for some reason. The instance types used on GCE and AWS are slightly different, which could be the reason, but I'm wondering if there's another underlying problem here.
Let's increase the timeout as you suggest, and then we can investigate afterwards why the GCE run takes ~10 minutes longer.
New release test runs: https://buildkite.com/ray-project/release-tests-pr/builds/42246#0188bbb4-5953-4b12-914c-f760a784436b
The tests are now failing due to a new HF datasets release. This PR now also includes a fix for that.
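For illustration only (the exact constraint used in this PR may differ), pinning `datasets` to a known-good version in the example's setup step is the kind of change that guards against a breaking upstream release:

```python
import subprocess
import sys

# Illustrative sketch: pin the Hugging Face `datasets` package to a version
# that predates the breaking release. The specific version here is an
# assumption, not necessarily the one used in this PR.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "datasets==2.12.0"],
    check=True,
)
```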
See release tests running here: https://buildkite.com/ray-project/release-tests-pr/builds/42420
Latest successful run of release tests: https://buildkite.com/ray-project/release-tests-pr/builds/42460