[release/air] Fix `air_example_gptj_deepspeed_fine_tuning.gce` failing to pull model from a public s3 bucket

Open justinvyu opened this issue 2 years ago • 3 comments

Why are these changes needed?

This PR fixes the air_example_gptj_deepspeed_fine_tuning.gce release test, which was failing because our GCE nodes do not have an AWS credentials file. Credentials are not needed because the s3 bucket is public, so we pass the --no-sign-request flag to use the AWS CLI as an anonymous user. This PR also removes the --quiet flag: the cell is not shown to users anyway, and the unsuppressed output will help us catch AWS CLI errors in the future.
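
For illustration, here is a minimal sketch of how an anonymous download command could be assembled. It is not the exact cell changed in this PR: the helper name `build_s3_sync_command`, the bucket URI, and the local path are hypothetical, and the choice of `aws s3 sync` is an assumption. Only the `--no-sign-request` flag and the dropped `--quiet` flag come from the description above.

```python
import shlex
from typing import List


def build_s3_sync_command(source: str, dest: str, anonymous: bool = True) -> List[str]:
    """Build an `aws s3 sync` invocation as an argument list (hypothetical helper).

    With anonymous=True the command includes --no-sign-request, which tells
    the AWS CLI not to load or sign with credentials. This only works for
    publicly readable buckets.
    """
    cmd = ["aws", "s3", "sync"]
    if anonymous:
        cmd.append("--no-sign-request")
    # --quiet is deliberately left out so that any AWS CLI error
    # shows up in the release test logs.
    cmd += [source, dest]
    return cmd


# Hypothetical bucket and destination, for illustration only.
command = build_s3_sync_command("s3://example-public-bucket/model/", "/tmp/model/")
print(shlex.join(command))
```

The argument list can be passed to `subprocess.run(command, check=True)` on a node with the AWS CLI installed; `check=True` raises on a non-zero exit code, so a failed download is not silently ignored.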

Related issue number

Closes https://github.com/ray-project/ray/issues/36274

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

justinvyu avatar Jun 09 '23 19:06 justinvyu

Running release test: https://buildkite.com/ray-project/release-tests-pr/builds/41810

justinvyu avatar Jun 09 '23 21:06 justinvyu

GCE: 55 minutes (training time) + 6 minutes (model download time) = 61 minutes (> 1 hour)

Training finished iteration 85 at 2023-06-09 18:01:20. Total running time: 55min 9s
╭────────────────────────────────────╮
│ Training result                    │
├────────────────────────────────────┤
│ time_this_iter_s           104.014 │
│ time_total_s               3301.77 │
│ training_iteration              85 │
│ epoch                            1 │
│ learning_rate                    0 │
│ loss                        0.0715 │
│ step                            85 │
│ train_loss                 0.32492 │
│ train_runtime              3058.86 │
│ train_samples_per_second     0.441 │
│ train_steps_per_second       0.028 │
╰────────────────────────────────────╯

AWS: ~45 minutes (training time) + 5 minutes (model download time) = 50 minutes (~10 minutes faster in total)

+---------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------+
| Trial name                      | status     | loc              |   iter |   total time (s) |   loss |   learning_rate |   epoch |
|---------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------|
| TransformersTrainer_85e6a_00000 | TERMINATED | 10.0.59.140:4477 |     85 |          2708.04 | 0.0715 |     4.70588e-07 |       1 |
+---------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------+
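
As a quick check on the gap, the sketch below compares the two reported totals (`time_total_s` from the GCE result and `total time (s)` from the AWS trial table). The variable names are illustrative, and it assumes the two fields measure the same thing on both platforms.

```python
# Totals copied from the two runs above.
gce_total_s = 3301.77  # time_total_s, GCE run
aws_total_s = 2708.04  # total time (s), AWS run

gap_s = gce_total_s - aws_total_s
gap_min = gap_s / 60
slowdown_pct = 100 * gap_s / aws_total_s

print(f"GCE is slower by {gap_s:.2f} s ({gap_min:.1f} min, {slowdown_pct:.1f}%)")

# End-to-end estimate using the rounded figures quoted above (minutes).
gce_end_to_end_min = 55 + 6  # training + model download
aws_end_to_end_min = 45 + 5
print(f"End to end: GCE {gce_end_to_end_min} min vs AWS {aws_end_to_end_min} min")
```

With the same 85 iterations on both platforms, the difference amounts to roughly 7 seconds per iteration.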

justinvyu avatar Jun 12 '23 05:06 justinvyu

@krfricke The GCE version is taking about 10 minutes longer for some reason. The instances used on GCE and AWS are slightly different, which could be the reason, but I'm wondering if there's another underlying problem here.

Let's increase the timeout as you suggest, and then we can investigate afterward why the GCE run takes 10 minutes longer.

New release test runs: https://buildkite.com/ray-project/release-tests-pr/builds/42246#0188bbb4-5953-4b12-914c-f760a784436b

justinvyu avatar Jun 14 '23 20:06 justinvyu

The tests are now failing due to a new HF datasets release. This PR now also includes a fix for that.

See release tests running here: https://buildkite.com/ray-project/release-tests-pr/builds/42420

justinvyu avatar Jun 15 '23 23:06 justinvyu

Latest successful run of release tests: https://buildkite.com/ray-project/release-tests-pr/builds/42460

justinvyu avatar Jun 16 '23 17:06 justinvyu