
Restore bf16 support for Neuron after PR #22300

Open · jeffhataws opened this pull request

This PR fixes the "RuntimeError: No CUDA GPUs are available" error when running with the --bf16 option on Neuron.

Related PRs: https://github.com/huggingface/transformers/pull/20684 https://github.com/huggingface/transformers/pull/22300

What does this PR do?

While PR #22300 restores the fp16 option on XLA GPU devices, it causes a "RuntimeError: No CUDA GPUs are available" when running with the --bf16 option on Neuron. This PR fixes that error.
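For illustration only, a minimal sketch of the failure mode and the kind of Neuron-aware guard being discussed; the function names and the exact probe are assumptions, not the actual Trainer code:

import torch

def bf16_ready_naive() -> bool:
    # On a Neuron instance there is no CUDA device, so a probe that goes
    # through torch.cuda can raise "RuntimeError: No CUDA GPUs are available".
    return torch.cuda.is_bf16_supported()

def bf16_ready_guarded(running_on_neuron: bool) -> bool:
    # Hypothetical guard: on Neuron, bf16 is handled by torch_neuronx/XLA,
    # so skip the torch.cuda query entirely.
    if running_on_neuron:
        return True
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()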

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [X] Did you read the contributor guideline, Pull Request section?
  • [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests? (Manual test below)

export TASK_NAME=mrpc
python3 ./run_glue.py \
--model_name_or_path bert-large-uncased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--bf16 \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 5 \
--overwrite_output_dir \
--output_dir /tmp/$TASK_NAME/ |& tee log_run

***** train metrics *****
  epoch                    =        5.0
  train_loss               =     0.2457
  train_runtime            = 0:10:05.68
  train_samples            =       3668
  train_samples_per_second =      30.28
  train_steps_per_second   =      3.789
100%|██████████| 51/51 [00:03<00:00, 16.56it/s]
***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.8554
  eval_combined_score     =     0.8762
  eval_f1                 =      0.897
  eval_loss               =     0.8809
  eval_runtime            = 0:00:14.30
  eval_samples            =        408
  eval_samples_per_second =     28.524
  eval_steps_per_second   =      3.566

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@sgugger @ymwangg @Lokiiiiii

jeffhataws · Mar 22 '23 04:03


> This means no mixed precision at all will be used during training as this variable controls the autocast context manager.

@sgugger could you help point me to the autocast context manager? Is there a way to make it use PyTorch autocast instead of cuda.amp.autocast?

jeffhataws · Mar 22 '23 17:03

The autocast context manager is defined here.

As for your question on torch.autocast, we can't use it as it's only in very recent versions of PyTorch and we support PyTorch >= 1.9.

sgugger · Mar 22 '23 17:03
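As background for readers, a hedged sketch of the two context managers under discussion; the version numbers are from memory, and the example is illustrative rather than the Trainer's actual code path:

import torch

x = torch.randn(8, 16)

# CUDA-specific autocast: this path assumes a usable CUDA device,
# which is what breaks on Neuron with --bf16.
if torch.cuda.is_available():
    cuda_layer = torch.nn.Linear(16, 4).cuda()
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        out = cuda_layer(x.cuda())

# Device-agnostic autocast: device_type avoids going through torch.cuda,
# but this API only landed in later PyTorch releases (around 1.10), which
# is why it was ruled out while PyTorch >= 1.9 still had to be supported.
cpu_layer = torch.nn.Linear(16, 4)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = cpu_layer(x)
print(out.dtype)  # torch.bfloat16 under CPU autocast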

> The autocast context manager is defined here.
>
> As for your question on torch.autocast, we can't use it as it's only in very recent versions of PyTorch and we support PyTorch >= 1.9.

OK, thanks @sgugger. Please see my revised PR; it resolves the runtime error while keeping the autocast functionality.

jeffhataws · Mar 23 '23 04:03

Mmm, we cannot patch torch like this in Transformers as it's too magical and might lead to hard-to-debug issues for the users.

sgugger · Mar 23 '23 12:03

> Mmm, we cannot patch torch like this in Transformers as it's too magical and might lead to hard-to-debug issues for the users.

Thanks. Please take a look at the new revision. I switched to cpu_amp.

jeffhataws · Mar 23 '23 15:03

> Mmm, we cannot patch torch like this in Transformers as it's too magical and might lead to hard-to-debug issues for the users.

@sgugger it looks like using cpu_amp did not yield the expected result: the generated XLA/HLO graphs still have fp32 ports, so the bf16 flag effectively has no effect. The only way I can get it to work is to use gpu_amp with the override "torch.cuda.is_bf16_supported = lambda: True", limited to Neuron (gated on is_torch_neuroncore_available). Since that path uses the torch_neuronx package and does not use torch.cuda anyway, the override is safe. Let me know if this is still acceptable, and I will resubmit a revision.

jeffhataws · Mar 24 '23 21:03
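Roughly what the workaround described above looks like as code; this is a sketch of the described approach, not the final patch, and running_on_neuron stands in for the is_torch_neuroncore_available() check mentioned in the comment:

import torch

# Sketch of the described workaround: on Neuron there is no real CUDA device,
# so the bf16 capability probe is overridden to let the GPU AMP autocast path
# emit bf16 casts into the XLA/HLO graph.
running_on_neuron = True  # placeholder for is_torch_neuroncore_available()

if running_on_neuron:
    torch.cuda.is_bf16_supported = lambda: True  # module-level monkey-patch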

I don't understand why it is necessary to patch torch.cuda for something you are telling me will not use torch.cuda anyway. It looks like there are some specific neuroncore tests that are necessary to fix the issue, but as I said before, patching torch.cuda is too magical to be accepted in Transformers. The only patches to other modules we accept are those done briefly inside a context manager.

sgugger · Mar 27 '23 13:03
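For contrast, a purely illustrative sketch of the narrowly scoped kind of patch described above, where the override lives only inside a context manager and torch.cuda is restored as soon as the block exits:

import contextlib
import torch

@contextlib.contextmanager
def pretend_cuda_bf16_supported():
    # Temporarily report bf16 as supported, then restore the original probe.
    original = torch.cuda.is_bf16_supported
    torch.cuda.is_bf16_supported = lambda: True
    try:
        yield
    finally:
        torch.cuda.is_bf16_supported = original

# The override is visible only inside the with-block.
with pretend_cuda_bf16_supported():
    assert torch.cuda.is_bf16_supported()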

> I don't understand why it is necessary to patch torch.cuda for something you are telling me will not use torch.cuda anyway. It looks like there are some specific neuroncore tests that are necessary to fix the issue, but as I said before, patching torch.cuda is too magical to be accepted in Transformers. The only patches to other modules we accept are those done briefly inside a context manager.

By "not using torch.cuda anyways" I meant we use the GPU AMP feature to autocast to bfloat16, but once that's done, the rest is executed on Neuron. I will keep debugging, but the CPU AMP feature is not working well with pytorch XLA.

jeffhataws · Mar 27 '23 18:03

@sgugger I have posted a revert here: https://github.com/huggingface/transformers/pull/22451. Apologies for the extra work.

jeffhataws · Mar 29 '23 16:03