Restore bf16 support for Neuron after PR #22300
This PR fixes the "RuntimeError: No CUDA GPUs are available" error that occurs when running with the --bf16 option on Neuron.
Related PRs: https://github.com/huggingface/transformers/pull/20684 https://github.com/huggingface/transformers/pull/22300
What does this PR do?
While PR #22300 restores the fp16 option on XLA GPU devices, it causes "RuntimeError: No CUDA GPUs are available" when running with the --bf16 option on Neuron. This PR fixes that error.
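For illustration only (a hypothetical helper, not the actual Trainer change): the failure mode is that a CUDA-only capability check such as torch.cuda.is_bf16_supported() is reached on a backend that has no CUDA device, so the mixed-precision dtype selection needs a device guard roughly like the sketch below.

```python
import torch

def select_mixed_precision_dtype(bf16: bool, fp16: bool, device_type: str):
    # Hypothetical helper, for illustration only: choose a mixed-precision
    # dtype without touching torch.cuda on non-CUDA backends (CPU, XLA/Neuron),
    # which is what raises "RuntimeError: No CUDA GPUs are available".
    if not (bf16 or fp16):
        return None
    if device_type == "cuda":
        # Only consult CUDA capability checks when a CUDA device is present.
        if bf16 and torch.cuda.is_bf16_supported():
            return torch.bfloat16
        return torch.float16
    # Non-CUDA backend: decide from the flags alone.
    return torch.bfloat16 if bf16 else torch.float16
```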
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests? (Manual test below)
export TASK_NAME=mrpc
python3 ./run_glue.py \
--model_name_or_path bert-large-uncased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--bf16 \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 5 \
--overwrite_output_dir \
--output_dir /tmp/$TASK_NAME/ |& tee log_run
***** train metrics *****
epoch = 5.0
train_loss = 0.2457
train_runtime = 0:10:05.68
train_samples = 3668
train_samples_per_second = 30.28
train_steps_per_second = 3.789
100%|██████████| 51/51 [00:03<00:00, 16.56it/s]
***** eval metrics *****
epoch = 5.0
eval_accuracy = 0.8554
eval_combined_score = 0.8762
eval_f1 = 0.897
eval_loss = 0.8809
eval_runtime = 0:00:14.30
eval_samples = 408
eval_samples_per_second = 28.524
eval_steps_per_second = 3.566
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sgugger @ymwangg @Lokiiiiii
This means no mixed precision at all will be used during training as this variable controls the autocast context manager.
@sgugger could you help point me to the autocast context manager? Is there a way to make it use PyTorch autocast instead of cuda.amp.autocast?
The autocast context manager is defined here.
As for your question on torch.autocast, we can't use it as it's only in very recent versions of PyTorch and we support PyTorch >= 1.9.
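To make the version constraint concrete, here is a hypothetical sketch (not the Trainer's actual code): the device-agnostic torch.autocast API only appeared in PyTorch 1.10, so code that must also run on 1.9 has to fall back to the CUDA-specific manager.

```python
import torch
from packaging import version

def get_autocast_context(dtype: torch.dtype, device_type: str = "cuda"):
    # Hypothetical helper for illustration: torch.autocast(device_type=..., dtype=...)
    # exists only from PyTorch 1.10 onward.
    if version.parse(torch.__version__) >= version.parse("1.10"):
        return torch.autocast(device_type=device_type, dtype=dtype)
    # PyTorch 1.9 only offers the CUDA-specific manager, which autocasts to fp16.
    return torch.cuda.amp.autocast()
```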
Ok. Thanks @sgugger. Please see my revised PR. It does resolve the runtime error while keeping the autocast functionality.
Mmm we cannot patch torch like this in Transformers as it's too magical and might lead to hard-to-debug issues for the users.
Thanks. Please take a look at the new revision. I switched to cpu_amp.
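(For readers unfamiliar with the term: "cpu_amp" refers to the CPU autocast manager, roughly as in this illustrative snippet, which is not taken from the PR itself.)

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

# Illustrative only: CPU-side autocast (PyTorch >= 1.10) casts eligible ops
# to bfloat16 without going through torch.cuda.
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```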
@sgugger looks like using cpu_amp did not yield the expected result: the generated XLA/HLO graphs still all have fp32 ports, so the bf16 flag effectively has no effect. The only way I can get it to work is to use gpu_amp with the override "torch.cuda.is_bf16_supported = lambda: True", which is limited to Neuron (if is_torch_neuroncore_available) and thus uses the torch_neuronx package rather than torch.cuda anyway, so it is safe. Let me know if this is still acceptable, and I will resubmit a revision.
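For context, a rough sketch of the override described above (illustrative only, and ultimately not merged); it assumes is_torch_neuroncore_available is importable from transformers.utils as in the related Neuron PRs.

```python
import torch
from transformers.utils import is_torch_neuroncore_available

# Sketch of the override described above (not merged): on Neuron, execution goes
# through torch_neuronx/XLA rather than torch.cuda, so the CUDA bf16 capability
# check is forced to pass and the GPU AMP autocast path can emit bfloat16 ops
# into the XLA/HLO graph.
if is_torch_neuroncore_available():
    torch.cuda.is_bf16_supported = lambda: True
```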
I don't understand why it is necessary to patch torch.cuda for something you are telling me will not use torch.cuda anyway. Looks like there are some specific neuroncore tests that are necessary to fix the issue, but as I said before, patching torch.cuda is too magical to be accepted in Transformers. The only patches to other modules we accept are those done briefly inside a context manager.
By "not using torch.cuda anyways" I meant we use the GPU AMP feature to autocast to bfloat16, but once that's done, the rest is executed on Neuron. I will keep debugging, but the CPU AMP feature is not working well with pytorch XLA.
@sgugger I have posted a revert here: https://github.com/huggingface/transformers/pull/22451. Apologies for the extra work.