Code crashes without errors when importing Trainer in TPU context
System Info
I'm working on Kaggle with TPU enabled (TPU VM v3-8). Running !transformers-cli env returns:
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:642] File already exists in database: tsl/profiler/protobuf/trace_events.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a80dd030fcf,5ab82e3a7b8f&map=
*** SIGABRT received by PID 367 (TID 367) on cpu 95 from PID 367; stack trace: ***
PC: @ 0x7a80dd07fd3c  (unknown)  (unknown)
    @ 0x7a7f654bba19  928  (unknown)
    @ 0x7a80dd030fd0  (unknown)  (unknown)
    @ 0x5ab82e3a7b90  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a7f654bba18,7a80dd030fcf,5ab82e3a7b8f&map=310b7ae7682f84c5c576a0b0030121f2:7a7f56a00000-7a7f656d11c0
E0119 15:49:22.169993  367 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 15:49:22.170011  367 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 15:49:22.170016  367 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 15:49:22.170041  367 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 15:49:22.170050  367 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 15:50:17.482782  367 process_state.cc:808] RAW: Raising signal 6 with default behavior
Aborted (core dumped)
Importing and printing the versions manually:
import torch_xla
print(torch_xla.__version__)
2.1.0+libtpu
import torch
print(torch.__version__)
2.1.0+cu121
import transformers
print(transformers.__version__)
4.36.2
Who can help?
@muellerzr @stevhliu
I have been trying to port my code to TPU, but cannot manage to import the libraries.
In my code (written in PyTorch) I use the transformers library to load some pretrained LLMs, and I subclassed the Trainer class to train some custom models with RL.
The code works perfectly fine on GPU, but I can't get it to work on TPU: it keeps crashing without returning any error. Documentation on how to use TPUs with a torch backend in the transformers library is still missing (two years after the page was created: https://huggingface.co/docs/transformers/v4.21.3/en/perf_train_tpu), so I have no idea whether I skipped a necessary step.
While the transformers library itself imports without problems, the whole session crashes when I try to import the Trainer class.
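For context, the subclass itself is nothing exotic. A minimal sketch of the pattern (the class name and loss logic here are placeholders, not my actual RL code):

from transformers import Trainer

class RLTrainer(Trainer):
    # Placeholder override; the real reward-based objective is omitted
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs.loss  # stand-in for the custom RL loss
        return (loss, outputs) if return_outputs else loss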
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
import torch_xla
print(torch_xla.__version__)
import torch
print(torch.__version__)
import transformers
print(transformers.__version__)
from transformers import Trainer
Output:
2.1.0+libtpu
2.1.0+cu121
4.36.2
(session crashes without output)
Expected behavior
It should either import the library or throw an error, not crash the whole session without a hint.
I would like to work on this
I have the same problem.
Having the same Kaggle issue.
Gentle ping @muellerzr
The torch_xla team is aware of this and is working towards fixing it.
@muellerzr is there a PR or Issue we can track and link here?
Having the same issue on Kaggle, any update?
@muellerzr In case it may help: when I import Trainer or SFTTrainer in the VM, no error is printed, but when I launch the script that contains the import on the TPU with accelerate launch or notebook_launcher, I get this error message:
ERROR: Unknown command line flag 'xla_latency_hiding_scheduler_rerun'
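To be concrete, a minimal sketch of how the import gets triggered under notebook_launcher (the function body is a hypothetical stand-in for my script; num_processes=8 assumes a v3-8):

from accelerate import notebook_launcher

def train_fn():
    # The failure happens at import time inside the launched TPU process
    from transformers import Trainer
    print("import succeeded")

notebook_launcher(train_fn, num_processes=8)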
I was facing a similar issue (but with a different error message) on GPU as well, and installing the latest versions of the Hugging Face libraries I was using fixed it there:
!pip install \
git+https://github.com/huggingface/transformers.git \
git+https://github.com/huggingface/datasets.git \
git+https://github.com/huggingface/trl.git \
git+https://github.com/huggingface/peft.git \
git+https://github.com/huggingface/accelerate.git
But this doesn't fix it on TPU.
xla_latency_hiding_scheduler_rerun is an XLA flag whose default value we set in https://github.com/pytorch/xla/blob/66ed39ba5fa6fb487790df03a9a68a6f62f2c957/torch_xla/__init__.py#L46
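For anyone debugging this, a quick way to see the libtpu flags torch_xla injects at import time (assuming, as in the linked source, that they are passed via the LIBTPU_INIT_ARGS environment variable):

import os
import torch_xla  # triggers the default flag setup on import

# A libtpu loaded earlier by another library will not recognize newer flags set here
print(os.environ.get("LIBTPU_INIT_ARGS", "<unset>"))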
Do you mind doing a quick sanity check following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#check-pytorchxla-version ? I believe we have a special wheel built for Kaggle that bundles libtpu with pytorch/xla, so you shouldn't need to manually install libtpu.
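For reference, the check in that doc boils down to something like this (a minimal sketch; the exact device string you see is environment-dependent):

import torch
import torch_xla
import torch_xla.core.xla_model as xm

print(torch.__version__)
print(torch_xla.__version__)
dev = xm.xla_device()  # should resolve to an XLA/TPU device, e.g. xla:0
print(torch.ones(2, 2, device=dev))  # a tiny op to confirm the TPU responds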
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Having the same issue here
We found out that the issue is that tensorflow (the TPU version; tensorflow-cpu is fine) will always try to load libtpu first upon import. To overcome this issue you can pip uninstall tensorflow. Starting from the 2.4 release we will throw a warning message if TF is installed on the same host.
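In notebook form, the workaround looks something like this (a sketch; installing tensorflow-cpu is optional and only needed if the session still uses TF):

!pip uninstall -y tensorflow
!pip install tensorflow-cpu  # optional: keeps TF available without it grabbing libtpu

import transformers
from transformers import Trainer  # should no longer abort the session
print(transformers.__version__)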
Thanks for sharing @JackCaoG! Cc @Rocketknight1 for reference
A lifesaver! Thank you for the uninstall tensorflow solution.