
Code crashes without errors when importing Trainer in TPU context

Open samuele-bortolato opened this issue 1 year ago • 4 comments

System Info

I'm working on Kaggle with TPU enabled (TPU VM v3-8), running !transformers-cli env returns

[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:642] File already exists in database: tsl/profiler/protobuf/trace_events.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a80dd030fcf,5ab82e3a7b8f&map=
*** SIGABRT received by PID 367 (TID 367) on cpu 95 from PID 367; stack trace: ***
PC: @ 0x7a80dd07fd3c (unknown) (unknown)
    @ 0x7a7f654bba19 928 (unknown)
    @ 0x7a80dd030fd0 (unknown) (unknown)
    @ 0x5ab82e3a7b90 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a7f654bba18,7a80dd030fcf,5ab82e3a7b8f&map=310b7ae7682f84c5c576a0b0030121f2:7a7f56a00000-7a7f656d11c0
E0119 15:49:22.169993 367 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 15:49:22.170011 367 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 15:49:22.170016 367 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 15:49:22.170041 367 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 15:49:22.170050 367 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 15:50:17.482782 367 process_state.cc:808] RAW: Raising signal 6 with default behavior
Aborted (core dumped)

Importing and printing manually

import torch_xla
print(torch_xla.__version__)

2.1.0+libtpu

import torch
print(torch.__version__)

2.1.0+cu121

import transformers
print(transformers.__version__)

4.36.2

Who can help?

@muellerzr @stevhliu

I have been trying to port my code to TPU, but cannot manage to import the libraries.

In my code (written in PyTorch) I use the transformers library to load some pretrained LLMs, and I subclassed the Trainer class to train some custom models with RL.

The code works perfectly fine on GPU, but I can't get it to work on TPU: it keeps crashing without returning any error. Documentation on how to use TPUs with a PyTorch backend in the transformers library is still missing (two years after the page was created: https://huggingface.co/docs/transformers/v4.21.3/en/perf_train_tpu), so I have no idea whether I skipped a necessary step.

While the transformers library itself imports without problems, the whole session crashes when I try to import the Trainer class.

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

import torch_xla
print(torch_xla.__version__)

import torch
print(torch.__version__)

import transformers
print(transformers.__version__)

from transformers import Trainer

output:
-> 2.1.0+libtpu
-> 2.1.0+cu121
-> 4.36.2
-> (session crashes without output)

Expected behavior

It should either import the library or throw an error, not crash the whole session without a hint.

samuele-bortolato avatar Jan 19 '24 16:01 samuele-bortolato

I would like to work on this

naseemx avatar Jan 19 '24 16:01 naseemx

I have the same problem.

ILG2021 avatar Jan 24 '24 04:01 ILG2021

Having the same issue on Kaggle.

phineas-pta avatar Feb 15 '24 13:02 phineas-pta

Gentle ping @muellerzr

amyeroberts avatar Feb 15 '24 20:02 amyeroberts

The torch_xla team is aware of this and working towards fixing it

muellerzr avatar Mar 11 '24 13:03 muellerzr

@muellerzr is there a PR or Issue we can track and link here?

ArthurZucker avatar Apr 05 '24 12:04 ArthurZucker

Having the same issue on Kaggle, any update?

sitatec avatar Apr 05 '24 17:04 sitatec

@muellerzr In case it helps: when I try to import Trainer or SFTTrainer in the VM, no error is printed, but when I launch the script that contains the import on the TPU with accelerate launch or notebook_launcher I get this error message: ERROR: Unknown command line flag 'xla_latency_hiding_scheduler_rerun'

I was facing a similar issue (with a different error message) on GPU as well, but installing the latest versions of the Hugging Face libraries I was using fixed it:

!pip install \
git+https://github.com/huggingface/transformers.git \
git+https://github.com/huggingface/datasets.git \
git+https://github.com/huggingface/trl.git \
git+https://github.com/huggingface/peft.git \
git+https://github.com/huggingface/accelerate.git

But this doesn't fix it on TPU.

sitatec avatar Apr 05 '24 17:04 sitatec

xla_latency_hiding_scheduler_rerun is an XLA flag whose default value we set in https://github.com/pytorch/xla/blob/66ed39ba5fa6fb487790df03a9a68a6f62f2c957/torch_xla/__init__.py#L46

Do you mind doing a quick sanity check following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#check-pytorchxla-version? I believe we have a special wheel built for Kaggle that bundles libtpu with PyTorch/XLA, so you shouldn't need to install libtpu manually.

JackCaoG avatar Apr 09 '24 01:04 JackCaoG
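As a hedged sketch of such a version sanity check (the helper name and package list are illustrative, not from the troubleshooting guide), one can report the installed versions of the relevant distributions while tolerating any that are absent:

```python
# Illustrative helper: report installed versions of the packages involved
# in this issue, tolerating any that are not installed on the host.
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for pkg in packages:
        try:
            result[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            result[pkg] = None
    return result


if __name__ == "__main__":
    versions = installed_versions(["torch", "torch_xla", "transformers", "tensorflow"])
    for pkg, ver in versions.items():
        print(f"{pkg}: {ver if ver else 'not installed'}")
```

On a Kaggle TPU VM this would show at a glance whether the torch and torch_xla versions match (e.g. both 2.1.0) and whether a stray tensorflow install is present.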

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 03 '24 08:05 github-actions[bot]

Having the same issue here

elemosel avatar May 26 '24 07:05 elemosel

We found out that the issue is that TensorFlow (the TPU version; tensorflow-cpu is fine) always tries to load libtpu first upon import. To work around this you can pip uninstall tensorflow. Starting from the 2.4 release we will throw a warning message if TF is installed on the same host.

JackCaoG avatar May 28 '24 17:05 JackCaoG
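A minimal sketch of checking for that condition up front (the helper name is illustrative; the comment above only suggests uninstalling tensorflow):

```python
# Illustrative guard: warn if a 'tensorflow' distribution is importable,
# since on a TPU VM the TPU build of TensorFlow grabs libtpu on import,
# which can make a later torch_xla/transformers import crash the session.
import importlib.util


def tensorflow_present() -> bool:
    """Return True if a tensorflow package can be found on this host."""
    return importlib.util.find_spec("tensorflow") is not None


if tensorflow_present():
    print("tensorflow is installed; consider `pip uninstall tensorflow` on a TPU VM")
```

Note that importlib.util.find_spec only probes for the module without importing it, so running this check does not itself trigger the libtpu load.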

Thanks for sharing @JackCaoG! Cc @Rocketknight1 for reference

amyeroberts avatar May 29 '24 08:05 amyeroberts

A life saver! Thank you for the uninstall-tensorflow solution.

j1wonkim avatar Jul 10 '24 07:07 j1wonkim