Gemma finetuning on Kaggle TPU doesn't work
🐛 Bug
Not sure if this is a feature request or a bug. I took the SPMD Gemma fine-tuning code from Hugging Face and tried to run it on Kaggle; it didn't work.
trl seems to have an issue there.
To Reproduce
See my Kaggle notebook.
Expected behavior
Ideally it should run.
Environment
- Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
- torch_xla version: stock Kaggle env.
Additional context
OK, it seems that code is for Cloud TPU only, as mentioned in this HF blog. Then this is a feature request.
@alanwaketan
Kaggle is using an older version of torch-xla where torch.distributed.spmd is not implemented. I would recommend upgrading torch-xla:
!pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html
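A quick check along these lines (a minimal sketch, using only the torch_xla symbols already mentioned in this thread) shows whether the installed torch_xla exposes the SPMD module at all:

import torch_xla
print(torch_xla.__version__)

try:
    # transformers' FSDPv2 path imports this module and calls set_global_mesh on it
    import torch_xla.distributed.spmd as xs
    print("set_global_mesh available:", hasattr(xs, "set_global_mesh"))
except ImportError as err:
    print("SPMD module not available:", err)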
@windmaple You need to install the nightly torch-xla and torch.
Kaggle VM just silently dies after upgrading torch and torch-xla
!pip uninstall -y tensorflow
!pip install tensorflow-cpu #optional
That helped me get a little further with 2.2.0, but I still hit:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[9], line 42
34 fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": [
35 "GemmaDecoderLayer"
36 ],
37 "xla": True,
38 "xla_fsdp_v2": True,
39 "xla_fsdp_grad_ckpt": True}
41 # Finally, set up the trainer and train the model.
---> 42 trainer = SFTTrainer(
43 model=model,
44 train_dataset=data,
45 args=TrainingArguments(
46 per_device_train_batch_size=64, # This is actually the global batch size for SPMD.
47 num_train_epochs=100,
48 max_steps=-1,
49 output_dir="./output",
50 optim="adafactor",
51 logging_steps=1,
52 dataloader_drop_last = True, # Required for SPMD.
53 fsdp="full_shard",
54 fsdp_config=fsdp_config,
55 ),
56 peft_config=lora_config,
57 dataset_text_field="quote",
58 max_seq_length=max_seq_length,
59 packing=True,
60 )
62 trainer.train()
File /usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:299, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs)
293 if tokenizer.padding_side is not None and tokenizer.padding_side != "right":
294 warnings.warn(
295 "You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to "
296 "overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code."
297 )
--> 299 super().__init__(
300 model=model,
301 args=args,
302 data_collator=data_collator,
303 train_dataset=train_dataset,
304 eval_dataset=eval_dataset,
305 tokenizer=tokenizer,
306 model_init=model_init,
307 compute_metrics=compute_metrics,
308 callbacks=callbacks,
309 optimizers=optimizers,
310 preprocess_logits_for_metrics=preprocess_logits_for_metrics,
311 )
313 # Add tags for models that have been loaded with the correct transformers version
314 if hasattr(self.model, "add_model_tags"):
File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:653, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
649 if self.is_fsdp_xla_v2_enabled:
650 # Prepare the SPMD mesh that is going to be used by the data loader and the FSDPv2 wrapper.
651 # Tensor axis is just a placeholder where it will not be used in FSDPv2.
652 num_devices = xr.global_runtime_device_count()
--> 653 xs.set_global_mesh(xs.Mesh(np.array(range(num_devices)), (num_devices, 1), axis_names=("fsdp", "tensor")))
AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh'
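For reference, the failing Trainer line boils down to the following (a sketch reconstructed from the traceback above; it needs a torch_xla recent enough to provide xs.set_global_mesh, which 2.2.0 still lacks):

import numpy as np
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# Build a (num_devices, 1) device mesh with axes ("fsdp", "tensor") and
# register it globally, which is what transformers' FSDPv2 setup attempts.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.array(range(num_devices)), (num_devices, 1), axis_names=("fsdp", "tensor"))
xs.set_global_mesh(mesh)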
What's the right way to install nightly? I searched around but couldn't find it.
@windmaple Here are the instructions to install nightly: https://github.com/pytorch/xla#available-docker-images-and-wheels
I had the same problem as @windmaple:
AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh'
As @alanwaketan suggested, I installed the nightly build of torch-xla in a fresh conda env with the following packages:
conda create -n v_xla python=3.10
conda activate v_xla
pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl
pip install datasets peft transformers trl
python train.py
Here train.py is this script: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py
Running this script results in the following error:
Traceback (most recent call last):
File "/home/me/finetune/train.py", line 5, in <module>
import torch_xla
File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
import _XLAC
ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE
I am looking for workarounds.
@PawKanarek I'm stuck here too.
To resolve this problem
ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE
you have to update PyTorch to nightly:
conda install pytorch-nightly::pytorch
But after this I got a new problem:
File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device
return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.
I found similar issues: https://github.com/google/gemma_pytorch/issues/25, https://github.com/Lightning-AI/pytorch-lightning/issues/18932
@PawKanarek What's your libtpu version?
@windmaple Yeah, usually you just need nightly for both pytorch and pytorch/xla; pytorch/xla heavily depends on pytorch.
@alanwaketan I think that my libtpu version is tpu-vm-pt-2.0; this is based on the command that I used to create my TPU v4-8:
gcloud compute tpus tpu-vm create my-tpu-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-pt-2.0
Oh, I see in the documentation https://cloud.google.com/tpu/docs/supported-tpu-configurations#tpu_v4 that I should use tpu-vm-v4-pt-2.0. Thanks for the insight. ;)
@PawKanarek libtpu is a pip package; you can grep for it in pip list.
The latest version is:
pip list | grep libtpu
libtpu-nightly 0.1.dev20240213
If yours is older than this, you can update it via:
pip install torch-xla[tpuvm]
I've installed this package
libtpu-nightly 0.1.dev20240213
and I still get the same error:
File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device
return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.
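To separate a broken runtime from a broken training script, a minimal visibility check like this sketch (only using calls that already appear in this thread) should print the device count and a tensor instead of the topology error:

import os
os.environ.setdefault("PJRT_DEVICE", "TPU")  # assumes a TPU VM

import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

print(xr.global_runtime_device_count())         # e.g. 4 on a v4-8
print(torch.tensor(1, device=xm.xla_device()))  # forces a round-trip through libtpu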
@PawKanarek Could be a hardware issue then... Can you try recreating a new TPU VM?
tpu-vm-v4-pt-2.0 is a rather old image; do you mind following https://cloud.google.com/tpu/docs/run-calculation-pytorch and using the VM version tpu-ubuntu2204-base? If the framework and libtpu versions match and it still doesn't work, it is usually a hardware or driver issue.
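A version-match check is also easy to script (a sketch; the exact version strings will differ on your machine):

from importlib.metadata import version
import torch
import torch_xla

# torch, torch_xla and libtpu should come from the same release line
# (or the same nightly date); mismatches commonly show up as the
# undefined-symbol or TPU-topology errors seen above.
print(torch.__version__)           # e.g. 2.3.0.dev20240309
print(torch_xla.__version__)       # e.g. 2.3.0+git6043185
print(version("libtpu-nightly"))   # e.g. 0.1.dev20240213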
I created a new machine with the command
gcloud compute tpus tpu-vm create my-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-ubuntu2204-base
installed all required packages, and now when I try to run this script https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py I get this error:
(v_xla) me@tpu-1:~/finetune$ python train.py
Aborted (core dumped)
I will look for more specific errors :)
I managed to read the core dump file with the gdb tool, but sadly I cannot find any specific errors. This is what gdb shows me:
bt: Display the stack trace of the current thread.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140269997869056, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f9327042476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f93270287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007f932765c38a in _Unwind_Resume (exc=0x5e5c200) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:245
#6 0x00007f93270298d5 in __pthread_cleanup_combined_routine (__frame=<optimized out>) at ../sysdeps/nptl/pthreadP.h:609
#7 __pthread_once_slow (once_control=<optimized out>, init_routine=0x7f9326cdac90 <std::__once_proxy()>) at ./nptl/pthread_once.c:114
#8 0x0000000000000000 in ?? ()
bt full: Display the full stack trace
(gdb) bt full
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
tid = <optimized out>
ret = 0
pd = 0x7f9327654800
old_mask = {__val = {18446744073709551615, 140724683535936, 18446744073709551615, 18446744073709551615, 0, 10641313998539494912, 0, 140269997957756,
140269994049648, 140724683541272, 0, 0, 0, 0, 0, 0}}
ret = <optimized out>
pd = <optimized out>
old_mask = <optimized out>
ret = <optimized out>
tid = <optimized out>
ret = <optimized out>
resultvar = <optimized out>
resultvar = <optimized out>
__arg3 = <optimized out>
__arg2 = <optimized out>
__arg1 = <optimized out>
_a3 = <optimized out>
_a2 = <optimized out>
_a1 = <optimized out>
__futex = <optimized out>
resultvar = <optimized out>
__arg3 = <optimized out>
__arg2 = <optimized out>
__arg1 = <optimized out>
_a3 = <optimized out>
_a2 = <optimized out>
_a1 = <optimized out>
__futex = <optimized out>
__private = <optimized out>
__oldval = <optimized out>
result = <optimized out>
#1 __pthread_kill_internal (signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:78
No locals.
#2 __GI___pthread_kill (threadid=140269997869056, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
No locals.
#3 0x00007f9327042476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
ret = <optimized out>
#4 0x00007f93270287f3 in __GI_abort () at ./stdlib/abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 15 times>, 130843}}, sa_flags = 651013264,
sa_restorer = 0x7ffd04c60d40}
sigs = {__val = {32, 0 <repeats 15 times>}}
#5 0x00007f932765c38a in _Unwind_Resume (exc=0x5e5c200) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:245
this_context = {reg = {0x7ffd04c60d08, 0x7ffd04c60d10, 0x0, 0x7ffd04c60d18, 0x0, 0x0, 0x7ffd04c60d40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd04c60d20,
--Type <RET> for more, q to quit, c to continue without paging--c
0x7ffd04c60d28, 0x7ffd04c60d30, 0x7ffd04c60d38, 0x7ffd04c60d48, 0x0}, cfa = 0x7ffd04c60d50, ra = 0x7f93270298d5 <obstack_free[cold]>, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f932766aaf0 <_Unwind_Resume>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
cur_context = {reg = {0x7ffd04c60d08, 0x7ffd04c60d10, 0x0, 0x7ffd04c60d90, 0x0, 0x0, 0x7ffd04c60d98, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd04c60da0, 0x7ffd04c60d28, 0x7ffd04c60d30, 0x7ffd04c60d38, 0x7ffd04c60da8, 0x0}, cfa = 0x7ffd04c60db0, ra = 0x0, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f93270298ac <__pthread_once_slow.cold>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
code = <optimized out>
frames = 140724683541616
#6 0x00007f93270298d5 in __pthread_cleanup_combined_routine (__frame=<optimized out>) at ../sysdeps/nptl/pthreadP.h:609
No locals.
#7 __pthread_once_slow (once_control=<optimized out>, init_routine=0x7f9326cdac90 <std::__once_proxy()>) at ./nptl/pthread_once.c:114
__cancel_routine = 0x7f9327099f40 <clear_once_control>
__clframe = {__cancel_routine = 0x7f9327099f40 <clear_once_control>, __cancel_arg = 0x7f926e5a6be8 <torch_xla::InitXlaBackend()::register_key_flag>, __do_it = 0, __buffer = {__routine = 0x0, __arg = 0x0, __canceltype = 0, __prev = 0x0}}
val = <optimized out>
newval = <optimized out>
#8 0x0000000000000000 in ?? ()
No symbol table info available.
info threads: List all threads.
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f9327654800 (LWP 80077) __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
2 Thread 0x7f930dbfe640 (LWP 80079) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcae60 <thread_status+224>) at ./nptl/futex-internal.c:57
3 Thread 0x7f93093fd640 (LWP 80080) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcaee0 <thread_status+352>) at ./nptl/futex-internal.c:57
4 Thread 0x7f927cbc4640 (LWP 80137) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dccb60 <thread_status+7648>) at ./nptl/futex-internal.c:57
5 Thread 0x7f930e3ff640 (LWP 80078) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcade0 <thread_status+96>) at ./nptl/futex-internal.c:57
6 Thread 0x7f93013f9640 (LWP 80084) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb0e0 <thread_status+864>) at ./nptl/futex-internal.c:57
7 Thread 0x7f92fc3f7640 (LWP 80086) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb1e0 <thread_status+1120>) at ./nptl/futex-internal.c:57
8 Thread 0x7f92f9bf6640 (LWP 80087) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb260 <thread_status+1248>) at ./nptl/futex-internal.c:57
9 Thread 0x7f92f4bf4640 (LWP 80089) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb360 <thread_status+1504>) at ./nptl/futex-internal.c:57
10 Thread 0x7f92753c1640 (LWP 80140) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dccce0 <thread_status+8032>) at ./nptl/futex-internal.c:57
11 Thread 0x7f92efbf2640 (LWP 80091) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb460 <thread_status+1760>) at ./nptl/futex-internal.c:57
12 Thread 0x7f92febf8640 (LWP 80085) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb160 <thread_status+992>) at ./nptl/futex-internal.c:57
13 Thread 0x7f92e5bee640 (LWP 80095) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb660 <thread_status+2272>) at ./nptl/futex-internal.c:57
14 Thread 0x7f92e0bec640 (LWP 80097) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb760 <thread_status+2528>) at ./nptl/futex-internal.c:57
15 Thread 0x7f92dbbea640 (LWP 80099) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb860 <thread_status+2784>) at ./nptl/futex-internal.c:57
16 Thread 0x7f92d6be8640 (LWP 80101) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcb960 <thread_status+3040>) at ./nptl/futex-internal.c:57
17 Thread 0x7f92d1be6640 (LWP 80103) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcba60 <thread_status+3296>) at ./nptl/futex-internal.c:57
18 Thread 0x7f92ccbe4640 (LWP 80105) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcbb60 <thread_status+3552>) at ./nptl/futex-internal.c:57
19 Thread 0x7f92cf3e5640 (LWP 80104) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcbae0 <thread_status+3424>) at ./nptl/futex-internal.c:57
20 Thread 0x7f92ca3e3640 (LWP 80106) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcbbe0 <thread_status+3680>) at ./nptl/futex-internal.c:57
21 Thread 0x7f92c7be2640 (LWP 80107) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcbc60 <thread_status+3808>) at ./nptl/futex-internal.c:57
22 Thread 0x7f92c2be0640 (LWP 80109) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f9310dcbd60 <thread_status+4064>) at ./nptl/futex-internal.c:57
23 Thread 0x7f92c03df640 (LWP 80110) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
--Type <RET> for more, q to quit, c to continue without paging--c
futex_word=0x7f9310dcbde0 <thread_status+4192>) at ./nptl/futex-internal.c:57
24 Thread 0x7f92b8bdc640 (LWP 80113) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbf60 <thread_status+4576>) at ./nptl/futex-internal.c:57
25 Thread 0x7f92bdbde640 (LWP 80111) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbe60 <thread_status+4320>) at ./nptl/futex-internal.c:57
26 Thread 0x7f92bb3dd640 (LWP 80112) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbee0 <thread_status+4448>) at ./nptl/futex-internal.c:57
27 Thread 0x7f92b63db640 (LWP 80114) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbfe0 <thread_status+4704>) at ./nptl/futex-internal.c:57
28 Thread 0x7f92b3bda640 (LWP 80115) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc060 <thread_status+4832>) at ./nptl/futex-internal.c:57
29 Thread 0x7f92ac3d7640 (LWP 80118) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc1e0 <thread_status+5216>) at ./nptl/futex-internal.c:57
30 Thread 0x7f92a73d5640 (LWP 80120) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc2e0 <thread_status+5472>) at ./nptl/futex-internal.c:57
31 Thread 0x7f92a4bd4640 (LWP 80121) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc360 <thread_status+5600>) at ./nptl/futex-internal.c:57
32 Thread 0x7f92a9bd6640 (LWP 80119) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc260 <thread_status+5344>) at ./nptl/futex-internal.c:57
33 Thread 0x7f929fbd2640 (LWP 80123) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc460 <thread_status+5856>) at ./nptl/futex-internal.c:57
34 Thread 0x7f92aebd8640 (LWP 80117) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc160 <thread_status+5088>) at ./nptl/futex-internal.c:57
35 Thread 0x7f92a23d3640 (LWP 80122) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc3e0 <thread_status+5728>) at ./nptl/futex-internal.c:57
36 Thread 0x7f929abd0640 (LWP 80125) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc560 <thread_status+6112>) at ./nptl/futex-internal.c:57
37 Thread 0x7f92983cf640 (LWP 80126) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc5e0 <thread_status+6240>) at ./nptl/futex-internal.c:57
38 Thread 0x7f929d3d1640 (LWP 80124) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc4e0 <thread_status+5984>) at ./nptl/futex-internal.c:57
39 Thread 0x7f9295bce640 (LWP 80127) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc660 <thread_status+6368>) at ./nptl/futex-internal.c:57
40 Thread 0x7f92933cd640 (LWP 80128) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc6e0 <thread_status+6496>) at ./nptl/futex-internal.c:57
41 Thread 0x7f9290bcc640 (LWP 80129) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc760 <thread_status+6624>) at ./nptl/futex-internal.c:57
42 Thread 0x7f928e3cb640 (LWP 80130) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc7e0 <thread_status+6752>) at ./nptl/futex-internal.c:57
43 Thread 0x7f928bbca640 (LWP 80131) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc860 <thread_status+6880>) at ./nptl/futex-internal.c:57
44 Thread 0x7f9286bc8640 (LWP 80133) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc960 <thread_status+7136>) at ./nptl/futex-internal.c:57
45 Thread 0x7f92843c7640 (LWP 80134) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc9e0 <thread_status+7264>) at ./nptl/futex-internal.c:57
46 Thread 0x7f9281bc6640 (LWP 80135) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcca60 <thread_status+7392>) at ./nptl/futex-internal.c:57
47 Thread 0x7f927a3c3640 (LWP 80138) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccbe0 <thread_status+7776>) at ./nptl/futex-internal.c:57
48 Thread 0x7f92893c9640 (LWP 80132) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc8e0 <thread_status+7008>) at ./nptl/futex-internal.c:57
49 Thread 0x7f927f3c5640 (LWP 80136) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccae0 <thread_status+7520>) at ./nptl/futex-internal.c:57
50 Thread 0x7f9308bfc640 (LWP 80081) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcaf60 <thread_status+480>) at ./nptl/futex-internal.c:57
51 Thread 0x7f93063fb640 (LWP 80082) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcafe0 <thread_status+608>) at ./nptl/futex-internal.c:57
52 Thread 0x7f9277bc2640 (LWP 80139) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccc60 <thread_status+7904>) at ./nptl/futex-internal.c:57
53 Thread 0x7f9301bfa640 (LWP 80083) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb060 <thread_status+736>) at ./nptl/futex-internal.c:57
54 Thread 0x7f92f23f3640 (LWP 80090) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb3e0 <thread_status+1632>) at ./nptl/futex-internal.c:57
55 Thread 0x7f92f73f5640 (LWP 80088) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb2e0 <thread_status+1376>) at ./nptl/futex-internal.c:57
56 Thread 0x7f92ed3f1640 (LWP 80092) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb4e0 <thread_status+1888>) at ./nptl/futex-internal.c:57
57 Thread 0x7f92eabf0640 (LWP 80093) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb560 <thread_status+2016>) at ./nptl/futex-internal.c:57
58 Thread 0x7f92e33ed640 (LWP 80096) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb6e0 <thread_status+2400>) at ./nptl/futex-internal.c:57
59 Thread 0x7f92e83ef640 (LWP 80094) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb5e0 <thread_status+2144>) at ./nptl/futex-internal.c:57
60 Thread 0x7f92de3eb640 (LWP 80098) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb7e0 <thread_status+2656>) at ./nptl/futex-internal.c:57
61 Thread 0x7f92d93e9640 (LWP 80100) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb8e0 <thread_status+2912>) at ./nptl/futex-internal.c:57
62 Thread 0x7f92d43e7640 (LWP 80102) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb9e0 <thread_status+3168>) at ./nptl/futex-internal.c:57
63 Thread 0x7f92c53e1640 (LWP 80108) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbce0 <thread_status+3936>) at ./nptl/futex-internal.c:57
64 Thread 0x7f92b13d9640 (LWP 80116) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc0e0 <thread_status+4960>) at ./nptl/futex-internal.c:57
list: Show the source code (if available) around the current line.
(gdb) list
39 in ./nptl/pthread_kill.c
info sharedlibrary: list shared libraries loaded by the program at the time of the crash.
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
0x00007f9327413e00 0x00007f93274353c3 Yes (*) /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
0x00007f93277b2040 0x00007f93277b2105 Yes /lib/x86_64-linux-gnu/libpthread.so.0
0x00007f93277ad040 0x00007f93277ad105 Yes /lib/x86_64-linux-gnu/libdl.so.2
0x00007f93277a8040 0x00007f93277a8105 Yes /lib/x86_64-linux-gnu/libutil.so.1
0x00007f93276ce3a0 0x00007f93277498c8 Yes /lib/x86_64-linux-gnu/libm.so.6
0x00007f9327028700 0x00007f93271ba93d Yes /lib/x86_64-linux-gnu/libc.so.6
0x00007f93276a5280 0x00007f93276ae5bf Yes (*) /lib/x86_64-linux-gnu/libunwind.so.8
0x00007f9326ca5150 0x00007f9326d95b31 Yes /home/raix/miniconda3/envs/v_xla/bin/../lib/libstdc++.so.6
0x00007f93277c0090 0x00007f93277e9315 Yes /lib64/ld-linux-x86-64.so.2
0x00007f9327677050 0x00007f9327693c51 Yes (*) /home/raix/miniconda3/envs/v_xla/bin/../lib/liblzma.so.5
0x00007f932765c320 0x00007f932766d6e1 Yes /home/raix/miniconda3/envs/v_xla/bin/../lib/libgcc_s.so.1
0x00007f932763c050 0x00007f9327643411 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/math.cpython-310-x86_64-linux-gnu.so
0x00007f9327632050 0x00007f9327633081 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/fcntl.cpython-310-x86_64-linux-gnu.so
0x00007f932762b050 0x00007f932762cf71 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_posixsubprocess.cpython-310-x86_64-linux-gnu.so
0x00007f9327621050 0x00007f93276231c1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/select.cpython-310-x86_64-linux-gnu.so
0x00007f9327290050 0x00007f932729d7d1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
0x00007f9327611000 0x00007f9327619791 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libffi.so.8
0x00007f932727e050 0x00007f9327282a01 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_struct.cpython-310-x86_64-linux-gnu.so
0x00007f93277b7050 0x00007f93277b7391 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_opcode.cpython-310-x86_64-linux-gnu.so
0x00007f9327604050 0x00007f9327607251 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/zlib.cpython-310-x86_64-linux-gnu.so
0x00007f932725f050 0x00007f9327270241 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libz.so.1
0x00007f9327256050 0x00007f9327257de1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so
0x00007f9327242050 0x00007f932724f431 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libbz2.so.1.0
0x00007f9327237050 0x00007f932723a8f1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_lzma.cpython-310-x86_64-linux-gnu.so
0x00007f932722f050 0x00007f9327230031 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_bisect.cpython-310-x86_64-linux-gnu.so
0x00007f9326efb050 0x00007f9326efcbb1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_random.cpython-310-x86_64-linux-gnu.so
0x00007f9326ef1050 0x00007f9326ef5bf1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_sha512.cpython-310-x86_64-linux-gnu.so
0x00007f9326eeb050 0x00007f9326eeb105 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so
0x00007f9325458390 0x00007f9325f531c0 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_intel_lp64.so
0x00007f9323602bf0 0x00007f9324da8feb Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_gnu_thread.so
0x00007f931f01ab00 0x00007f9322713b80 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_core.so
0x00007f9326eb1730 0x00007f9326edbec1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libgomp.so.1
0x00007f9326ea2050 0x00007f9326ea2115 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so
0x00007f931de21b40 0x00007f931e9432b8 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_python.so
0x00007f9326e9a440 0x00007f9326e9c5d3 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libshm.so
0x00007f9326e81890 0x00007f9326e8dc90 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch.so
0x00007f93125311c0 0x00007f931b914530 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
0x00007f9326523270 0x00007f93265accb4 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libc10.so
0x00007f9326e66080 0x00007f9326e66275 Yes /lib/x86_64-linux-gnu/librt.so.1
0x00007f931102da70 0x00007f9311508663 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
0x00007f930ef18000 0x00007f9310b98c5c Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
--Type <RET> for more, q to quit, c to continue without paging--c
0x00007f930e81b870 0x00007f930ea46837 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libgfortran-040039e1.so.5.0.0
0x00007f930e4023e0 0x00007f930e425d2b Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libquadmath-96973f99.so.0.0.0
0x00007f9326e4b050 0x00007f9326e5b571 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_datetime.cpython-310-x86_64-linux-gnu.so
0x00007f9326e2a050 0x00007f9326e3bf61 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so
0x00007f9326e6c050 0x00007f9326e6c211 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_contextvars.cpython-310-x86_64-linux-gnu.so
0x00007f93250e0e70 0x00007f93250f6299 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/_multiarray_tests.cpython-310-x86_64-linux-gnu.so
0x00007f93250aec20 0x00007f93250cdda2 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/linalg/_umath_linalg.cpython-310-x86_64-linux-gnu.so
0x00007f9325091170 0x00007f93250a3adf Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/fft/_pocketfft_internal.cpython-310-x86_64-linux-gnu.so
0x00007f930ed528f0 0x00007f930edaafd4 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/mtrand.cpython-310-x86_64-linux-gnu.so
0x00007f931edcf8f0 0x00007f931edf12cf Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-x86_64-linux-gnu.so
0x00007f931ed90830 0x00007f931edc12ab Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_common.cpython-310-x86_64-linux-gnu.so
0x00007f9326e1b050 0x00007f9326e1eff1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/binascii.cpython-310-x86_64-linux-gnu.so
0x00007f9326af3050 0x00007f9326af8101 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_hashlib.cpython-310-x86_64-linux-gnu.so
0x00007f92706b9000 0x00007f927090412f Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libcrypto.so.3
0x00007f9325084050 0x00007f932508b551 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_blake2.cpython-310-x86_64-linux-gnu.so
0x00007f931dbaf840 0x00007f931dbf3fdf Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_bounded_integers.cpython-310-x86_64-linux-gnu.so
0x00007f93232e85e0 0x00007f93232f86d2 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_mt19937.cpython-310-x86_64-linux-gnu.so
0x00007f931ed77610 0x00007f931ed8502c Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_philox.cpython-310-x86_64-linux-gnu.so
0x00007f931db92600 0x00007f931dba35eb Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_pcg64.cpython-310-x86_64-linux-gnu.so
0x00007f93232d7590 0x00007f93232df1ba Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_sfc64.cpython-310-x86_64-linux-gnu.so
0x00007f930e727c90 0x00007f930e7a6f67 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_generator.cpython-310-x86_64-linux-gnu.so
0x00007f93116f8050 0x00007f93116faa21 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_heapq.cpython-310-x86_64-linux-gnu.so
0x00007f9326aeb050 0x00007f9326aeba21 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/grp.cpython-310-x86_64-linux-gnu.so
0x00007f93116ed050 0x00007f93116f2eb1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_json.cpython-310-x86_64-linux-gnu.so
0x00007f93116d9050 0x00007f93116e2f81 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/cmath.cpython-310-x86_64-linux-gnu.so
0x00007f9310eec050 0x00007f9310ef41c1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_socket.cpython-310-x86_64-linux-gnu.so
0x00007f9310ed9050 0x00007f9310edfa81 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/array.cpython-310-x86_64-linux-gnu.so
0x00007f931db8b050 0x00007f931db8be21 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_multiprocessing.cpython-310-x86_64-linux-gnu.so
0x00007f9326e15050 0x00007f9326e15231 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_uuid.cpython-310-x86_64-linux-gnu.so
0x00007f93116d0050 0x00007f93116d35c1 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libuuid.so.1
0x00007f9310eb3050 0x00007f9310ebc351 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_ssl.cpython-310-x86_64-linux-gnu.so
0x00007f930ecca050 0x00007f930ed1f993 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libssl.so.3
0x00007f926f8f1050 0x00007f926f8f4461 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/unicodedata.cpython-310-x86_64-linux-gnu.so
0x00007f9310e9c050 0x00007f9310e9ca01 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_queue.cpython-310-x86_64-linux-gnu.so
0x00007f9310e8e050 0x00007f9310e92581 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so
0x00007f92637c5480 0x00007f926c0cf9ef Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
0x00007f925ee5c050 0x00007f925f066fc1 Yes /home/raix/miniconda3/envs/v_xla/lib/libpython3.10.so.1.0
0x00007f930e6e8040 0x00007f930e6fb97b Yes (*) /lib/x86_64-linux-gnu/libcrypt.so.1
0x00007f9310e85090 0x00007f9310e851b5 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/charset_normalizer/md.cpython-310-x86_64-linux-gnu.so
0x00007f930e6bb280 0x00007f930e6d5e05 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/charset_normalizer/md__mypyc.cpython-310-x86_64-linux-gnu.so
0x00007f930eca0050 0x00007f930eca56f1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_multibytecodec.cpython-310-x86_64-linux-gnu.so
0x00007f930e67d050 0x00007f930e6a36e1 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/yaml/_yaml.cpython-310-x86_64-linux-gnu.so
0x00007f930e65a040 0x00007f930e66ff45 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/yaml/../../../libyaml-0.so.2
0x00007f926e674050 0x00007f926e6c5421 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/gmpy2.cpython-310-x86_64-linux-gnu.so
0x00007f925e804a30 0x00007f925e814b53 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libmpc.so.3
0x00007f9270a4f040 0x00007f9270aaba57 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libmpfr.so.6
0x00007f925eb6d080 0x00007f925ebe4326 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libgmp.so.10
0x00007f925f1ba050 0x00007f925f1ebbf1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_decimal.cpython-310-x86_64-linux-gnu.so
0x00007f930ec97050 0x00007f930ec97ef1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so
0x00007f930e652050 0x00007f930e653f41 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_lsprof.cpython-310-x86_64-linux-gnu.so
0x00007f925d146090 0x00007f925d1d12ef Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/safetensors/_safetensors_rust.cpython-310-x86_64-linux-gnu.so
0x00007f930e646050 0x00007f930e64a181 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_csv.cpython-310-x86_64-linux-gnu.so
(*): Shared library is missing debugging information.
@JackCaoG Now I created the v4-8 machine with this VM version: tpu-vm-v4-pt-2.0
gcloud compute tpus tpu-vm create myname --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-v4-pt-2.0
And now I'm getting a different message, but at least it's readable now :)
python server/server.py
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1710013726.247688 30296 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1710013726.247769 30296 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1710013726.247774 30296 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/runtime.py:247: UserWarning: Replicating tensors already initialized on non-virtual XLA device for SPMD to force SPMD mode. This is one-time overhead to setup, and to minimize such, please set SPMD mode before initializting tensors (i.e., call use_spmd() in the beginning of the program).
warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.07it/s]
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
warnings.warn(
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/transformers/training_args.py:1827: FutureWarning: `--push_to_hub_model_id` and `--push_to_hub_organization` are deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_model_id` instead and pass the full repo name to this argument (in this case google/gemma-2-it).
warnings.warn(
https://symbolize.stripped_domain/r/?trace=7f8ddd4d4953,7f8ea111e3bf,7f8de5b4364d,7f8dde56762d,7f8de5b46273,7f8dddde807a,7f8dddbdb4ea,7f8e90515509&map=
*** SIGSEGV (@0x1d8), see go/stacktraces#s15 received by PID 30296 (TID 31841) on cpu 195; stack trace: ***
PC: @ 0x7f8ddd4d4953 (unknown) torch_xla::runtime::PjRtComputationClient::ExecuteReplicated()::{lambda()#1}::operator()()
@ 0x7f8d6c18c6a7 928 (unknown)
@ 0x7f8ea111e3c0 1984 (unknown)
@ 0x7f8de5b4364e 32 std::_Function_handler<>::_M_invoke()
@ 0x7f8dde56762e 288 Eigen::ThreadPoolDevice::parallelFor()
@ 0x7f8de5b46274 576 tsl::thread::ThreadPool::ParallelFor()
@ 0x7f8dddde807b 1168 torch_xla::runtime::PjRtComputationClient::ExecuteReplicated()
@ 0x7f8dddbdb4eb 624 torch_xla::XLAGraphExecutor::ScheduleSyncTensorsGraph()::{lambda()#1}::operator()()
@ 0x7f8e9051550a (unknown) torch::lazy::MultiWait::Complete()
@ ... and at least 1 more frames
https://symbolize.stripped_domain/r/?trace=7f8ddd4d4953,7f8d6c18c6a6,7f8ea111e3bf,7f8de5b4364d,7f8dde56762d,7f8de5b46273,7f8dddde807a,7f8dddbdb4ea,7f8e90515509&map=
E0309 19:48:53.091365 31841 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0309 19:48:53.091373 31841 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0309 19:48:53.091379 31841 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0309 19:48:53.091381 31841 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0309 19:48:53.091395 31841 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0309 19:48:53.091399 31841 coredump_hook.cc:598] RAW: Dumping core locally.
E0309 19:48:53.337414 31841 process_state.cc:807] RAW: Raising signal 11 with default behavior
Segmentation fault (core dumped)
Can you follow https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#sanity-check to run a ResNet with fake data? I am not sure if it is an env setup issue or a Gemma issue in your case.
Thanks for the advice; the sanity check looks good on this TPU. Imports:
python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla
>>> print(torch.__version__)
2.3.0.dev20240309
>>> print(torch_xla.__version__)
2.3.0+git6043185
Simple calculation:
python3
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla.core.xla_model as xm
>>> t1 = torch.tensor(100, device=xm.xla_device())
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1710270930.199793 326792 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/raix/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1710270930.199885 326792 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1710270930.199890 326792 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
>>> t2 = torch.tensor(200, device=xm.xla_device())
>>> print(t1 + t2)
tensor(300, device='xla:0')
>>>
ImageNet:
Epoch 18 train end 20:13:54
| Test Device=xla:0/0 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/0 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/0 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/3 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/0 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/2 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/1 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/2 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/0 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/3 Step=80 Epoch=18 Time=20:13:55
Epoch 18 test end 20:13:55, Accuracy=100.00
Max Accuracy: 100.00%
@PawKanarek For Gemma, have you set the following env vars: PJRT_DEVICE=TPU XLA_USE_SPMD=1?
It seems that setting export PJRT_DEVICE=TPU and export XLA_USE_SPMD=1 resolved the issue. I was certain I had exported the variables... Training now works, though it occasionally crashes on larger datasets; no problems on smaller datasets. Thanks!
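For completeness, the same can be done from inside the script (a sketch; use_spmd() is the call the runtime warning earlier in this thread recommends), as long as it runs before any tensor is placed on the XLA device:

import os
os.environ["PJRT_DEVICE"] = "TPU"   # equivalent of `export PJRT_DEVICE=TPU`
os.environ["XLA_USE_SPMD"] = "1"    # equivalent of `export XLA_USE_SPMD=1`

import torch_xla.runtime as xr
xr.use_spmd()  # enable SPMD mode before creating any XLA tensors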
I would love to learn more about the crash as well! Do you mind opening a new bug?
@windmaple @PawKanarek Are we good to close this issue?
The problem with AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh' was resolved on my machine.