Gemma finetuning on Kaggle TPU doesn't work

Open windmaple opened this issue 1 year ago • 24 comments

🐛 Bug

Not sure if this is a feature request or a bug. I took the SPMD Gemma fine-tuning code from Hugging Face and tried to run it on Kaggle; it didn't work.

trl seems to have an issue there.

To Reproduce

See my Kaggle notebook.

Expected behavior

Ideally it should run.

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
  • torch_xla version: stock Kaggle env.

Additional context

windmaple avatar Feb 24 '24 07:02 windmaple

OK, it seems that code is for Cloud TPU only, as mentioned in this HF blog. Then this is a feature request.

windmaple avatar Feb 24 '24 07:02 windmaple

@alanwaketan

JackCaoG avatar Feb 26 '24 18:02 JackCaoG

Kaggle is using an older version of torch-xla where torch_xla.distributed.spmd is not implemented; I would recommend upgrading torch-xla:

!pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html

IsNoobgrammer avatar Mar 02 '24 11:03 IsNoobgrammer

@windmaple You need to install the nightly torch-xla and torch.

alanwaketan avatar Mar 05 '24 01:03 alanwaketan

Kaggle VM just silently dies after upgrading torch and torch-xla

windmaple avatar Mar 05 '24 05:03 windmaple

Kaggle VM just silently dies after upgrading torch and torch-xla

!pip uninstall -y tensorflow
!pip install tensorflow-cpu #optional

IsNoobgrammer avatar Mar 05 '24 05:03 IsNoobgrammer

That helped me get a little further with 2.2.0, but still:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 42
     34 fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": [
     35         "GemmaDecoderLayer"
     36     ],
     37     "xla": True,
     38     "xla_fsdp_v2": True,
     39     "xla_fsdp_grad_ckpt": True}
     41 # Finally, set up the trainer and train the model.
---> 42 trainer = SFTTrainer(
     43     model=model,
     44     train_dataset=data,
     45     args=TrainingArguments(
     46         per_device_train_batch_size=64,  # This is actually the global batch size for SPMD.
     47         num_train_epochs=100,
     48         max_steps=-1,
     49         output_dir="./output",
     50         optim="adafactor",
     51         logging_steps=1,
     52         dataloader_drop_last = True,  # Required for SPMD.
     53         fsdp="full_shard",
     54         fsdp_config=fsdp_config,
     55     ),
     56     peft_config=lora_config,
     57     dataset_text_field="quote",
     58     max_seq_length=max_seq_length,
     59     packing=True,
     60 )
     62 trainer.train()

File /usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:299, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs)
    293 if tokenizer.padding_side is not None and tokenizer.padding_side != "right":
    294     warnings.warn(
    295         "You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to "
    296         "overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code."
    297     )
--> 299 super().__init__(
    300     model=model,
    301     args=args,
    302     data_collator=data_collator,
    303     train_dataset=train_dataset,
    304     eval_dataset=eval_dataset,
    305     tokenizer=tokenizer,
    306     model_init=model_init,
    307     compute_metrics=compute_metrics,
    308     callbacks=callbacks,
    309     optimizers=optimizers,
    310     preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    311 )
    313 # Add tags for models that have been loaded with the correct transformers version
    314 if hasattr(self.model, "add_model_tags"):

File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:653, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    649 if self.is_fsdp_xla_v2_enabled:
    650     # Prepare the SPMD mesh that is going to be used by the data loader and the FSDPv2 wrapper.
    651     # Tensor axis is just a placeholder where it will not be used in FSDPv2.
    652     num_devices = xr.global_runtime_device_count()
--> 653     xs.set_global_mesh(xs.Mesh(np.array(range(num_devices)), (num_devices, 1), axis_names=("fsdp", "tensor")))

AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh'

What's the right way to install nightly? I searched around but couldn't find it.

windmaple avatar Mar 05 '24 09:03 windmaple
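As a side note, here is a minimal standalone check (a sketch, assuming a TPU runtime with PJRT_DEVICE=TPU) that exercises the same SPMD calls the transformers FSDPv2 path uses in the traceback above; on wheels that predate set_global_mesh it fails with the same AttributeError:

import numpy as np
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# Same mesh setup that transformers/trainer.py performs for FSDPv2 (see traceback above).
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.array(range(num_devices)), (num_devices, 1), axis_names=("fsdp", "tensor"))
xs.set_global_mesh(mesh)  # AttributeError on the torch_xla 2.2.0 wheel, per the error above
print(f"Global SPMD mesh set across {num_devices} devices")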

@windmaple Here are the instructions to install the nightly builds: https://github.com/pytorch/xla#available-docker-images-and-wheels

alanwaketan avatar Mar 05 '24 19:03 alanwaketan

I had the same problem as @windmaple:

AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh'

As @alanwaketan suggested, I installed the nightly build of torch_xla in a fresh conda env with the following packages:

conda create -n v_xla python=3.10
conda activate v_xla
pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl
pip install datasets peft transformers trl
python train.py

Where train.py is this script https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py

Running this script results in the following error:

Traceback (most recent call last):
  File "/home/me/finetune/train.py", line 5, in <module>
    import torch_xla
  File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
    import _XLAC
ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE

I am looking for workarounds.

PawKanarek avatar Mar 07 '24 12:03 PawKanarek

@PawKanarek I'm stuck here too.

windmaple avatar Mar 07 '24 12:03 windmaple

To resolve this problem:

ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE

you have to update PyTorch to the nightly build:

conda install pytorch-nightly::pytorch

But after this I got a new problem:

File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.

I found similar issues: https://github.com/google/gemma_pytorch/issues/25, https://github.com/Lightning-AI/pytorch-lightning/issues/18932

PawKanarek avatar Mar 07 '24 12:03 PawKanarek

@PawKanarek What's your libtpu version?

alanwaketan avatar Mar 07 '24 21:03 alanwaketan

@windmaple Yeah, usually you just need the nightly builds for both pytorch and pytorch/xla; pytorch/xla heavily depends on pytorch.

alanwaketan avatar Mar 07 '24 21:03 alanwaketan
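For what it's worth, a quick sanity check (a sketch; the exact version strings will vary) to confirm both packages are matching nightly builds before debugging further, since a mismatch shows up as undefined C++ symbols like the one above:

import torch
import torch_xla

# Both should report a matching nightly build (e.g. 2.3.0.devYYYYMMDD / 2.3.0+gitXXXXXXX);
# a stable torch paired with a nightly torch_xla leads to the ImportError: undefined symbol above.
print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)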

@alanwaketan I think my libtpu version is tpu-vm-pt-2.0, based on the command I used to create my TPU v4-8:

gcloud compute tpus tpu-vm create my-tpu-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-pt-2.0

Oh, I see in the documentation https://cloud.google.com/tpu/docs/supported-tpu-configurations#tpu_v4 that I should have used tpu-vm-v4-pt-2.0. Thanks for the insight. ;)

PawKanarek avatar Mar 07 '24 22:03 PawKanarek

@PawKanarek libtpu is a pip package; you can grep for it in pip list.

The latest version is:

pip list | grep libtpu
libtpu-nightly           0.1.dev20240213

If yours is older than this, you can update it via:

pip install torch-xla[tpuvm]

alanwaketan avatar Mar 07 '24 22:03 alanwaketan
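A small sketch, assuming the nightly package name shown above, for checking the same thing from inside a Python session instead of pip list:

import importlib.metadata

# Prints the installed libtpu build, e.g. 0.1.dev20240213 per the comment above.
print(importlib.metadata.version("libtpu-nightly"))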

I've installed this package

libtpu-nightly           0.1.dev20240213

and I still get the same error:

  File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.

PawKanarek avatar Mar 08 '24 22:03 PawKanarek

@PawKanarek Could be a hardware issue then... Can you try recreating the TPU VM?

alanwaketan avatar Mar 08 '24 22:03 alanwaketan

tpu-vm-v4-pt-2.0 is a fairly old image; do you mind following https://cloud.google.com/tpu/docs/run-calculation-pytorch and using the VM version tpu-ubuntu2204-base? If the framework and libtpu versions match and it still doesn't work, it is usually a hardware or driver issue.

JackCaoG avatar Mar 08 '24 23:03 JackCaoG

I created a new machine with the command

 gcloud compute tpus tpu-vm create my-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-ubuntu2204-base

installed all required packages, and now when I try to run this script https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py I get this error:

(v_xla) me@tpu-1:~/finetune$ python train.py 
Aborted (core dumped)

I will look for more specific errors :)

PawKanarek avatar Mar 08 '24 23:03 PawKanarek

This might be irrelevant.

I managed to read the core dump file with gdb, but sadly I cannot find any specific errors. This is what gdb is showing me:

  • bt: display the stack trace of the current thread

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140269997869056, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f9327042476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f93270287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f932765c38a in _Unwind_Resume (exc=0x5e5c200) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:245
#6  0x00007f93270298d5 in __pthread_cleanup_combined_routine (__frame=<optimized out>) at ../sysdeps/nptl/pthreadP.h:609
#7  __pthread_once_slow (once_control=<optimized out>, init_routine=0x7f9326cdac90 <std::__once_proxy()>) at ./nptl/pthread_once.c:114
#8  0x0000000000000000 in ?? ()

  • bt full: display the full stack trace

(gdb) bt full
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = 0x7f9327654800
        old_mask = {__val = {18446744073709551615, 140724683535936, 18446744073709551615, 18446744073709551615, 0, 10641313998539494912, 0, 140269997957756, 
            140269994049648, 140724683541272, 0, 0, 0, 0, 0, 0}}
        ret = <optimized out>
        pd = <optimized out>
        old_mask = <optimized out>
        ret = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
        resultvar = <optimized out>
        resultvar = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
        __futex = <optimized out>
        resultvar = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
        __futex = <optimized out>
        __private = <optimized out>
        __oldval = <optimized out>
        result = <optimized out>
#1  __pthread_kill_internal (signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:78
No locals.
#2  __GI___pthread_kill (threadid=140269997869056, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
No locals.
#3  0x00007f9327042476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#4  0x00007f93270287f3 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 15 times>, 130843}}, sa_flags = 651013264, 
          sa_restorer = 0x7ffd04c60d40}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#5  0x00007f932765c38a in _Unwind_Resume (exc=0x5e5c200) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:245
        this_context = {reg = {0x7ffd04c60d08, 0x7ffd04c60d10, 0x0, 0x7ffd04c60d18, 0x0, 0x0, 0x7ffd04c60d40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd04c60d20, 
--Type <RET> for more, q to quit, c to continue without paging--c
            0x7ffd04c60d28, 0x7ffd04c60d30, 0x7ffd04c60d38, 0x7ffd04c60d48, 0x0}, cfa = 0x7ffd04c60d50, ra = 0x7f93270298d5 <obstack_free[cold]>, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f932766aaf0 <_Unwind_Resume>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
        cur_context = {reg = {0x7ffd04c60d08, 0x7ffd04c60d10, 0x0, 0x7ffd04c60d90, 0x0, 0x0, 0x7ffd04c60d98, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd04c60da0, 0x7ffd04c60d28, 0x7ffd04c60d30, 0x7ffd04c60d38, 0x7ffd04c60da8, 0x0}, cfa = 0x7ffd04c60db0, ra = 0x0, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f93270298ac <__pthread_once_slow.cold>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
        code = <optimized out>
        frames = 140724683541616
#6  0x00007f93270298d5 in __pthread_cleanup_combined_routine (__frame=<optimized out>) at ../sysdeps/nptl/pthreadP.h:609
No locals.
#7  __pthread_once_slow (once_control=<optimized out>, init_routine=0x7f9326cdac90 <std::__once_proxy()>) at ./nptl/pthread_once.c:114
        __cancel_routine = 0x7f9327099f40 <clear_once_control>
        __clframe = {__cancel_routine = 0x7f9327099f40 <clear_once_control>, __cancel_arg = 0x7f926e5a6be8 <torch_xla::InitXlaBackend()::register_key_flag>, __do_it = 0, __buffer = {__routine = 0x0, __arg = 0x0, __canceltype = 0, __prev = 0x0}}
        val = <optimized out>
        newval = <optimized out>
#8  0x0000000000000000 in ?? ()
No symbol table info available.
  • info threads: list all threads.
(gdb) info threads
  Id   Target Id                         Frame 
* 1    Thread 0x7f9327654800 (LWP 80077) __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
  2    Thread 0x7f930dbfe640 (LWP 80079) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcae60 <thread_status+224>) at ./nptl/futex-internal.c:57
  3    Thread 0x7f93093fd640 (LWP 80080) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcaee0 <thread_status+352>) at ./nptl/futex-internal.c:57
  4    Thread 0x7f927cbc4640 (LWP 80137) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dccb60 <thread_status+7648>) at ./nptl/futex-internal.c:57
  5    Thread 0x7f930e3ff640 (LWP 80078) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcade0 <thread_status+96>) at ./nptl/futex-internal.c:57
  6    Thread 0x7f93013f9640 (LWP 80084) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb0e0 <thread_status+864>) at ./nptl/futex-internal.c:57
  7    Thread 0x7f92fc3f7640 (LWP 80086) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb1e0 <thread_status+1120>) at ./nptl/futex-internal.c:57
  8    Thread 0x7f92f9bf6640 (LWP 80087) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb260 <thread_status+1248>) at ./nptl/futex-internal.c:57
  9    Thread 0x7f92f4bf4640 (LWP 80089) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb360 <thread_status+1504>) at ./nptl/futex-internal.c:57
  10   Thread 0x7f92753c1640 (LWP 80140) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dccce0 <thread_status+8032>) at ./nptl/futex-internal.c:57
  11   Thread 0x7f92efbf2640 (LWP 80091) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb460 <thread_status+1760>) at ./nptl/futex-internal.c:57
  12   Thread 0x7f92febf8640 (LWP 80085) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb160 <thread_status+992>) at ./nptl/futex-internal.c:57
  13   Thread 0x7f92e5bee640 (LWP 80095) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb660 <thread_status+2272>) at ./nptl/futex-internal.c:57
  14   Thread 0x7f92e0bec640 (LWP 80097) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb760 <thread_status+2528>) at ./nptl/futex-internal.c:57
  15   Thread 0x7f92dbbea640 (LWP 80099) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb860 <thread_status+2784>) at ./nptl/futex-internal.c:57
  16   Thread 0x7f92d6be8640 (LWP 80101) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcb960 <thread_status+3040>) at ./nptl/futex-internal.c:57
  17   Thread 0x7f92d1be6640 (LWP 80103) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcba60 <thread_status+3296>) at ./nptl/futex-internal.c:57
  18   Thread 0x7f92ccbe4640 (LWP 80105) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcbb60 <thread_status+3552>) at ./nptl/futex-internal.c:57
  19   Thread 0x7f92cf3e5640 (LWP 80104) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcbae0 <thread_status+3424>) at ./nptl/futex-internal.c:57
  20   Thread 0x7f92ca3e3640 (LWP 80106) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcbbe0 <thread_status+3680>) at ./nptl/futex-internal.c:57
  21   Thread 0x7f92c7be2640 (LWP 80107) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcbc60 <thread_status+3808>) at ./nptl/futex-internal.c:57
  22   Thread 0x7f92c2be0640 (LWP 80109) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f9310dcbd60 <thread_status+4064>) at ./nptl/futex-internal.c:57
  23   Thread 0x7f92c03df640 (LWP 80110) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
--Type <RET> for more, q to quit, c to continue without paging--c
    futex_word=0x7f9310dcbde0 <thread_status+4192>) at ./nptl/futex-internal.c:57
  24   Thread 0x7f92b8bdc640 (LWP 80113) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbf60 <thread_status+4576>) at ./nptl/futex-internal.c:57
  25   Thread 0x7f92bdbde640 (LWP 80111) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbe60 <thread_status+4320>) at ./nptl/futex-internal.c:57
  26   Thread 0x7f92bb3dd640 (LWP 80112) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbee0 <thread_status+4448>) at ./nptl/futex-internal.c:57
  27   Thread 0x7f92b63db640 (LWP 80114) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbfe0 <thread_status+4704>) at ./nptl/futex-internal.c:57
  28   Thread 0x7f92b3bda640 (LWP 80115) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc060 <thread_status+4832>) at ./nptl/futex-internal.c:57
  29   Thread 0x7f92ac3d7640 (LWP 80118) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc1e0 <thread_status+5216>) at ./nptl/futex-internal.c:57
  30   Thread 0x7f92a73d5640 (LWP 80120) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc2e0 <thread_status+5472>) at ./nptl/futex-internal.c:57
  31   Thread 0x7f92a4bd4640 (LWP 80121) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc360 <thread_status+5600>) at ./nptl/futex-internal.c:57
  32   Thread 0x7f92a9bd6640 (LWP 80119) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc260 <thread_status+5344>) at ./nptl/futex-internal.c:57
  33   Thread 0x7f929fbd2640 (LWP 80123) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc460 <thread_status+5856>) at ./nptl/futex-internal.c:57
  34   Thread 0x7f92aebd8640 (LWP 80117) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc160 <thread_status+5088>) at ./nptl/futex-internal.c:57
  35   Thread 0x7f92a23d3640 (LWP 80122) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc3e0 <thread_status+5728>) at ./nptl/futex-internal.c:57
  36   Thread 0x7f929abd0640 (LWP 80125) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc560 <thread_status+6112>) at ./nptl/futex-internal.c:57
  37   Thread 0x7f92983cf640 (LWP 80126) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc5e0 <thread_status+6240>) at ./nptl/futex-internal.c:57
  38   Thread 0x7f929d3d1640 (LWP 80124) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc4e0 <thread_status+5984>) at ./nptl/futex-internal.c:57
  39   Thread 0x7f9295bce640 (LWP 80127) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc660 <thread_status+6368>) at ./nptl/futex-internal.c:57
  40   Thread 0x7f92933cd640 (LWP 80128) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc6e0 <thread_status+6496>) at ./nptl/futex-internal.c:57
  41   Thread 0x7f9290bcc640 (LWP 80129) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc760 <thread_status+6624>) at ./nptl/futex-internal.c:57
  42   Thread 0x7f928e3cb640 (LWP 80130) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc7e0 <thread_status+6752>) at ./nptl/futex-internal.c:57
  43   Thread 0x7f928bbca640 (LWP 80131) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc860 <thread_status+6880>) at ./nptl/futex-internal.c:57
  44   Thread 0x7f9286bc8640 (LWP 80133) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc960 <thread_status+7136>) at ./nptl/futex-internal.c:57
  45   Thread 0x7f92843c7640 (LWP 80134) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc9e0 <thread_status+7264>) at ./nptl/futex-internal.c:57
  46   Thread 0x7f9281bc6640 (LWP 80135) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcca60 <thread_status+7392>) at ./nptl/futex-internal.c:57
  47   Thread 0x7f927a3c3640 (LWP 80138) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccbe0 <thread_status+7776>) at ./nptl/futex-internal.c:57
  48   Thread 0x7f92893c9640 (LWP 80132) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc8e0 <thread_status+7008>) at ./nptl/futex-internal.c:57
  49   Thread 0x7f927f3c5640 (LWP 80136) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccae0 <thread_status+7520>) at ./nptl/futex-internal.c:57
  50   Thread 0x7f9308bfc640 (LWP 80081) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcaf60 <thread_status+480>) at ./nptl/futex-internal.c:57
  51   Thread 0x7f93063fb640 (LWP 80082) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcafe0 <thread_status+608>) at ./nptl/futex-internal.c:57
  52   Thread 0x7f9277bc2640 (LWP 80139) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccc60 <thread_status+7904>) at ./nptl/futex-internal.c:57
  53   Thread 0x7f9301bfa640 (LWP 80083) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb060 <thread_status+736>) at ./nptl/futex-internal.c:57
  54   Thread 0x7f92f23f3640 (LWP 80090) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb3e0 <thread_status+1632>) at ./nptl/futex-internal.c:57
  55   Thread 0x7f92f73f5640 (LWP 80088) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb2e0 <thread_status+1376>) at ./nptl/futex-internal.c:57
  56   Thread 0x7f92ed3f1640 (LWP 80092) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb4e0 <thread_status+1888>) at ./nptl/futex-internal.c:57
  57   Thread 0x7f92eabf0640 (LWP 80093) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb560 <thread_status+2016>) at ./nptl/futex-internal.c:57
  58   Thread 0x7f92e33ed640 (LWP 80096) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb6e0 <thread_status+2400>) at ./nptl/futex-internal.c:57
  59   Thread 0x7f92e83ef640 (LWP 80094) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb5e0 <thread_status+2144>) at ./nptl/futex-internal.c:57
  60   Thread 0x7f92de3eb640 (LWP 80098) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb7e0 <thread_status+2656>) at ./nptl/futex-internal.c:57
  61   Thread 0x7f92d93e9640 (LWP 80100) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb8e0 <thread_status+2912>) at ./nptl/futex-internal.c:57
  62   Thread 0x7f92d43e7640 (LWP 80102) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb9e0 <thread_status+3168>) at ./nptl/futex-internal.c:57
  63   Thread 0x7f92c53e1640 (LWP 80108) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbce0 <thread_status+3936>) at ./nptl/futex-internal.c:57
  64   Thread 0x7f92b13d9640 (LWP 80116) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc0e0 <thread_status+4960>) at ./nptl/futex-internal.c:57

  • list: show the source code (if available) around the current line.

(gdb) list
39	in ./nptl/pthread_kill.c
  • info sharedlibrary: list shared libraries loaded by the program at the time of the crash.
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007f9327413e00  0x00007f93274353c3  Yes (*)     /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
0x00007f93277b2040  0x00007f93277b2105  Yes         /lib/x86_64-linux-gnu/libpthread.so.0
0x00007f93277ad040  0x00007f93277ad105  Yes         /lib/x86_64-linux-gnu/libdl.so.2
0x00007f93277a8040  0x00007f93277a8105  Yes         /lib/x86_64-linux-gnu/libutil.so.1
0x00007f93276ce3a0  0x00007f93277498c8  Yes         /lib/x86_64-linux-gnu/libm.so.6
0x00007f9327028700  0x00007f93271ba93d  Yes         /lib/x86_64-linux-gnu/libc.so.6
0x00007f93276a5280  0x00007f93276ae5bf  Yes (*)     /lib/x86_64-linux-gnu/libunwind.so.8
0x00007f9326ca5150  0x00007f9326d95b31  Yes         /home/raix/miniconda3/envs/v_xla/bin/../lib/libstdc++.so.6
0x00007f93277c0090  0x00007f93277e9315  Yes         /lib64/ld-linux-x86-64.so.2
0x00007f9327677050  0x00007f9327693c51  Yes (*)     /home/raix/miniconda3/envs/v_xla/bin/../lib/liblzma.so.5
0x00007f932765c320  0x00007f932766d6e1  Yes         /home/raix/miniconda3/envs/v_xla/bin/../lib/libgcc_s.so.1
0x00007f932763c050  0x00007f9327643411  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/math.cpython-310-x86_64-linux-gnu.so
0x00007f9327632050  0x00007f9327633081  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/fcntl.cpython-310-x86_64-linux-gnu.so
0x00007f932762b050  0x00007f932762cf71  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_posixsubprocess.cpython-310-x86_64-linux-gnu.so
0x00007f9327621050  0x00007f93276231c1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/select.cpython-310-x86_64-linux-gnu.so
0x00007f9327290050  0x00007f932729d7d1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
0x00007f9327611000  0x00007f9327619791  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libffi.so.8
0x00007f932727e050  0x00007f9327282a01  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_struct.cpython-310-x86_64-linux-gnu.so
0x00007f93277b7050  0x00007f93277b7391  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_opcode.cpython-310-x86_64-linux-gnu.so
0x00007f9327604050  0x00007f9327607251  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/zlib.cpython-310-x86_64-linux-gnu.so
0x00007f932725f050  0x00007f9327270241  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libz.so.1
0x00007f9327256050  0x00007f9327257de1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so
0x00007f9327242050  0x00007f932724f431  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libbz2.so.1.0
0x00007f9327237050  0x00007f932723a8f1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_lzma.cpython-310-x86_64-linux-gnu.so
0x00007f932722f050  0x00007f9327230031  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_bisect.cpython-310-x86_64-linux-gnu.so
0x00007f9326efb050  0x00007f9326efcbb1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_random.cpython-310-x86_64-linux-gnu.so
0x00007f9326ef1050  0x00007f9326ef5bf1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_sha512.cpython-310-x86_64-linux-gnu.so
0x00007f9326eeb050  0x00007f9326eeb105  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so
0x00007f9325458390  0x00007f9325f531c0  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_intel_lp64.so
0x00007f9323602bf0  0x00007f9324da8feb  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_gnu_thread.so
0x00007f931f01ab00  0x00007f9322713b80  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_core.so
0x00007f9326eb1730  0x00007f9326edbec1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libgomp.so.1
0x00007f9326ea2050  0x00007f9326ea2115  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so
0x00007f931de21b40  0x00007f931e9432b8  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_python.so
0x00007f9326e9a440  0x00007f9326e9c5d3  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libshm.so
0x00007f9326e81890  0x00007f9326e8dc90  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch.so
0x00007f93125311c0  0x00007f931b914530  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
0x00007f9326523270  0x00007f93265accb4  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libc10.so
0x00007f9326e66080  0x00007f9326e66275  Yes         /lib/x86_64-linux-gnu/librt.so.1
0x00007f931102da70  0x00007f9311508663  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
0x00007f930ef18000  0x00007f9310b98c5c  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
--Type <RET> for more, q to quit, c to continue without paging--c
0x00007f930e81b870  0x00007f930ea46837  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libgfortran-040039e1.so.5.0.0
0x00007f930e4023e0  0x00007f930e425d2b  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libquadmath-96973f99.so.0.0.0
0x00007f9326e4b050  0x00007f9326e5b571  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_datetime.cpython-310-x86_64-linux-gnu.so
0x00007f9326e2a050  0x00007f9326e3bf61  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so
0x00007f9326e6c050  0x00007f9326e6c211  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_contextvars.cpython-310-x86_64-linux-gnu.so
0x00007f93250e0e70  0x00007f93250f6299  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/_multiarray_tests.cpython-310-x86_64-linux-gnu.so
0x00007f93250aec20  0x00007f93250cdda2  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/linalg/_umath_linalg.cpython-310-x86_64-linux-gnu.so
0x00007f9325091170  0x00007f93250a3adf  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/fft/_pocketfft_internal.cpython-310-x86_64-linux-gnu.so
0x00007f930ed528f0  0x00007f930edaafd4  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/mtrand.cpython-310-x86_64-linux-gnu.so
0x00007f931edcf8f0  0x00007f931edf12cf  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-x86_64-linux-gnu.so
0x00007f931ed90830  0x00007f931edc12ab  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_common.cpython-310-x86_64-linux-gnu.so
0x00007f9326e1b050  0x00007f9326e1eff1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/binascii.cpython-310-x86_64-linux-gnu.so
0x00007f9326af3050  0x00007f9326af8101  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_hashlib.cpython-310-x86_64-linux-gnu.so
0x00007f92706b9000  0x00007f927090412f  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libcrypto.so.3
0x00007f9325084050  0x00007f932508b551  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_blake2.cpython-310-x86_64-linux-gnu.so
0x00007f931dbaf840  0x00007f931dbf3fdf  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_bounded_integers.cpython-310-x86_64-linux-gnu.so
0x00007f93232e85e0  0x00007f93232f86d2  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_mt19937.cpython-310-x86_64-linux-gnu.so
0x00007f931ed77610  0x00007f931ed8502c  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_philox.cpython-310-x86_64-linux-gnu.so
0x00007f931db92600  0x00007f931dba35eb  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_pcg64.cpython-310-x86_64-linux-gnu.so
0x00007f93232d7590  0x00007f93232df1ba  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_sfc64.cpython-310-x86_64-linux-gnu.so
0x00007f930e727c90  0x00007f930e7a6f67  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_generator.cpython-310-x86_64-linux-gnu.so
0x00007f93116f8050  0x00007f93116faa21  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_heapq.cpython-310-x86_64-linux-gnu.so
0x00007f9326aeb050  0x00007f9326aeba21  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/grp.cpython-310-x86_64-linux-gnu.so
0x00007f93116ed050  0x00007f93116f2eb1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_json.cpython-310-x86_64-linux-gnu.so
0x00007f93116d9050  0x00007f93116e2f81  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/cmath.cpython-310-x86_64-linux-gnu.so
0x00007f9310eec050  0x00007f9310ef41c1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_socket.cpython-310-x86_64-linux-gnu.so
0x00007f9310ed9050  0x00007f9310edfa81  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/array.cpython-310-x86_64-linux-gnu.so
0x00007f931db8b050  0x00007f931db8be21  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_multiprocessing.cpython-310-x86_64-linux-gnu.so
0x00007f9326e15050  0x00007f9326e15231  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_uuid.cpython-310-x86_64-linux-gnu.so
0x00007f93116d0050  0x00007f93116d35c1  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libuuid.so.1
0x00007f9310eb3050  0x00007f9310ebc351  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_ssl.cpython-310-x86_64-linux-gnu.so
0x00007f930ecca050  0x00007f930ed1f993  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libssl.so.3
0x00007f926f8f1050  0x00007f926f8f4461  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/unicodedata.cpython-310-x86_64-linux-gnu.so
0x00007f9310e9c050  0x00007f9310e9ca01  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_queue.cpython-310-x86_64-linux-gnu.so
0x00007f9310e8e050  0x00007f9310e92581  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so
0x00007f92637c5480  0x00007f926c0cf9ef  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
0x00007f925ee5c050  0x00007f925f066fc1  Yes         /home/raix/miniconda3/envs/v_xla/lib/libpython3.10.so.1.0
0x00007f930e6e8040  0x00007f930e6fb97b  Yes (*)     /lib/x86_64-linux-gnu/libcrypt.so.1
0x00007f9310e85090  0x00007f9310e851b5  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/charset_normalizer/md.cpython-310-x86_64-linux-gnu.so
0x00007f930e6bb280  0x00007f930e6d5e05  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/charset_normalizer/md__mypyc.cpython-310-x86_64-linux-gnu.so
0x00007f930eca0050  0x00007f930eca56f1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_multibytecodec.cpython-310-x86_64-linux-gnu.so
0x00007f930e67d050  0x00007f930e6a36e1  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/yaml/_yaml.cpython-310-x86_64-linux-gnu.so
0x00007f930e65a040  0x00007f930e66ff45  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/yaml/../../../libyaml-0.so.2
0x00007f926e674050  0x00007f926e6c5421  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/gmpy2.cpython-310-x86_64-linux-gnu.so
0x00007f925e804a30  0x00007f925e814b53  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libmpc.so.3
0x00007f9270a4f040  0x00007f9270aaba57  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libmpfr.so.6
0x00007f925eb6d080  0x00007f925ebe4326  Yes (*)     /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libgmp.so.10
0x00007f925f1ba050  0x00007f925f1ebbf1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_decimal.cpython-310-x86_64-linux-gnu.so
0x00007f930ec97050  0x00007f930ec97ef1  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so
0x00007f930e652050  0x00007f930e653f41  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_lsprof.cpython-310-x86_64-linux-gnu.so
0x00007f925d146090  0x00007f925d1d12ef  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/safetensors/_safetensors_rust.cpython-310-x86_64-linux-gnu.so
0x00007f930e646050  0x00007f930e64a181  Yes         /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_csv.cpython-310-x86_64-linux-gnu.so
(*): Shared library is missing debugging information.

PawKanarek avatar Mar 09 '24 14:03 PawKanarek

@JackCaoG Now I created the v4-8 machine with this VM version: tpu-vm-v4-pt-2.0

gcloud compute tpus tpu-vm create myname --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-v4-pt-2.0

And now I'm getting a different message, but at least it's readable :)

python server/server.py 
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1710013726.247688   30296 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1710013726.247769   30296 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1710013726.247774   30296 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/runtime.py:247: UserWarning: Replicating tensors already initialized on non-virtual XLA device for SPMD to force SPMD mode. This is one-time overhead to setup, and to minimize such, please set SPMD mode before initializting tensors (i.e., call use_spmd() in the beginning of the program).
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.07it/s]
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/transformers/training_args.py:1827: FutureWarning: `--push_to_hub_model_id` and `--push_to_hub_organization` are deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_model_id` instead and pass the full repo name to this argument (in this case google/gemma-2-it).
  warnings.warn(
https://symbolize.stripped_domain/r/?trace=7f8ddd4d4953,7f8ea111e3bf,7f8de5b4364d,7f8dde56762d,7f8de5b46273,7f8dddde807a,7f8dddbdb4ea,7f8e90515509&map= 
*** SIGSEGV (@0x1d8), see go/stacktraces#s15 received by PID 30296 (TID 31841) on cpu 195; stack trace: ***
PC: @     0x7f8ddd4d4953  (unknown)  torch_xla::runtime::PjRtComputationClient::ExecuteReplicated()::{lambda()#1}::operator()()
    @     0x7f8d6c18c6a7        928  (unknown)
    @     0x7f8ea111e3c0       1984  (unknown)
    @     0x7f8de5b4364e         32  std::_Function_handler<>::_M_invoke()
    @     0x7f8dde56762e        288  Eigen::ThreadPoolDevice::parallelFor()
    @     0x7f8de5b46274        576  tsl::thread::ThreadPool::ParallelFor()
    @     0x7f8dddde807b       1168  torch_xla::runtime::PjRtComputationClient::ExecuteReplicated()
    @     0x7f8dddbdb4eb        624  torch_xla::XLAGraphExecutor::ScheduleSyncTensorsGraph()::{lambda()#1}::operator()()
    @     0x7f8e9051550a  (unknown)  torch::lazy::MultiWait::Complete()
    @ ... and at least 1 more frames
https://symbolize.stripped_domain/r/?trace=7f8ddd4d4953,7f8d6c18c6a6,7f8ea111e3bf,7f8de5b4364d,7f8dde56762d,7f8de5b46273,7f8dddde807a,7f8dddbdb4ea,7f8e90515509&map= 
E0309 19:48:53.091365   31841 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0309 19:48:53.091373   31841 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0309 19:48:53.091379   31841 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0309 19:48:53.091381   31841 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0309 19:48:53.091395   31841 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0309 19:48:53.091399   31841 coredump_hook.cc:598] RAW: Dumping core locally.
E0309 19:48:53.337414   31841 process_state.cc:807] RAW: Raising signal 11 with default behavior
Segmentation fault (core dumped)

PawKanarek avatar Mar 09 '24 19:03 PawKanarek

Can you follow https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#sanity-check to run a ResNet with fake data? I am not sure if it is an env setup issue or a Gemma issue in your case.

JackCaoG avatar Mar 11 '24 17:03 JackCaoG

Thanks for the advice; the sanity check looks good on this TPU.

Imports:

python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla
>>> print(torch.__version__)
2.3.0.dev20240309
>>> print(torch_xla.__version__)
2.3.0+git6043185

Simple calculation:

python3
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla.core.xla_model as xm
>>> t1 = torch.tensor(100, device=xm.xla_device())
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1710270930.199793  326792 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/raix/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1710270930.199885  326792 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1710270930.199890  326792 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
>>> t2 = torch.tensor(200, device=xm.xla_device())
>>> print(t1 + t2)
tensor(300, device='xla:0')
>>> 

ImageNet:

Epoch 18 train end 20:13:54
| Test Device=xla:0/0 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/0 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/0 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/3 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/0 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/2 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/1 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/2 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/0 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/3 Step=80 Epoch=18 Time=20:13:55
Epoch 18 test end 20:13:55, Accuracy=100.00
Max Accuracy: 100.00%

PawKanarek avatar Mar 12 '24 20:03 PawKanarek

@PawKanarek For Gemma, have you set the following env vars: PJRT_DEVICE=TPU XLA_USE_SPMD=1?

alanwaketan avatar Mar 12 '24 22:03 alanwaketan
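For reference, a minimal sketch of doing the same from inside the training script rather than via export; the variables must be set before torch_xla initializes the TPU runtime, and the UserWarning earlier in this thread also suggests calling use_spmd() at the very beginning of the program:

import os

# Must run before any torch_xla / TPU initialization.
os.environ.setdefault("PJRT_DEVICE", "TPU")
os.environ.setdefault("XLA_USE_SPMD", "1")

import torch_xla.runtime as xr

xr.use_spmd()  # enable SPMD mode before initializing any tensors, per the runtime warning above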

It seems that setting export PJRT_DEVICE=TPU and export XLA_USE_SPMD=1 resolved the issue. I was certain I had exported the variables... Training now works, though it occasionally crashes on larger datasets; no problems on smaller ones. Thanks!

PawKanarek avatar Mar 13 '24 00:03 PawKanarek

It seems that setting export PJRT_DEVICE=TPU and export XLA_USE_SPMD=1 resolved the issue. I was certain I had exported the variables... Training now works, though it occasionally crashes on larger datasets; no problems on smaller ones. Thanks!

I would love to learn more about the crash as well! Do you mind opening a new bug?

alanwaketan avatar Mar 13 '24 02:03 alanwaketan

@windmaple @PawKanarek Are we good to close this issue?

alanwaketan avatar Mar 13 '24 02:03 alanwaketan

The problem with AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh' was resolved on my machine.

PawKanarek avatar Mar 13 '24 10:03 PawKanarek