PyTorch/XLA FSDP doesn't seem to work on TPU-v3-8 VM
System Info
- GCP TPU-v3-8 VM
- Operating System: Ubuntu 20.04.4 LTS
- Kernel: Linux 5.13.0-1027-gcp
- transformers: 4.28.0.dev0 (`pip install git+https://github.com/huggingface/transformers.git` on 03/22/2023)
- torch: 2.0.0
- torch-xla: 2.0
Who can help?
People from #21406, that is @AlexWertheim, and possibly @pacman100 and @ArthurZucker.
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The GLUE example with Trainer for TPUs without FSDP worked flawlessly on my TPU-v3-8 VM with xlm-roberta-base (because the model and batch fit properly within each core).
Now that FSDP has been integrated thanks to @AlexWertheim, I tried running facebook/xlm-roberta-xl on the same example with the additional parameters below.
```bash
python xla_spawn.py --num_cores 8 \
    run_glue.py \
    --model_name_or_path facebook/xlm-roberta-xl \
    --task_name mnli \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 4 \
    --learning_rate 2e-5 \
    --num_train_epochs 10.0 \
    --output_dir mnli_output \
    --report_to all \
    --fsdp 'shard_grad_op' \
    --fsdp_config '../fstp_config.json' \
    --debug 'tpu_metrics_debug' \
    --logging_steps 100 \
    --gradient_accumulation_steps 8
```
`fstp_config.json`:
```json
{
  "fsdp_min_num_params": 10000000,
  "xla": true,
  "xla_fsdp_settings": {}
}
```
I also tried using `"fsdp_transformer_layer_cls_to_wrap": ["XLMRobertaXLModel", "XLMRobertaXLClassificationHead"]` instead of `"fsdp_min_num_params": 10000000`.
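For reference, that variant of `fstp_config.json` would look roughly like this (my reconstruction from the options listed above, not a verbatim copy of the file I ran):
```json
{
  "fsdp_transformer_layer_cls_to_wrap": ["XLMRobertaXLModel", "XLMRobertaXLClassificationHead"],
  "xla": true,
  "xla_fsdp_settings": {}
}
```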
I also tried `full_shard` instead of `shard_grad_op`, and some other variations, but they all give me the following error:
```
0%| | 1/3068000 [08:09<416756:07:35, 489.02s/it]2023-03-23 02:02:19.905715: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:22.081681: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-23 02:02:22.081762: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-23 02:02:22.081770: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-03-23 02:02:22.081777: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-23 02:02:22.081783: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-23 02:02:22.081790: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch_xla::XlaBackendImpl::ExecuteComputation(std::shared_ptr<torch::lazy::Computation>, c10::ArrayRef<std::shared_ptr<torch::lazy::BackendData> >, torch::lazy::BackendDevice const&) const
2023-03-23 02:02:22.081809: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081818: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch::lazy::MultiWait::Complete(std::function<void ()> const&)
2023-03-23 02:02:22.081825: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081831: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081836: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081842: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-03-23 02:02:22.081847: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-23 02:02:22.081854: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081862: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2023-03-23 02:02:22.081870: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-23 02:02:22.081878: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:22.081891: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:22.081898: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:22.081905: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081911: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G10]]
2023-03-23 02:02:22.081920: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:22.081928: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081937: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:22.081944: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:22.081951: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:22.081959: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081967: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-23 02:02:22.081975: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-23 02:02:22.081983: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-23 02:02:22.081989: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Exception in device=TPU:1: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G10]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
Traceback (most recent call last):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 328, in _start_fn
fn(gindex, *args)
File "/datadrive/test/run_glue.py", line 622, in _mp_fn
main()
File "/datadrive/test/run_glue.py", line 534, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1644, in train
return inner_training_loop(
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1881, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 30, in __next__
return self.next()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 42, in next
xm.mark_step()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 949, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G10]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:23.050198: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
https://symbolize.stripped_domain/r/?trace=7f7627be9376,7f7627bee41f,0&map=
*** SIGTERM received by PID 89268 (TID 89268) on cpu 51 from PID 89123; stack trace: ***
PC: @ 0x7f7627be9376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f74d8c2aa1a 1152 (unknown)
@ 0x7f7627bee420 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f7627be9376,7f74d8c2aa19,7f7627bee41f,0&map=ceee8fa20ddf9c34af43f587221e91de:7f74cbd02000-7f74d8e41840
E0323 02:02:23.479201 89268 coredump_hook.cc:360] RAW: Remote crash gathering disabled for SIGTERM.
E0323 02:02:24.172933 89268 process_state.cc:784] RAW: Raising signal 15 with default behavior
2023-03-23 02:02:25.056856: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-23 02:02:25.056942: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-23 02:02:25.056952: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-03-23 02:02:25.056959: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-23 02:02:25.056967: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-23 02:02:25.056976: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch_xla::XlaBackendImpl::ExecuteComputation(std::shared_ptr<torch::lazy::Computation>, c10::ArrayRef<std::shared_ptr<torch::lazy::BackendData> >, torch::lazy::BackendDevice const&) const
2023-03-23 02:02:25.056984: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.056997: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch::lazy::MultiWait::Complete(std::function<void ()> const&)
2023-03-23 02:02:25.057005: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057011: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057018: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057025: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-03-23 02:02:25.057033: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-23 02:02:25.057041: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057050: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2023-03-23 02:02:25.057058: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-23 02:02:25.057067: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:25.057075: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:25.057085: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:25.057094: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057102: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G12]]
2023-03-23 02:02:25.057111: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:25.057135: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057143: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:25.057151: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:25.057160: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:25.057168: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057176: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-23 02:02:25.057186: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-23 02:02:25.057194: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-23 02:02:25.057202: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:25.057209: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Exception in device=TPU:6: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G12]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
Traceback (most recent call last):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 328, in _start_fn
fn(gindex, *args)
File "/datadrive/test/run_glue.py", line 622, in _mp_fn
main()
File "/datadrive/test/run_glue.py", line 534, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1644, in train
return inner_training_loop(
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1881, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 30, in __next__
return self.next()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 42, in next
xm.mark_step()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 949, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G12]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:29.834867: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.834650343","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835007: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.834795697","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835038: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834893793","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835095: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834956775","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835197: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.835008010","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835206: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834976683","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835408: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.835235487","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835456: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834964014","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835480: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.835338354","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835540: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834899794","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835614: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834992684","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835687: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.835345000","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835752: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.835176851","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
File "xla_spawn.py", line 83, in <module>
main()
File "xla_spawn.py", line 79, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 397, in spawn
result = torch.multiprocessing.start_processes(
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 17
/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```
Expected behavior
From my understanding, the model was supposed to be sharded and loaded across the TPU cores (along with whatever else `full_shard` entails), but that doesn't seem to be happening.
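For context, here is a minimal sketch of the sharding I expected, written directly against torch_xla 2.0's `XlaFullyShardedDataParallel` rather than the Trainer's internal code path; the batch construction and hyperparameters here are illustrative only:
```python
# Sketch: wrap the model with torch_xla's FSDP so its parameters are sharded
# across the 8 TPU cores instead of being fully replicated on each one.
# This mirrors what I expected `--fsdp` plus `"xla": true` to do, not the
# exact Trainer implementation.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def _mp_fn(index):
    device = xm.xla_device()

    model = AutoModelForSequenceClassification.from_pretrained(
        "facebook/xlm-roberta-xl", num_labels=3  # MNLI has 3 labels
    )
    # Wrapping shards parameters and gradients across cores. Conceptually,
    # `full_shard` also re-shards parameters after the forward pass, while
    # `shard_grad_op` keeps them gathered between forward and backward.
    model = FSDP(model.to(device))

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # A single illustrative MNLI-style batch (premise/hypothesis pair).
    tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
    enc = tokenizer(
        ["A soccer game with multiple males playing."],
        ["Some men are playing a sport."],
        padding="max_length", max_length=128, truncation=True,
        return_tensors="pt",
    )
    batch = {k: v.to(device) for k, v in enc.items()}
    batch["labels"] = torch.tensor([0], device=device)

    outputs = model(**batch)
    outputs.loss.backward()
    # Gradients are already reduce-scattered inside the FSDP backward pass,
    # so a plain optimizer.step() is used instead of xm.optimizer_step().
    optimizer.step()
    optimizer.zero_grad()
    xm.mark_step()


if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```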
I still think this needs to be addressed.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.