PyTorch/XLA FSDP doesn't seem to work on TPU-v3-8 VM
System Info
- GCP TPU-v3-8 VM
- Operating System: Ubuntu 20.04.4 LTS
- Kernel: Linux 5.13.0-1027-gcp
- transformers: 4.28.0.dev0 (`pip install git+https://github.com/huggingface/transformers.git` on 03/22/2023)
- torch: 2.0.0
- torch-xla: 2.0
Who can help?
People from #21406, that is @AlexWertheim, and possibly @pacman100 and @ArthurZucker.
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The GLUE example with Trainer for TPUs without FSDP worked flawlessly on my TPU-v3-8 VM with xlm-roberta-base (because the model and batch fit properly within each core).
Now that FSDP has been integrated thanks to @AlexWertheim, I tried running facebook/xlm-roberta-xl on the same example with the additional parameters below.
```bash
python xla_spawn.py --num_cores 8 \
    run_glue.py \
    --model_name_or_path facebook/xlm-roberta-xl \
    --task_name mnli \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 4 \
    --learning_rate 2e-5 \
    --num_train_epochs 10.0 \
    --output_dir mnli_output \
    --report_to all \
    --fsdp 'shard_grad_op' \
    --fsdp_config '../fstp_config.json' \
    --debug 'tpu_metrics_debug' \
    --logging_steps 100 \
    --gradient_accumulation_steps 8
```
`fstp_config.json`:
```json
{
  "fsdp_min_num_params": 10000000,
  "xla": true,
  "xla_fsdp_settings": {}
}
```
I also tried using `"fsdp_transformer_layer_cls_to_wrap": ["XLMRobertaXLModel", "XLMRobertaXLClassificationHead"]` instead of `"fsdp_min_num_params": 10000000`.
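For reference, that variant of `fstp_config.json` would look roughly like this (my reconstruction from the options listed above, not a verbatim copy of the file I ran):
```json
{
  "fsdp_transformer_layer_cls_to_wrap": ["XLMRobertaXLModel", "XLMRobertaXLClassificationHead"],
  "xla": true,
  "xla_fsdp_settings": {}
}
```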
I also tried `full_shard` instead of `shard_grad_op`, and some other variations, but they all give me the following error:
```
0%| | 1/3068000 [08:09<416756:07:35, 489.02s/it]2023-03-23 02:02:19.905715: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:22.081681: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-23 02:02:22.081762: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-23 02:02:22.081770: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-03-23 02:02:22.081777: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-23 02:02:22.081783: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-23 02:02:22.081790: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch_xla::XlaBackendImpl::ExecuteComputation(std::shared_ptr<torch::lazy::Computation>, c10::ArrayRef<std::shared_ptr<torch::lazy::BackendData> >, torch::lazy::BackendDevice const&) const
2023-03-23 02:02:22.081809: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081818: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch::lazy::MultiWait::Complete(std::function<void ()> const&)
2023-03-23 02:02:22.081825: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081831: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081836: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081842: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-03-23 02:02:22.081847: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-23 02:02:22.081854: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081862: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2023-03-23 02:02:22.081870: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-23 02:02:22.081878: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:22.081891: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:22.081898: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:22.081905: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081911: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G10]]
2023-03-23 02:02:22.081920: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:22.081928: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081937: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:22.081944: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:22.081951: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:22.081959: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:22.081967: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-23 02:02:22.081975: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-23 02:02:22.081983: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-23 02:02:22.081989: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Exception in device=TPU:1: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G10]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
Traceback (most recent call last):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 328, in _start_fn
fn(gindex, *args)
File "/datadrive/test/run_glue.py", line 622, in _mp_fn
main()
File "/datadrive/test/run_glue.py", line 534, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1644, in train
return inner_training_loop(
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1881, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 30, in __next__
return self.next()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 42, in next
xm.mark_step()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 949, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G10]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:23.050198: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
https://symbolize.stripped_domain/r/?trace=7f7627be9376,7f7627bee41f,0&map=
*** SIGTERM received by PID 89268 (TID 89268) on cpu 51 from PID 89123; stack trace: ***
PC: @ 0x7f7627be9376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f74d8c2aa1a 1152 (unknown)
@ 0x7f7627bee420 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f7627be9376,7f74d8c2aa19,7f7627bee41f,0&map=ceee8fa20ddf9c34af43f587221e91de:7f74cbd02000-7f74d8e41840
E0323 02:02:23.479201 89268 coredump_hook.cc:360] RAW: Remote crash gathering disabled for SIGTERM.
E0323 02:02:24.172933 89268 process_state.cc:784] RAW: Raising signal 15 with default behavior
2023-03-23 02:02:25.056856: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-23 02:02:25.056942: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-23 02:02:25.056952: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-03-23 02:02:25.056959: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-23 02:02:25.056967: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-23 02:02:25.056976: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch_xla::XlaBackendImpl::ExecuteComputation(std::shared_ptr<torch::lazy::Computation>, c10::ArrayRef<std::shared_ptr<torch::lazy::BackendData> >, torch::lazy::BackendDevice const&) const
2023-03-23 02:02:25.056984: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.056997: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] torch::lazy::MultiWait::Complete(std::function<void ()> const&)
2023-03-23 02:02:25.057005: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057011: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057018: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057025: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-03-23 02:02:25.057033: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-23 02:02:25.057041: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057050: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2023-03-23 02:02:25.057058: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-23 02:02:25.057067: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:25.057075: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:25.057085: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:25.057094: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057102: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G12]]
2023-03-23 02:02:25.057111: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:25.057135: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057143: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:25.057151: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-03-23 02:02:25.057160: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2023-03-23 02:02:25.057168: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-03-23 02:02:25.057176: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-23 02:02:25.057186: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-23 02:02:25.057194: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-23 02:02:25.057202: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:25.057209: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Exception in device=TPU:6: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G12]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
Traceback (most recent call last):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 328, in _start_fn
fn(gindex, *args)
File "/datadrive/test/run_glue.py", line 622, in _mp_fn
main()
File "/datadrive/test/run_glue.py", line 534, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1644, in train
return inner_training_loop(
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/transformers/trainer.py", line 1881, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 30, in __next__
return self.next()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 42, in next
xm.mark_step()
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 949, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[XRTExecute_G12]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
[[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 10.51G at the bottom of memory. That was not possible. There are 8.97G free, 0B reserved, and 8.97G reservable.
2023-03-23 02:02:29.834867: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.834650343","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835007: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.834795697","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835038: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834893793","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835095: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834956775","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835197: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.835008010","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835206: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834976683","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835408: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.835235487","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835456: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834964014","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835480: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1679536949.835338354","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835540: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834899794","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835614: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.834992684","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835687: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.835345000","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2023-03-23 02:02:29.835752: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1679536949.835176851","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
File "xla_spawn.py", line 83, in <module>
main()
File "xla_spawn.py", line 79, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 397, in spawn
result = torch.multiprocessing.start_processes(
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 17
/home/vitor_jeronymo/miniconda3/envs/torch-xla/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```
Expected behavior
From my understanding, the model was supposed to be sharded and loaded across the TPU cores (along with whatever else `full_shard` entails), but that doesn't seem to be happening.
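For context, here is a minimal sketch of the sharding I expected, written directly against torch_xla 2.0's `XlaFullyShardedDataParallel` rather than the Trainer's internal code path; the batch construction and hyperparameters here are illustrative only:
```python
# Sketch: wrap the model with torch_xla's FSDP so its parameters are sharded
# across the 8 TPU cores instead of being fully replicated on each one.
# This mirrors what I expected `--fsdp` plus `"xla": true` to do, not the
# exact Trainer implementation.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def _mp_fn(index):
    device = xm.xla_device()

    model = AutoModelForSequenceClassification.from_pretrained(
        "facebook/xlm-roberta-xl", num_labels=3  # MNLI has 3 labels
    )
    # Wrapping shards parameters and gradients across cores. Conceptually,
    # `full_shard` also re-shards parameters after the forward pass, while
    # `shard_grad_op` keeps them gathered between forward and backward.
    model = FSDP(model.to(device))

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # A single illustrative MNLI-style batch (premise/hypothesis pair).
    tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
    enc = tokenizer(
        ["A soccer game with multiple males playing."],
        ["Some men are playing a sport."],
        padding="max_length", max_length=128, truncation=True,
        return_tensors="pt",
    )
    batch = {k: v.to(device) for k, v in enc.items()}
    batch["labels"] = torch.tensor([0], device=device)

    outputs = model(**batch)
    outputs.loss.backward()
    # Gradients are already reduce-scattered inside the FSDP backward pass,
    # so a plain optimizer.step() is used instead of xm.optimizer_step().
    optimizer.step()
    optimizer.zero_grad()
    xm.mark_step()


if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```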
I still think this needs to be addressed.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.