`RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.` when using Kaggle TPU
System Info

- `transformers` version: 4.37.0.dev0
- Platform: Linux-6.1.58+-x86_64-with-glibc2.36
- Python version: 3.10.13
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0.dev0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu121 (False)
- Tensorflow version (GPU?): 2.15.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.4 (tpu)
- Jax version: 0.4.17
- JaxLib version: 0.4.17
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Run in Kaggle with a TPU VM v3-8 accelerator:
```
!python3 \
/kaggle/input/examples/pytorch/xla_spawn.py --num_cores 8 \
/kaggle/input/examples/pytorch/text-classification/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /kaggle/working/ \
--train_file /kaggle/input/dataset/train.csv \
--validation_file /kaggle/input/dataset/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--torch_compile \
--fp16 \
--auto_find_batch_size
```
error:

```
2023-12-22 12:57:36.401695: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:36.401755: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:36.403454: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:root:Unsupported nprocs (8), ignoring...
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
warnings.warn(
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
warnings.warn(
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
warnings.warn(
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
warnings.warn(
2023-12-22 12:57:43.680522: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.680522: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.680585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.680589: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.682210: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 12:57:43.682211: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 12:57:43.727851: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.727908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.728235: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.728283: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.729554: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 12:57:43.729728: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
return [fn(*args) for args in chunk]
File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 56, in _run_thread_per_device
initializer_fn(local_rank, local_world_size)
File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 115, in initialize_multiprocess
devices = xm.get_xla_supported_devices()
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 91, in get_xla_supported_devices
xla_devices = _DEVICES.value
File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/utils.py", line 29, in value
self._value = self._gen_fn()
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
_DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 83, in <module>
main()
File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 79, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 38, in spawn
return pjrt.spawn(fn, nprocs, start_method, args)
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 202, in spawn
run_multiprocess(spawn_fn, start_method=start_method)
File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 159, in run_multiprocess
replica_results = list(
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 160, in <genexpr>
itertools.chain.from_iterable(
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
```
Expected behavior
Train without error.
cc @muellerzr
Hi! Looking at the first 2 lines in the error log (after `The above exception was the direct cause of the following exception:`), it looks like the error occurs at a very early stage, which is related to `xla_spawn.py` rather than to the modeling part.

```
File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 83, in <module>
  main()
File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 79, in main
  xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
```

I am not sure if this is relevant to `transformers` (or even `accelerate`), but let's wait for @muellerzr to get back.

(There are people having the same issue, for example here.)
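As a quick way to check whether `transformers` is involved at all, here is a minimal sketch (not from this thread) that exercises the same `xmp.spawn` code path that `xla_spawn.py` uses. If this alone raises the same `RuntimeError`, the failure is in the torch_xla/TPU runtime setup rather than in the training script:

```python
# Minimal repro sketch: spawn one process per TPU core and print its device.
# Assumes torch_xla is installed, as on a Kaggle TPU VM v3-8.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # xm.xla_device() triggers the same TPU initialization that fails above.
    print(f"process {index} sees device {xm.xla_device()}")


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())  # same call path as xla_spawn.py's main()
```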
I'd recommend using `accelerate launch` and not using `python`. We've done work to make sure that spawn still works fine. Can you try running:
```
!accelerate launch --tpu --num_processes 8 \
/kaggle/input/examples/pytorch/text-classification/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /kaggle/working/ \
--train_file /kaggle/input/dataset/train.csv \
--validation_file /kaggle/input/dataset/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--torch_compile \
--fp16 \
--auto_find_batch_size
```
But isn't it the case that `accelerate` can only be used with the no-trainer version of the scripts? Or did I misunderstand?
No, Accelerate is always used now, as Accelerate is the heart of the Trainer :)
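(As a standalone sanity check, not from the thread: because the Trainer drives its distributed setup through an `Accelerator` under the hood, you can create one directly to verify the TPU is visible before launching a full training run.)

```python
# Sketch: verify Accelerate can see the TPU before running the Trainer.
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)         # on a working TPU VM this is an XLA device
print(accelerator.num_processes)  # expected to be 8 on a v3-8
```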
and here is the error:

```
WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-12-22 16:23:10.038215: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 16:23:10.038284: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 16:23:10.040151: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1013, in launch_command
tpu_launcher(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 745, in tpu_launcher
if not hasattr(mod, args.main_training_function):
TypeError: hasattr(): attribute name must be string
```
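(Side note on the `TypeError`: `hasattr` requires its second argument to be a string, so presumably `args.main_training_function` ended up as `None` because the flag was not passed. A tiny illustration, not from the thread:)

```python
# hasattr() requires the attribute name to be a string; passing None
# reproduces the exact TypeError from the accelerate traceback above.
class Dummy:
    pass

hasattr(Dummy, None)  # TypeError: attribute name must be string
```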
> I'd recommend using `accelerate launch` and not using `python`. [...]
You need to define a `main_training_function` as part of the command, so try doing the following (and thanks for your patience!):
```
!accelerate launch --tpu --num_processes 8 \
--main_training_function main \
/kaggle/input/examples/pytorch/text-classification/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /kaggle/working/ \
--train_file /kaggle/input/dataset/train.csv \
--validation_file /kaggle/input/dataset/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--torch_compile \
--fp16 \
--auto_find_batch_size
```
```
WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1013, in launch_command
tpu_launcher(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 744, in tpu_launcher
mod = importlib.import_module(mod_name)
File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'run_classification'
```
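(For what it's worth, the `ModuleNotFoundError` comes from how `tpu_launcher` loads the script: it imports it as a Python module by name, so the import only succeeds if the script's directory is resolvable on `sys.path`. A rough sketch of the mechanism; the `sys.path` line is a hypothetical workaround, not a confirmed fix:)

```python
# tpu_launcher-style loading: the script is imported as a module, so the
# name "run_classification" must be resolvable on sys.path.
import importlib
import sys

# Hypothetical workaround: make the script's folder importable first.
sys.path.insert(0, "/kaggle/input/examples/pytorch/text-classification")

mod = importlib.import_module("run_classification")
```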
> You need to define a `main_training_function` as part of the command, so try doing the following: [...]
Thanks, I'll try and take a look at this, though it will probably not be until after the holidays.
Thanks for your help; wish you happy holidays!
I encountered the same error on Kaggle's TPU VM v3-8 today when using the lit-gpt project's example fine-tuning code. Is there any progress on this issue?
Gentle ping @muellerzr
I believe the torch XLA team is aware of this; passing it their way regardless :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.