
```RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.``` when using Kaggle TPU

Open · yongjer opened this issue on Dec 22, 2023 · 14 comments

System Info

  • transformers version: 4.37.0.dev0
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.36
  • Python version: 3.10.13
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.4 (tpu)
  • Jax version: 0.4.17
  • JaxLib version: 0.4.17
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Run in Kaggle with the TPU VM v3-8 accelerator:

!python3 \
/kaggle/input/examples/pytorch/xla_spawn.py --num_cores 8 \
/kaggle/input/examples/pytorch/text-classification/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /kaggle/working/ \
--train_file /kaggle/input/dataset/train.csv \
--validation_file /kaggle/input/dataset/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--torch_compile \
--fp16 \
--auto_find_batch_size
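
For context, xla_spawn.py is a thin launcher that imports the training script and forks one process per TPU core through torch_xla's multiprocessing API; a simplified sketch of that pattern (not the actual example script) is shown below.

```python
# Simplified sketch of what xla_spawn.py does (not the actual example script):
# it imports the target training script and spawns one process per TPU core,
# each of which runs the script's `_mp_fn` entry point.
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # In the real example, _mp_fn would call the training script's main()
    print(f"XLA process {index} started")


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```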

error:

2023-12-22 12:57:36.401695: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:36.401755: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:36.403454: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:root:Unsupported nprocs (8), ignoring...
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/jax/_src/cloud_tpu_init.py:75: UserWarning: JAX_USE_PJRT_C_API_ON_TPU no longer has an effect (the new TPU runtime is always enabled now). Unset the environment variable to disable this warning.
  warnings.warn(
2023-12-22 12:57:43.680522: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.680522: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.680585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.680589: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.682210: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 12:57:43.682211: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 12:57:43.727851: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.727908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.728235: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 12:57:43.728283: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 12:57:43.729554: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 12:57:43.729728: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 56, in _run_thread_per_device
    initializer_fn(local_rank, local_world_size)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 115, in initialize_multiprocess
    devices = xm.get_xla_supported_devices()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 91, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/utils.py", line 29, in value
    self._value = self._gen_fn()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 83, in <module>
    main()
  File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 79, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 38, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 202, in spawn
    run_multiprocess(spawn_fn, start_method=start_method)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 82, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 159, in run_multiprocess
    replica_results = list(
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 160, in <genexpr>
    itertools.chain.from_iterable(
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.

Expected behavior

Training runs without error.

yongjer · Dec 22, 2023

cc @muellerzr

amyeroberts · Dec 22, 2023

Hi! Looking at the first 2 lines in the error log (after `The above exception was the direct cause of the following exception:`), it looks like the error occurs at a very early stage and comes from xla_spawn.py rather than from the modeling code.

  File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 83, in <module>
    main()
  File "/kaggle/input/4-36-2/examples/pytorch/xla_spawn.py", line 79, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)

I am not sure whether this is related to transformers (or even accelerate), but let's wait for @muellerzr to get back.

(other people are hitting the same issue, for example here)
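
A quick way to narrow this down is to check whether the TPU runtime initializes at all in a single process, outside of xla_spawn.py; a minimal sketch (setting PJRT_DEVICE=TPU is an assumption about the Kaggle TPU VM environment):

```python
# Minimal sketch to check whether the TPU runtime initializes at all in a
# single process, outside of xla_spawn.py. Setting PJRT_DEVICE=TPU is an
# assumption about the Kaggle TPU VM environment.
import os

os.environ.setdefault("PJRT_DEVICE", "TPU")

import torch_xla.core.xla_model as xm

devices = xm.get_xla_supported_devices()
print(f"PJRT sees {len(devices)} XLA device(s): {devices}")
```

If even this single-process check fails with the same "Expected 4 worker addresses, got 1" message, the problem is in the TPU runtime initialization itself rather than in transformers.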

ydshieh · Dec 22, 2023

I'd recommend using `accelerate launch` rather than calling `python` directly. We've done work to make sure that spawn should still work fine. Can you try running:

!accelerate launch --tpu --num_processes 8 \
/kaggle/input/examples/pytorch/text-classification/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /kaggle/working/ \
--train_file /kaggle/input/dataset/train.csv \
--validation_file /kaggle/input/dataset/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--torch_compile \
--fp16 \
--auto_find_batch_size

muellerzr · Dec 22, 2023

But isn't accelerate only usable with the no-trainer version of the scripts? Or did I misunderstand?

yongjer · Dec 22, 2023

No, Accelerate is always used now, as it's the heart of the Trainer :)

muellerzr · Dec 22, 2023

And here is the error:

WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-12-22 16:23:10.038215: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 16:23:10.038284: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 16:23:10.040151: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1013, in launch_command
    tpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 745, in tpu_launcher
    if not hasattr(mod, args.main_training_function):
TypeError: hasattr(): attribute name must be string

yongjer · Dec 22, 2023

You need to define a `main_training_function` as part of the command, so try the following:

(And thanks for your patience!)

!accelerate launch --tpu --num_processes 8 \
--main_training_function main \
/kaggle/input/examples/pytorch/text-classification/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /kaggle/working/ \
--train_file /kaggle/input/dataset/train.csv \
--validation_file /kaggle/input/dataset/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--torch_compile \
--fp16 \
--auto_find_batch_size
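
For context on why this flag matters: as the traceback above shows, the TPU launcher imports the training script as a module and then checks for the function named by `--main_training_function` on it, roughly along these lines (a simplified sketch, not the exact Accelerate source):

```python
# Simplified sketch (not the exact Accelerate source) of how the TPU launcher
# resolves the training entry point: the script is imported as a module, so
# its directory must be importable, and the name passed via
# --main_training_function must exist on that module.
import importlib
import sys
from pathlib import Path

script = Path("/kaggle/input/examples/pytorch/text-classification/run_classification.py")

sys.path.insert(0, str(script.parent))      # make `run_classification` importable
mod = importlib.import_module(script.stem)  # raises ModuleNotFoundError if it is not
if not hasattr(mod, "main"):                # "main" == --main_training_function
    raise ValueError("main_training_function not found in the training script")
main_fn = getattr(mod, "main")              # the launcher then spawns this function
```

Note that the import step also requires the script's directory to be importable as a module.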

muellerzr · Dec 22, 2023

WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1013, in launch_command
    tpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 744, in tpu_launcher
    mod = importlib.import_module(mod_name)
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'run_classification'

yongjer · Dec 22, 2023

Thanks, I'll try to take a look at this, though it probably won't be until after the holidays.

muellerzr · Dec 22, 2023

Thanks for your help, and happy holidays!

yongjer · Dec 22, 2023

I encountered the same error on Kaggle's TPU VM v3-8 today when using the lit-gpt project's example finetuning code. Is there any progress on this issue?

IvoryTower800 · Jan 7, 2024

Gentle ping @muellerzr

amyeroberts · Feb 26, 2024

I believe the torch XLA team is aware of this; passing it their way regardless :)

muellerzr · Feb 26, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Mar 22, 2024