
I have a problem related to multiprocessing when running accelerate

Open · ghost opened this issue 3 years ago • 6 comments

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.29
- Python version: 3.8.5
- Numpy version: 1.19.5
- PyTorch version (GPU?): 1.8.1+cu102 (False)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: TPU
	- mixed_precision: no
	- use_cpu: False
	- num_processes: 8
	- machine_rank: 0
	- num_machines: 1
	- main_process_ip: None
	- main_process_port: None
	- main_training_function: main
	- deepspeed_config: {}
	- fsdp_config: {}

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I'm fine-tuning GPT-2 with training_with_tpu.py on a Google Cloud TPU

  • TPU-type: V2-8
  • TPU software version: V2-alpha
  • Architecture: TPU VM

Step 1: set the environment variables

export XRT_TPU_CONFIG="localservice;0;localhost:51011"
export USE_TORCH=ON
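
As a quick sanity check (my own addition, not part of the original report), you can confirm that torch_xla can actually reach the TPU before configuring and launching accelerate. If this already fails with `Couldn't open device: /dev/accel0`, the problem is with the TPU runtime or device permissions rather than with accelerate's launcher:

```python
# Hypothetical diagnostic: verify torch_xla can acquire the default XLA (TPU) device.
import torch_xla.core.xla_model as xm

device = xm.xla_device()        # raises if the XRT/TPU runtime is not reachable
print("XLA device:", device)    # e.g. "xla:1" on a healthy TPU VM
```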

Step 2: configure accelerate

accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]: main
How many TPU cores should be used for distributed training? [1]:8

Step 3: run with `accelerate launch training_with_tpu.py`

This is the error after running:

2022-07-16 00:33:20.807295: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:33:20.807349: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2022-07-16 00:34:06.664709: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.664770: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.666079: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.666133: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.666506: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.666557: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.667383: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.667428: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.668147: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.668200: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.668474: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.668531: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.668559: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.668611: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.672264: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.672326: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.749831: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750143: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750449: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750703: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750999: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.751318: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.758432: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.760135: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    exitcode = _main(fd, parent_sentinel)
    exitcode = _main(fd, parent_sentinel)
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
    self = reduction.pickle.load(from_parent)
    self = reduction.pickle.load(from_parent)
    self = reduction.pickle.load(from_parent)
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    accelerator = Accelerator()
    accelerator = Accelerator()
    accelerator = Accelerator()
    accelerator = Accelerator()
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
    self.state = AcceleratorState(
    self.state = AcceleratorState(
    self.state = AcceleratorState(
    self.state = AcceleratorState(
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self.local_process_index = xm.get_local_ordinal()
    self.local_process_index = xm.get_local_ordinal()
    self.local_process_index = xm.get_local_ordinal()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
    self.local_process_index = xm.get_local_ordinal()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    self = reduction.pickle.load(from_parent)
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    accelerator = Accelerator()
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self.state = AcceleratorState(
    self = reduction.pickle.load(from_parent)
Traceback (most recent call last):
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
  File "<string>", line 1, in <module>
    accelerator = Accelerator()
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
    self.local_process_index = xm.get_local_ordinal()
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
    self.state = AcceleratorState(
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
    self.local_process_index = xm.get_local_ordinal()
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
Traceback (most recent call last):
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
  File "<string>", line 1, in <module>
    accelerator = Accelerator()
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    self.state = AcceleratorState(
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
    self.local_process_index = xm.get_local_ordinal()
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
    accelerator = Accelerator()
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
    self.state = AcceleratorState(
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
    self.local_process_index = xm.get_local_ordinal()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
    return getattr(_get_device_context(), 'device_index', defval)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
    return getattr(_get_device_context(), 'device_index', defval)
    device = torch_xla._XLAC._xla_get_default_device()
    return getattr(_get_device_context(), 'device_index', defval)
    return getattr(_get_device_context(), 'device_index', defval)
    return getattr(_get_device_context(), 'device_index', defval)
    return getattr(_get_device_context(), 'device_index', defval)
    return getattr(_get_device_context(), 'device_index', defval)
    return getattr(_get_device_context(), 'device_index', defval)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
    device = torch_xla._XLAC._xla_get_default_device()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
    device = torch_xla._XLAC._xla_get_default_device()
    device = torch_xla._XLAC._xla_get_default_device()
    device = torch_xla._XLAC._xla_get_default_device()
    device = torch_xla._XLAC._xla_get_default_device()
    device = torch_xla._XLAC._xla_get_default_device()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
    device = torch_xla._XLAC._xla_get_default_device()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
https://symbolize.stripped_domain/r/?trace=7fb7a5bafea7,7fb7a59c620f&map=
*** SIGTERM received by PID 11887 (TID 11887) on cpu 67 from PID 10096; stack trace: ***
PC: @     0x7fb7a5bafea7  (unknown)  operator delete[]()
    @     0x7fb5329741e0        976  (unknown)
    @     0x7fb7a59c6210  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7fb7a5bafea7,7fb5329741df,7fb7a59c620f&map=ca1b7ab241ee28147b3d590cadb5dc1b:7fb525c75000-7fb532ca7b20
E0716 00:34:07.105927   11887 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
E0716 00:34:07.109834   11887 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=5aa05d,7facc137420f,90b09f&map=
*** SIGTERM received by PID 11895 (TID 11895) on cpu 93 from PID 10096; stack trace: ***
PC: @           0x5aa05d  (unknown)  PyTuple_ClearFreeList
    @     0x7faa3e4a21e0        976  (unknown)
    @     0x7facc1374210  (unknown)  (unknown)
    @           0x90b0a0  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=5aa05d,7faa3e4a21df,7facc137420f,90b09f&map=ca1b7ab241ee28147b3d590cadb5dc1b:7faa317a3000-7faa3e7d5b20
E0716 00:34:07.128988   11895 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
E0716 00:34:07.136853   11895 process_state.cc:771] RAW: Raising signal 15 with default behavior
Traceback (most recent call last):
  File "/home/manhhung/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 564, in launch_command
    tpu_launcher(args)
  File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 394, in tpu_launcher
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 388, in spawn
    return torch.multiprocessing.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 139, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with exit code 1

training_with_tpu.py

import os
import torch
import numpy as np
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.optim import AdamW
from transformers import get_scheduler
import datasets
# from text_generation.desc_dataset import DescriptionDataset
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
from accelerate import Accelerator
accelerator = Accelerator()
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)


class DescriptionDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        input_ids = self.data[item]['input_ids']
        labels = self.data[item]['labels']
        attention_mask = self.data[item]['attention_mask']
        num_tokens = self.data[item]['num_tokens']
        return input_ids, labels, attention_mask, num_tokens




tokenizer = GPT2Tokenizer.from_pretrained("gpt2",
                                          bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>',
                                          pad_token='<|pad|>')

eos_title = "<|endoftitle|>"
tokenizer.add_tokens([eos_title])

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

optimizer = AdamW(model.parameters(), lr=5e-5)


def collate(batch, pad_token_id=tokenizer.pad_token_id):
    data_batch = {
        "input_ids": [],
        "labels": [],
        "attention_mask": [],
        "num_tokens": []
    }

    for example in batch:
        data_batch['input_ids'].append(example[0])
        data_batch['labels'].append(example[1])
        data_batch['attention_mask'].append(example[2])
        data_batch['num_tokens'].append(example[3])

    list_length = data_batch['num_tokens']
    _bs = len(list_length)
    max_len = max(list_length)

    input_ids_batch = data_batch['input_ids']
    labels_batch = data_batch['labels']
    attention_mask_batch = data_batch['attention_mask']

    for index in range(_bs):
        len_padding = max_len - list_length[index]

        input_ids_batch[index] += len_padding * [pad_token_id]
        labels_batch[index] += len_padding * [-100]
        attention_mask_batch[index] += len_padding * [0]

    return {
        "input_ids": torch.tensor(input_ids_batch, dtype=torch.long),
        "labels": torch.tensor(labels_batch, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask_batch, dtype=torch.long)
    }

path_data = "/home/hungnm2/data"

train_dataset = datasets.load_from_disk(os.path.join(path_data, "train"))
train_dataset = DescriptionDataset(train_dataset)

eval_dataset = datasets.load_from_disk(os.path.join(path_data, "eval"))
eval_dataset = DescriptionDataset(eval_dataset)

batch_size = 4
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=batch_size,
                              collate_fn=collate,
                              shuffle=True)

eval_dataloader = DataLoader(dataset=eval_dataset,
                             batch_size=16,
                             collate_fn=collate,
                             shuffle=False)  # was `shuffle=batch_size`; evaluation data should not be shuffled

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(name="linear", optimizer=optimizer, num_warmup_steps=0,
                             num_training_steps=num_training_steps)
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model.to(device)

model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(model,
                                                                                        optimizer,
                                                                                        train_dataloader,
                                                                                        eval_dataloader,
                                                                                        lr_scheduler )


def save_model(model, tokenizer, output_dir):
    # Create output directory if needed
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    print("Saving model to %s" % output_dir)

    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


def eval():
    model.eval()
    eval_loss = 0
    for batch in eval_dataloader:

        with torch.no_grad():
            outputs = model(**batch)
            loss_batch = outputs[0].item()
            eval_loss += loss_batch

    eval_loss /= len(eval_dataloader)
    model.train()
    return eval_loss


def main():
    progress_bar = tqdm(range(num_training_steps))
    save_steps = int(num_training_steps / (num_epochs * 2))
    logging_steps = 1000
    eval_steps = int(num_training_steps / (num_epochs * 3))
    print("Model is running on {}".format(device))
    print("num_training_steps: ", num_training_steps)
    print("save_steps: ", save_steps)
    print("logging_steps: ", logging_steps)
    print("eval_steps: ", eval_steps)
    total_steps = 0
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for step, batch in enumerate(train_dataloader):
            total_steps += 1
            # batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            batch_loss = loss.item()
            train_loss += batch_loss
            if (total_steps + 1) % logging_steps == 0:
                print("Epoch:{} & global step: {}/{}, train loss = {}".format(epoch + 1, total_steps, num_training_steps,
                                                                       batch_loss))
            # loss.backward()
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

            if (total_steps + 1) % eval_steps == 0:
                print("evaluating...")
                eval_loss = eval()
                print(
                    "Epoch:{} & step: {}/{}, Eval loss = {}".format(epoch + 1, step, len(train_dataloader), eval_loss))

            if (total_steps + 1) % save_steps == 0:
                save_fn = "checkpoint-" + str(total_steps)
                save_model(model, tokenizer, save_fn)

        train_loss /= len(train_dataloader)
        eval_loss = eval()
        print("Done epoch:{}, Training loss = {} and eval loss = {}".format(epoch + 1, train_loss, eval_loss))


if __name__ == '__main__':
    main()
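
The tracebacks above show each spawned worker re-importing the script and failing at `accelerator = Accelerator()` (line 13, at module level) while unpickling, before the TPU context for that child process exists. A minimal restructuring sketch, my suggestion rather than a confirmed fix from the maintainers, is to keep all heavy setup inside the function that accelerate launches:

```python
# Sketch only: create the Accelerator and all model/dataloader objects inside main(),
# so nothing TPU-related executes at import time when child processes re-import the module.
from accelerate import Accelerator

def main():
    accelerator = Accelerator()  # created once per spawned process, inside the launched function
    # build tokenizer, model, optimizer, dataloaders, lr_scheduler here
    # model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    #     model, optimizer, train_dataloader, eval_dataloader, lr_scheduler)
    # ... training loop from the original script ...

if __name__ == "__main__":
    main()
```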

Expected behavior

I want my code to be able to train on the TPU.

ghost avatar Jul 16 '22 00:07 ghost

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 15 '22 15:08 github-actions[bot]

I encountered the same problem

iliemihai92 avatar Sep 24 '22 20:09 iliemihai92

I have the same problem... any solutions?

hanshounsu avatar Jan 05 '23 16:01 hanshounsu

I am having the same problem!

roansong avatar Jul 13 '23 08:07 roansong

FYI, I've observed this too, running with PJRT_DEVICE, and also tried on TPU v3 with the same issue. Worth noting that training runs fine on the TPUs when not using accelerate, so it definitely feels like an issue with accelerate's use of multiprocessing.

peter-dudbridge avatar Jul 13 '23 11:07 peter-dudbridge

Thanks for the bump, all; I will look at it this week.

muellerzr avatar Jul 13 '23 12:07 muellerzr