accelerate
I have a problem related to multiprocessing when running Accelerate
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.29
- Python version: 3.8.5
- Numpy version: 1.19.5
- PyTorch version (GPU?): 1.8.1+cu102 (False)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: TPU
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
I'm fine-tuning GPT-2 with training_with_tpu.py on a Google Cloud TPU:
- TPU type: v2-8
- TPU software version: v2-alpha
- Architecture: TPU VM
Step 1: set the environment variables
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
export USE_TORCH=ON
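Before launching, a quick sanity check can confirm that torch_xla actually sees the TPU. This is a minimal sketch (assuming the torch_xla 1.8 xla_model API, not part of the original report):

# tpu_check.py - hypothetical sanity-check script, not part of the original report
import os
import torch_xla.core.xla_model as xm

assert "XRT_TPU_CONFIG" in os.environ, "XRT_TPU_CONFIG is not set"
# Should list eight XLA devices on a v2-8, e.g. ['xla:0', ..., 'xla:7']
print(xm.get_xla_supported_devices())
# Allocates the default XLA device; fails if /dev/accel* cannot be opened
print(xm.xla_device())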
Step 2: configure Accelerate
accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]: main
How many TPU cores should be used for distributed training? [1]:8
Step 3: launch with Accelerate
accelerate launch training_with_tpu.py
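For reference, the traceback below shows that on TPU, accelerate launch hands the script to torch_xla's multiprocessing. A simplified sketch of that path (reconstructed from the tpu_launcher frame in the traceback, not the actual Accelerate source; the import location is an assumption):

# Simplified sketch of Accelerate's TPU launch path, based on the tpu_launcher
# frame in the traceback below; names and import location assumed.
import torch_xla.distributed.xla_multiprocessing as xmp
from accelerate.utils import PrepareForLaunch

def tpu_launch(main_function, num_processes=8):
    # xmp.spawn starts `num_processes` child processes; each one re-imports the
    # training script, so module-level code such as `Accelerator()` runs again
    # in every child before main() is ever called.
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=num_processes)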
This is the error after running:
2022-07-16 00:33:20.807295: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:33:20.807349: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2022-07-16 00:34:06.664709: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.664770: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.666079: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.666133: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.666506: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.666557: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.667383: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.667428: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.668147: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.668200: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.668474: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.668531: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.668559: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.668611: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.672264: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-07-16 00:34:06.672326: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-07-16 00:34:06.749831: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750143: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750449: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750703: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.750999: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.751318: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.758432: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
2022-07-16 00:34:06.760135: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:533] Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <module>
File "<string>", line 1, in <module>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
Traceback (most recent call last):
File "<string>", line 1, in <module>
exitcode = _main(fd, parent_sentinel)
exitcode = _main(fd, parent_sentinel)
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
self = reduction.pickle.load(from_parent)
self = reduction.pickle.load(from_parent)
self = reduction.pickle.load(from_parent)
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
accelerator = Accelerator()
accelerator = Accelerator()
accelerator = Accelerator()
accelerator = Accelerator()
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
self.state = AcceleratorState(
self.state = AcceleratorState(
self.state = AcceleratorState(
self.state = AcceleratorState(
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self.local_process_index = xm.get_local_ordinal()
self.local_process_index = xm.get_local_ordinal()
self.local_process_index = xm.get_local_ordinal()
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
self.local_process_index = xm.get_local_ordinal()
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
Traceback (most recent call last):
File "<string>", line 1, in <module>
self = reduction.pickle.load(from_parent)
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
accelerator = Accelerator()
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self.state = AcceleratorState(
self = reduction.pickle.load(from_parent)
Traceback (most recent call last):
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
File "<string>", line 1, in <module>
accelerator = Accelerator()
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
self.local_process_index = xm.get_local_ordinal()
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
self.state = AcceleratorState(
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
self.local_process_index = xm.get_local_ordinal()
exitcode = _main(fd, parent_sentinel)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
Traceback (most recent call last):
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
File "<string>", line 1, in <module>
accelerator = Accelerator()
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
self.state = AcceleratorState(
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
self.local_process_index = xm.get_local_ordinal()
exitcode = _main(fd, parent_sentinel)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/home/manhhung/hat/text_generation/training_with_accelerate.py", line 13, in <module>
accelerator = Accelerator()
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 224, in __init__
self.state = AcceleratorState(
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/state.py", line 92, in __init__
self.local_process_index = xm.get_local_ordinal()
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 193, in get_local_ordinal
return getattr(_get_device_context(), 'device_index', defval)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
return getattr(_get_device_context(), 'device_index', defval)
device = torch_xla._XLAC._xla_get_default_device()
return getattr(_get_device_context(), 'device_index', defval)
return getattr(_get_device_context(), 'device_index', defval)
return getattr(_get_device_context(), 'device_index', defval)
return getattr(_get_device_context(), 'device_index', defval)
return getattr(_get_device_context(), 'device_index', defval)
return getattr(_get_device_context(), 'device_index', defval)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 42, in _get_device_context
device = torch_xla._XLAC._xla_get_default_device()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
device = torch_xla._XLAC._xla_get_default_device()
device = torch_xla._XLAC._xla_get_default_device()
device = torch_xla._XLAC._xla_get_default_device()
device = torch_xla._XLAC._xla_get_default_device()
device = torch_xla._XLAC._xla_get_default_device()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
device = torch_xla._XLAC._xla_get_default_device()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Invalid argument: Invalid fd: -1; Couldn't open device: /dev/accel0 (Operation not permitted); Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance vs. OK)
https://symbolize.stripped_domain/r/?trace=7fb7a5bafea7,7fb7a59c620f&map=
*** SIGTERM received by PID 11887 (TID 11887) on cpu 67 from PID 10096; stack trace: ***
PC: @ 0x7fb7a5bafea7 (unknown) operator delete[]()
@ 0x7fb5329741e0 976 (unknown)
@ 0x7fb7a59c6210 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7fb7a5bafea7,7fb5329741df,7fb7a59c620f&map=ca1b7ab241ee28147b3d590cadb5dc1b:7fb525c75000-7fb532ca7b20
E0716 00:34:07.105927 11887 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
E0716 00:34:07.109834 11887 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=5aa05d,7facc137420f,90b09f&map=
*** SIGTERM received by PID 11895 (TID 11895) on cpu 93 from PID 10096; stack trace: ***
PC: @ 0x5aa05d (unknown) PyTuple_ClearFreeList
@ 0x7faa3e4a21e0 976 (unknown)
@ 0x7facc1374210 (unknown) (unknown)
@ 0x90b0a0 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=5aa05d,7faa3e4a21df,7facc137420f,90b09f&map=ca1b7ab241ee28147b3d590cadb5dc1b:7faa317a3000-7faa3e7d5b20
E0716 00:34:07.128988 11895 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
E0716 00:34:07.136853 11895 process_state.cc:771] RAW: Raising signal 15 with default behavior
Traceback (most recent call last):
File "/home/manhhung/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 564, in launch_command
tpu_launcher(args)
File "/home/manhhung/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 394, in tpu_launcher
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 388, in spawn
return torch.multiprocessing.start_processes(
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 139, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with exit code 1
training_with_tpu.py
import os
import torch
import numpy as np
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.optim import AdamW
from transformers import get_scheduler
import datasets
# from text_generation.desc_dataset import DescriptionDataset
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
from accelerate import Accelerator
accelerator = Accelerator()
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
class DescriptionDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        input_ids = self.data[item]['input_ids']
        labels = self.data[item]['labels']
        attention_mask = self.data[item]['attention_mask']
        num_tokens = self.data[item]['num_tokens']
        return input_ids, labels, attention_mask, num_tokens
tokenizer = GPT2Tokenizer.from_pretrained("gpt2",
bos_token='<|startoftext|>',
eos_token='<|endoftext|>',
pad_token='<|pad|>')
eos_title = "<|endoftitle|>"
tokenizer.add_tokens([eos_title])
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))
optimizer = AdamW(model.parameters(), lr=5e-5)
def collate(batch, pad_token_id=tokenizer.pad_token_id):
    data_batch = {
        "input_ids": [],
        "labels": [],
        "attention_mask": [],
        "num_tokens": []
    }
    for example in batch:
        data_batch['input_ids'].append(example[0])
        data_batch['labels'].append(example[1])
        data_batch['attention_mask'].append(example[2])
        data_batch['num_tokens'].append(example[3])
    list_length = data_batch['num_tokens']
    _bs = len(list_length)
    max_len = max(list_length)
    input_ids_batch = data_batch['input_ids']
    labels_batch = data_batch['labels']
    attention_mask_batch = data_batch['attention_mask']
    for index in range(_bs):
        len_padding = max_len - list_length[index]
        input_ids_batch[index] += len_padding * [pad_token_id]
        labels_batch[index] += len_padding * [-100]
        attention_mask_batch[index] += len_padding * [0]
    return {
        "input_ids": torch.tensor(input_ids_batch, dtype=torch.long),
        "labels": torch.tensor(labels_batch, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask_batch, dtype=torch.long)
    }
path_data = "/home/hungnm2/data"
train_dataset = datasets.load_from_disk(os.path.join(path_data, "train"))
train_dataset = DescriptionDataset(train_dataset)
eval_dataset = datasets.load_from_disk(os.path.join(path_data, "eval"))
eval_dataset = DescriptionDataset(eval_dataset)
batch_size = 4
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=batch_size,
                              collate_fn=collate,
                              shuffle=True)
eval_dataloader = DataLoader(dataset=eval_dataset,
                             batch_size=16,
                             collate_fn=collate,
                             shuffle=False)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(name="linear", optimizer=optimizer, num_warmup_steps=0,
num_training_steps=num_training_steps)
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model.to(device)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
def save_model(model, tokenizer, output_dir):
    # Create output directory if needed
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print("Saving model to %s" % output_dir)
    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
def eval():
    model.eval()
    eval_loss = 0
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        loss_batch = outputs[0].item()
        eval_loss += loss_batch
    eval_loss /= len(eval_dataloader)
    model.train()
    return eval_loss
def main():
    progress_bar = tqdm(range(num_training_steps))
    save_steps = int(num_training_steps / (num_epochs * 2))
    logging_steps = 1000
    eval_steps = int(num_training_steps / (num_epochs * 3))
    print("Model is running on {}".format(accelerator.device))
    print("num_training_steps: ", num_training_steps)
    print("save_steps: ", save_steps)
    print("logging_steps: ", logging_steps)
    print("eval_steps: ", eval_steps)
    total_steps = 0
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for step, batch in enumerate(train_dataloader):
            total_steps += 1
            # batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            batch_loss = loss.item()
            train_loss += batch_loss
            if (total_steps + 1) % logging_steps == 0:
                print("Epoch:{} & global step: {}/{}, train loss = {}".format(epoch + 1, total_steps,
                                                                              num_training_steps, batch_loss))
            # loss.backward()
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            if (total_steps + 1) % eval_steps == 0:
                print("evaluating...")
                eval_loss = eval()
                print("Epoch:{} & step: {}/{}, Eval loss = {}".format(epoch + 1, step,
                                                                      len(train_dataloader), eval_loss))
            if (total_steps + 1) % save_steps == 0:
                save_fn = "checkpoint-" + str(total_steps)
                save_model(model, tokenizer, save_fn)
            progress_bar.update(1)
        train_loss /= len(train_dataloader)
        eval_loss = eval()
        print("Done epoch:{}, Training loss = {} and eval loss = {}".format(epoch + 1, train_loss, eval_loss))


if __name__ == '__main__':
    main()
Expected behavior
I want my code to be able to train on TPU.
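For comparison, Accelerate's TPU examples keep the Accelerator and everything it prepares inside the launched function rather than at module level. A minimal sketch of that structure follows (an illustration only, with placeholder build_model/build_dataset helpers, not a confirmed fix for the crash above):

# Minimal sketch: Accelerator and training objects created inside main(), as in
# Accelerate's TPU examples. build_model()/build_dataset() are placeholders.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

def main():
    accelerator = Accelerator()  # one instance per spawned process

    model = build_model()        # placeholder, e.g. GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    train_dataloader = DataLoader(build_dataset(), batch_size=4, shuffle=True)

    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs[0]
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    main()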
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I encountered the same problem
I have the same problem... any solutions?
I am having the same problem!
FYI, I've observed this too, running with PJRT_DEVICE; I also tried on TPU v3, same issue. Worth noting that training runs fine on the TPUs when not using Accelerate, so it definitely feels like an issue with Accelerate's use of multiprocessing.
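For context, the plain torch_xla run that works without Accelerate looks roughly like this (a sketch with assumed details, not the commenter's actual script):

# Sketch of a plain torch_xla run (no Accelerate); details assumed.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    # ... build the model and dataloader here and train on `device`,
    # calling xm.optimizer_step(optimizer) after each backward pass ...
    print(f"process {index} running on {device}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)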
Thanks for the bump all, will look at it this week