Unexpected segmentation fault encountered in DataLoader workers with `num_workers > 0` when training model
Describe the bug
I encounter the following error when using DataLoader workers with num_workers > 0 while training the model. I notice that training only runs when num_workers = 0 is set, but then CPU usage sits at 100%.
Environment
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Name: torch
Version: 1.13.0+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: pytorch-lightning, torchmetrics, torchvision
Name: flwr
Version: 1.4.0
Summary: Flower: A Friendly Federated Learning Framework
Home-page: https://flower.dev
Author: The Flower Authors
Author-email: [email protected]
License: Apache-2.0
Location: /home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages
Requires: grpcio, iterators, numpy, protobuf
Required-by:
Steps/Code to Reproduce
training.py
# Imports inferred from usage in the excerpt (not shown in the original snippet)
import numpy as np
import torch
import torch.nn.functional as F
import tqdm
from sklearn import metrics

for ecgs, tgts in tqdm.tqdm(fit_loaders["fit"]):
    ecgs, tgts = ecgs.cuda(), tgts.cuda()
    logits = client_model(ecgs)
    loss = F.binary_cross_entropy_with_logits(logits, tgts)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    running_loss = running_loss + loss.item() * ecgs.size(0)
    # Threshold the sigmoid outputs at 0.5 to obtain binary multi-label predictions
    tgts = list(tgts.data.cpu().numpy())
    prds = list(np.where(torch.sigmoid(logits).detach().cpu().numpy() >= 0.5, 1.0, 0.0))
    running_tgts.extend(tgts)
    running_prds.extend(prds)

fit_loss = running_loss / len(fit_loaders["fit"].dataset)
fit_f1 = metrics.f1_score(running_tgts, running_prds, average="macro")
print("fit_loss:{:.4f}".format(fit_loss), "fit_f1:{:.4f}".format(fit_f1))
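The construction of fit_loaders is not shown above; for context, here is a minimal sketch of how such a loader is presumably built. ECGDataset, the placeholder arrays, the batch size, and the worker count are assumptions for illustration, not code from the repository. The crash only appears once num_workers > 0 turns on worker processes.

import torch
from torch.utils.data import DataLoader, Dataset

class ECGDataset(Dataset):
    """Hypothetical dataset: one ECG signal and a multi-label target per item."""

    def __init__(self, ecgs, labels):
        self.ecgs, self.labels = ecgs, labels

    def __len__(self):
        return len(self.ecgs)

    def __getitem__(self, idx):
        return (
            torch.as_tensor(self.ecgs[idx], dtype=torch.float32),
            torch.as_tensor(self.labels[idx], dtype=torch.float32),
        )

fit_loaders = {
    "fit": DataLoader(
        ECGDataset(train_ecgs, train_labels),  # train_ecgs/train_labels are placeholders
        batch_size=32,
        shuffle=True,
        num_workers=4,   # > 0 spawns the worker processes that segfault here
        pin_memory=True,
    )
}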
Expected Results
Training proceeds normally when num_workers > 0 is set.
Actual Results
0%| | 0/1976 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
0%| | 0/1976 [00:00<?, ?it/s]
DEBUG flwr 2023-07-28 04:11:25,363 | connection.py:113 | gRPC channel closed
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/ubuntu/long.ht/FL-ECG/source/tools/wandb/offline-run-20230728_041115-1ydn6b34
wandb: Find logs at: ./wandb/offline-run-20230728_041115-1ydn6b34/logs
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 485736) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "client.py", line 117, in <module>
fl.client.start_numpy_client(
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/app.py", line 252, in start_numpy_client
start_client(
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/app.py", line 178, in start_client
client_message, sleep_duration, keep_going = handle(
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/message_handler/message_handler.py", line 67, in handle
return _fit(client, server_msg.fit_ins), 0, True
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/message_handler/message_handler.py", line 126, in _fit
fit_res = maybe_call_fit(
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/client.py", line 184, in maybe_call_fit
return client.fit(fit_ins)
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/app.py", line 297, in _fit
results = self.numpy_client.fit(parameters, ins.config) # type: ignore
File "client.py", line 49, in fit
client_results = client_fit_fn(
File "/home/ubuntu/long.ht/FL-ECG/source/engines.py", line 18, in client_fit_fn
for ecgs, tgts in tqdm.tqdm(fit_loaders["fit"]):
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 485736) exited unexpectedly
I have encountered the same problem and finally avoided it by running the simulation instead (see the sketch after this list).
- When training with num_workers=0, the client process will use all CPU cores (no matter how many cores you have). In turn, this can make the DataLoader read data too slowly, slowing down the training process (my CPU has 64 cores and loading data takes 20 s for 10 clients).
- But when you specify num_workers>0, you will also encounter the above error for a subset of clients (e.g., I have 10 clients and most of the time half of them quit because of this error).
- The problem doesn't seem to be related to the PyTorch and CUDA versions; I tried the following combinations on different machines and all had this problem:
  - flwr 1.4, pytorch 2.0.1 + cu118
  - flwr 1.5, pytorch 2.1.0 + cu118
  - flwr 1.5, pytorch 2.1.0 + cu121
  - flwr 1.5, pytorch 2.1.1 + cu121
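For reference, a minimal sketch of what "running the simulation" can look like with Flower's Virtual Client Engine. The client class, the number of clients, and the per-client resources below are assumptions for illustration, not the commenter's actual setup; client_resources is what keeps a single virtual client from monopolising every CPU core.

import flwr as fl

def client_fn(cid: str):
    # Build the client for this virtual client id (model/data setup omitted).
    return MyFlowerClient().to_client()  # MyFlowerClient as used later in this thread

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=10,
    # Cap the CPUs/GPUs each virtual client may use so that no single
    # client process grabs all cores.
    client_resources={"num_cpus": 4, "num_gpus": 0.5},
    config=fl.server.ServerConfig(num_rounds=3),
)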
With your solution, it looks like every client can only use the same dataset. Is there any other way to solve this?
It seems that there is no other way. But we can indeed use a different dataset for each client:

def client_fn(cid: str):
    # Return a standard Flower client
    return MyFlowerClient().to_client()

When running a simulation, we have to specify client_fn, so we can load a different dataset depending on the cid, as sketched below.
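As an illustration, here is a minimal sketch of a client_fn that loads a different partition per cid. The file layout, the torch.load call, and MyFlowerClient's constructor argument are hypothetical, not taken from the thread.

import torch
from torch.utils.data import DataLoader

def client_fn(cid: str):
    # Each virtual client loads only its own pre-saved partition, keyed by cid.
    partition = torch.load(f"data/partitions/client_{cid}.pt")  # hypothetical path
    fit_loader = DataLoader(partition, batch_size=32, shuffle=True)
    return MyFlowerClient(fit_loader).to_client()  # hypothetical constructor argument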
A possible workaround is to start a debugpy adapter before the initialization of the server and clients:
import debugpy

# Start the debug adapter before the Flower server/clients initialise,
# so the process can be attached to and inspected when a worker crashes.
debugpy.listen(("localhost", args.tmp_dbg_port))
if args.role == "client":
    start_client(...)
else:
    start_server(...)
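The args object above is not defined in the workaround; a minimal argparse sketch is shown here, assuming flag names that simply mirror the attributes used (--role, --tmp_dbg_port) and are not taken from the thread.

import argparse

# Flag names mirror the attributes used in the workaround above (assumed).
parser = argparse.ArgumentParser()
parser.add_argument("--role", choices=["client", "server"], required=True)
parser.add_argument("--tmp_dbg_port", type=int, default=5678)
args = parser.parse_args()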
Thanks for raising this.
Is this issue still something that you are experiencing or has it been solved by newer ray or flwr versions?
This seems to have been resolved by flwr-datasets. Thus, I am closing this issue for now.
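For later readers, here is a minimal sketch of the flwr-datasets approach being referred to; the dataset name, partition count, and split are placeholders rather than the configuration used in this issue.

from flwr_datasets import FederatedDataset

# Partition a dataset across clients up front; "mnist" and the partition
# count are placeholders, not the ECG data from this issue.
fds = FederatedDataset(dataset="mnist", partitioners={"train": 10})

# Inside client_fn, each client then loads only its own partition by id.
partition = fds.load_partition(0, "train")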
Thank you very much for your efforts. I'm still using an older version of the Flwr framework, so I can't check whether the bug still exists. I will keep using the Flwr framework for future projects and will test there whether the bug reappears.