
Unexpected segmentation fault encountered in DataLoader workers with `num_workers > 0` when training model


Describe the bug

I encounter the following error when using DataLoader workers with `num_workers > 0` while training the model. Training only runs when I set `num_workers = 0`, but then CPU usage sits at 100%.

Environment

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Name: torch
Version: 1.13.0+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: pytorch-lightning, torchmetrics, torchvision

Name: flwr
Version: 1.4.0
Summary: Flower: A Friendly Federated Learning Framework
Home-page: https://flower.dev
Author: The Flower Authors
Author-email: [email protected]
License: Apache-2.0
Location: /home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages
Requires: grpcio, iterators, numpy, protobuf
Required-by: 


Steps/Code to Reproduce

training.py

# Excerpt from client_fit_fn in engines.py; imports and accumulator
# initialization added here for readability (assumed, not shown in the report).
import numpy as np
import torch
import torch.nn.functional as F
import tqdm
from sklearn import metrics

running_loss, running_tgts, running_prds = 0.0, [], []
for ecgs, tgts in tqdm.tqdm(fit_loaders["fit"]):
    ecgs, tgts = ecgs.cuda(), tgts.cuda()

    # Forward pass and loss
    logits = client_model(ecgs)
    loss = F.binary_cross_entropy_with_logits(logits, tgts)
    loss.backward()

    # Optimizer update
    optimizer.step()
    optimizer.zero_grad()
    running_loss += loss.item() * ecgs.size(0)

    # Collect targets and thresholded predictions for the macro F1 score
    running_tgts.extend(list(tgts.data.cpu().numpy()))
    running_prds.extend(list(np.where(torch.sigmoid(logits).detach().cpu().numpy() >= 0.5, 1.0, 0.0)))

fit_loss = running_loss / len(fit_loaders["fit"].dataset)
fit_f1 = metrics.f1_score(running_tgts, running_prds, average="macro")
print("fit_loss:{:.4f}".format(fit_loss), "fit_f1:{:.4f}".format(fit_f1))
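
For context, the report does not show how `fit_loaders` is built. A minimal sketch of the kind of DataLoader setup that triggers the crash might look like the following; `fit_dataset`, the batch size, and the worker count are assumptions rather than code from the report:

from torch.utils.data import DataLoader

# Hypothetical construction of fit_loaders; the segmentation fault only
# appears once num_workers is set above 0.
fit_loaders = {
    "fit": DataLoader(
        fit_dataset,        # ECG training dataset (assumed name)
        batch_size=32,      # assumed value
        shuffle=True,
        num_workers=4,      # > 0 triggers the crash in this setup
        pin_memory=True,
    ),
}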

Expected Results

Training runs normally when `num_workers > 0` is set.

Actual Results

  0%|          | 0/1976 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
  0%|          | 0/1976 [00:00<?, ?it/s]
DEBUG flwr 2023-07-28 04:11:25,363 | connection.py:113 | gRPC channel closed
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/ubuntu/long.ht/FL-ECG/source/tools/wandb/offline-run-20230728_041115-1ydn6b34
wandb: Find logs at: ./wandb/offline-run-20230728_041115-1ydn6b34/logs
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 485736) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "client.py", line 117, in <module>
    fl.client.start_numpy_client(
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/app.py", line 252, in start_numpy_client
    start_client(
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/app.py", line 178, in start_client
    client_message, sleep_duration, keep_going = handle(
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/message_handler/message_handler.py", line 67, in handle
    return _fit(client, server_msg.fit_ins), 0, True
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/message_handler/message_handler.py", line 126, in _fit
    fit_res = maybe_call_fit(
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/client.py", line 184, in maybe_call_fit
    return client.fit(fit_ins)
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/flwr/client/app.py", line 297, in _fit
    results = self.numpy_client.fit(parameters, ins.config)  # type: ignore
  File "client.py", line 49, in fit
    client_results = client_fit_fn(
  File "/home/ubuntu/long.ht/FL-ECG/source/engines.py", line 18, in client_fit_fn
    for ecgs, tgts in tqdm.tqdm(fit_loaders["fit"]):
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/home/ubuntu/miniconda3/envs/ecg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 485736) exited unexpectedly

HoTuanLong avatar Jul 28 '23 04:07 HoTuanLong

I have encountered the same problem and, in the end, I avoided it by running the Flower simulation instead (a minimal sketch follows the list below).

  • When training with `num_workers=0`, the client process will use all CPU cores (no matter how many cores you have). In turn, this can make the DataLoader read data too slowly and slow down training (my CPU has 64 cores, and loading data takes 20 seconds for 10 clients).

  • But when you specify `num_workers>0`, you will also hit the error above for a subset of clients (e.g., I have 10 clients, and most of the time half of them quit because of this error).

  • The problem does not seem to be related to the PyTorch or CUDA version; I tried the following combinations on different machines and all of them had this problem:

    • flwr 1.4, pytorch 2.0.1 + cu118
    • flwr 1.5, pytorch 2.1.0 + cu118
    • flwr 1.5, pytorch 2.1.0 + cu121
    • flwr 1.5, pytorch 2.1.1 + cu121
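
To make "running simulation" concrete, here is a minimal sketch using Flower's simulation engine; it assumes the flwr 1.x `start_simulation` API, and `MyFlowerClient`, the client count, and the resource limits are placeholders rather than code from this thread:

import flwr as fl

def client_fn(cid: str):
    # Build one Flower client per virtual client id (see the discussion below)
    return MyFlowerClient().to_client()

# Instead of starting each client process by hand, let the simulation engine
# spawn clients as Ray actors with a bounded amount of resources per client.
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=10,                                       # placeholder
    config=fl.server.ServerConfig(num_rounds=3),          # placeholder
    strategy=fl.server.strategy.FedAvg(),
    client_resources={"num_cpus": 2, "num_gpus": 0.5},    # per-client caps
)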

Phoenix-Shen avatar Nov 23 '23 09:11 Phoenix-Shen

> I have encountered the same problem and, in the end, I avoided it by running the simulation. [...]

With your solution, the clients would only be able to use the same dataset; is there any other way to solve this?

CHENxx23 avatar May 29 '24 07:05 CHENxx23

> With your solution, the clients would only be able to use the same dataset; is there any other way to solve this?

It seems there is no other way, but we can indeed use a different dataset for each client:

def client_fn(cid: str):
    # Return a standard Flower client
    return MyFlowerClient().to_client()

When running a simulation we have to provide `client_fn`, and inside it we can load a different dataset depending on the `cid`, for example:
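
A minimal sketch of that idea follows; `load_partition`, the batch size, and the `MyFlowerClient` constructor are hypothetical placeholders used only to illustrate routing data by `cid`:

from torch.utils.data import DataLoader

def client_fn(cid: str):
    # Each virtual client selects its own data partition via its cid.
    # load_partition is a hypothetical helper returning a torch Dataset
    # for partition int(cid).
    train_set = load_partition(int(cid))
    train_loader = DataLoader(train_set, batch_size=32, num_workers=2)
    return MyFlowerClient(train_loader).to_client()

This `client_fn` is then passed to `fl.simulation.start_simulation`, as in the sketch earlier in the thread.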

Phoenix-Shen avatar May 29 '24 07:05 Phoenix-Shen

A possible workaround is to start a debugpy adapter before initializing the server and the clients:

import debugpy

# Starting the debug adapter before Flower spins up appears to avoid the worker crash.
debugpy.listen(("localhost", args.tmp_dbg_port))
if args.role == "client":
    start_client(...)
else:
    start_server(...)
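
Note that `debugpy.listen` binds a local TCP port, so when several client processes run on the same machine, each one needs its own `tmp_dbg_port` value.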

ztech avatar Jun 14 '24 09:06 ztech

Thanks for raising this.

Is this issue still something that you are experiencing or has it been solved by newer ray or flwr versions?

WilliamLindskog avatar Dec 09 '24 18:12 WilliamLindskog

This seems to have been resolved by flwr-datasets, so I am closing this issue for now.

WilliamLindskog avatar Dec 10 '24 18:12 WilliamLindskog

> This seems to have been resolved by flwr-datasets, so I am closing this issue for now.

Thank you very much for your efforts. I'm still using an older version of the flwr framework, so I can't check whether the bug still exists, but I will keep using flwr in future projects and test this again then.

Phoenix-Shen avatar Dec 11 '24 08:12 Phoenix-Shen