
Can't pickle local object while training RandLANet on S3DIS

Open kimdn opened this issue 3 years ago • 8 comments


Describe the issue

Training RandLANet on S3DIS fails with "Can't pickle local object". I am using the PyTorch backend.

Steps to reproduce the bug

import open3d.ml as _ml3d
import open3d.ml.torch as ml3d

model = ml3d.models.RandLANet()

dataset_path = "/Users/kimd999/research/projects/Danny/files/public_dataset/S3DIS/Stanford3dDataset_v1.2_Aligned_Version"
dataset = ml3d.datasets.S3DIS(dataset_path=dataset_path, use_cache=True)

pipeline = ml3d.pipelines.SemanticSegmentation(model=model, dataset=dataset, max_epoch=100)

# prints training progress in the console.
pipeline.run_train()

Error message

INFO - 2022-02-15 13:24:10,927 - semantic_segmentation - DEVICE : cpu
INFO - 2022-02-15 13:24:10,927 - semantic_segmentation - Logging in file : ./logs/RandLANet_S3DIS_torch/log_train_2022-02-15_13:24:10.txt
INFO - 2022-02-15 13:24:10,929 - s3dis - Found 249 pointclouds for train
INFO - 2022-02-15 13:24:10,935 - s3dis - Found 23 pointclouds for validation
INFO - 2022-02-15 13:24:10,937 - semantic_segmentation - Initializing from scratch.
INFO - 2022-02-15 13:24:10,940 - semantic_segmentation - Writing summary in train_log/00003_RandLANet_S3DIS_torch.
INFO - 2022-02-15 13:24:10,940 - semantic_segmentation - Started training
INFO - 2022-02-15 13:24:10,940 - semantic_segmentation - === EPOCH 0/100 ===
training:   0%|          | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_model_for_semantic_segmentation.py", line 19, in <module>
    pipeline.run_train()
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py", line 394, in run_train
    for step, inputs in enumerate(tqdm(train_loader, desc='training')):
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 914, in __init__
    w.start()
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SemSegRandomSampler.get_point_sampler.<locals>._random_centered_gen'

Expected behavior

No response

Open3D, Python and System information

- Operating system: macOS 10.15.7
- Python version: Python 3.8.12
- Open3D version: 0.14.1
- System type: x86_64
- Is this a remote workstation?: no
- How did you install Open3D?: pip install open3d

Additional information

No response

kimdn avatar Feb 15 '22 21:02 kimdn

I am having the exact same issue, also with RandLANet and SemanticSegmentation. I will let you know if I find the problem, @kimdn.

bernhardpg avatar Feb 24 '22 18:02 bernhardpg

@kimdn This seems to be caused by the num_workers setting in PyTorch's DataLoader; see this thread: https://github.com/pyg-team/pytorch_geometric/issues/366.

Try setting num_workers=0 in your pipeline definition like so: pipeline = ml3d.pipelines.SemanticSegmentation(model=model, dataset=dataset, max_epoch=100, num_workers=0)
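For reference, here is that workaround as a complete minimal sketch (the dataset path is a placeholder; only num_workers=0 differs from the original repro):

import open3d.ml as _ml3d
import open3d.ml.torch as ml3d

model = ml3d.models.RandLANet()
dataset = ml3d.datasets.S3DIS(dataset_path="/path/to/S3DIS/Stanford3dDataset_v1.2_Aligned_Version", use_cache=True)

# num_workers=0 keeps data loading in the main process, so no worker process is spawned
# and the sampler closure never needs to be pickled.
pipeline = ml3d.pipelines.SemanticSegmentation(model=model, dataset=dataset, max_epoch=100, num_workers=0)
pipeline.run_train()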

I guess it is not a great solution if you intend to have num_workers > 0, but hopefully it will at least resolve the error message!
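For anyone wondering why num_workers matters here: with num_workers > 0 the DataLoader starts worker processes, and under the spawn or forkserver start methods (spawn is the default on macOS) the worker arguments must be picklable. The sampler in the traceback returns a function defined inside another function, which the standard pickle module refuses to serialize. A minimal, self-contained illustration in plain Python (not Open3D-ML code):

import pickle

def get_point_sampler():
    # Local (nested) function, analogous to _random_centered_gen in the traceback.
    def _random_centered_gen():
        return 42
    return _random_centered_gen

sampler = get_point_sampler()
try:
    # This is essentially what happens when a worker process is spawned.
    pickle.dumps(sampler)
except AttributeError as exc:
    print(exc)  # Can't pickle local object 'get_point_sampler.<locals>._random_centered_gen'

With num_workers=0 no worker process is started, so the closure is never pickled and the error does not occur.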

bernhardpg avatar Feb 24 '22 18:02 bernhardpg

I used WSL Ubuntu to train the models. num_workers > 0 worked for RandLA-Net but not for KPConv, which was very strange. At least it proved that multiprocessing can work in this virtual environment. Do you have any ideas about the difference between the two model deployments?

maosuli avatar Apr 24 '22 07:04 maosuli

Hi @bernhardpg @LuZaiJiaoXiaL

I have set num_workers to 0, but I still hit this bug. Do you know how to solve it?

python scripts/run_pipeline.py torch -c ml3d/configs/randlanet_toronto3d.yml --dataset.dataset_path dataset/Toronto_3D --pipeline SemanticSegmentation --dataset.use_cache True --num_workers 0

INFO - 2022-12-09 17:31:29,220 - semantic_segmentation - === EPOCH 0/200 ===
training:   0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/export/home2/hanxiaobing/Documents/Open3D-ML-code/Open3D-ML/scripts/run_pipeline.py", line 246, in <module>
    sys.exit(main())
  File "/export/home2/hanxiaobing/Documents/Open3D-ML-code/Open3D-ML/scripts/run_pipeline.py", line 180, in main
    pipeline.run_train()
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py", line 406, in run_train
    for step, inputs in enumerate(tqdm(train_loader, desc='training')):
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SemSegRandomSampler.get_point_sampler.<locals>._random_centered_gen'
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

whuhxb avatar Dec 09 '22 05:12 whuhxb

I set num_workers to 0 in run_pipeline.py, but the same error still happens.

ted8201 avatar Jan 16 '23 08:01 ted8201

Hey, any new insights into how to fix this problem? I just ran into the same issue on a dockerized Ubuntu 20.04 with cudnn 11.7.

I would be happy if someone could share their latest fixes.

RauchLukas avatar May 02 '23 16:05 RauchLukas

Same problem here. Very keen to get this fixed if I can.

Thanks

runra avatar Sep 27 '23 01:09 runra

Hi, I found a solution. Just add "num_workers: 0" and "pin_memory: false" under the "pipeline" section of the YAML config file. Solution link: https://blog.csdn.net/weixin_40653140/article/details/130492849
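For reference, a sketch of where those keys would sit in a config such as ml3d/configs/randlanet_toronto3d.yml; the surrounding keys are illustrative and may differ between Open3D-ML versions, only num_workers and pin_memory are the actual change:

pipeline:
  name: SemanticSegmentation
  max_epoch: 200
  num_workers: 0      # disable DataLoader worker processes
  pin_memory: false   # also skip the pinned-memory transfer thread

The max_epoch value above simply mirrors the "EPOCH 0/200" line in the log earlier in this thread.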

DCtcl avatar Feb 28 '24 02:02 DCtcl