
Adapt for PyTorch version >= 1.9

Open ksellesk opened this issue 3 years ago • 14 comments

ksellesk avatar Aug 06 '21 06:08 ksellesk

Can you describe the problem you see when you use the old code with PyTorch 1.9?

robin1001 avatar Aug 09 '21 01:08 robin1001

I tried a lot of things, but they all failed.

Then I changed my code following this sample:

https://github.com/pytorch/examples/blob/master/imagenet/main.py

and it works!
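
For reference, here is a minimal sketch of the per-process DDP setup pattern that example uses. The `LOCAL_RANK` environment variable is an assumption about the launcher (it is set by `torch.distributed.launch --use_env` or `torch.distributed.run`), and the `nn.Linear` is a stand-in, not the actual wenet model:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU. LOCAL_RANK (and MASTER_ADDR/MASTER_PORT/RANK/
    # WORLD_SIZE for the default env:// init) are set by the launcher.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")

    # Bind this process to its GPU *before* constructing DDP, so the NCCL
    # communicator is created on the right device.
    torch.cuda.set_device(local_rank)

    model = nn.Linear(80, 256).cuda(local_rank)  # stand-in for the real model
    ddp_model = DDP(model, device_ids=[local_rank],
                    find_unused_parameters=True)

    x = torch.randn(8, 80, device=f"cuda:{local_rank}")
    ddp_model(x).sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```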

ksellesk avatar Aug 09 '21 02:08 ksellesk

What's the error message?

robin1001 avatar Aug 09 '21 02:08 robin1001

Traceback (most recent call last):
  File "wenet/bin/train.py", line 199, in <module>
    model, find_unused_parameters=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1623448265233/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.


Our machine supports NCCL.
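
A standard first step with an opaque ncclUnhandledCudaError like this is to turn on NCCL's own logging before launching training; the log usually names the CUDA call that actually failed. A minimal sketch (these variables must be set before any CUDA/NCCL initialization, or exported in the launching shell):

```python
import os

# Must be set before torch/NCCL initialize in each worker process;
# alternatively export them in the shell that launches training.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```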

ksellesk avatar Aug 10 '21 08:08 ksellesk

Any progress?

Has anyone tried PyTorch version >= 1.9?

ksellesk avatar Aug 12 '21 10:08 ksellesk


The GTX 1080 Ti and TITAN RTX have no problem with PyTorch 1.9.0, but the 2080 Ti has the same problem.
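
Since the failures seem to depend on the card, it may be worth comparing the CUDA build, compute capability, and NCCL version across the working and failing machines; for instance, Ampere cards (RTX 3080/3090, compute capability 8.6) require a CUDA 11.x build of PyTorch. A quick diagnostic sketch:

```python
import torch

print("torch:", torch.__version__)
print("cuda build:", torch.version.cuda)
# Ampere (RTX 3080/3090) reports (8, 6) and needs a CUDA 11.x wheel.
print("capability:", torch.cuda.get_device_capability(0))
print("nccl:", torch.cuda.nccl.version())
```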

TeaPoly avatar Aug 13 '21 06:08 TeaPoly


What is the graphics card you are using?

TeaPoly avatar Aug 13 '21 10:08 TeaPoly


3090

ksellesk avatar Aug 16 '21 02:08 ksellesk


I get the same error on an NVIDIA RTX 3080 in a Docker environment.

shanguanma avatar Aug 16 '21 03:08 shanguanma

Any update now?

robin1001 avatar Aug 17 '21 09:08 robin1001

@robin1001 Until now, I have found one way that works for me on PyTorch 1.8 or 1.9. I run wenet with the command `docker run -it --rm --gpus '"device=1,2"' --shm-size=1g --ulimit memlock=-1 --ipc=host -v /mnt/4T:/mnt/4T -v /home/maduo/docker_asr_wenet_nvidia_gpu/wenet:/code/w2021/wenet asr_wenet_nvidia_gpu:latest`, but in run.sh I must set `export CUDA_VISIBLE_DEVICES="0,1"`. If I increase the number of GPUs, the above error occurs again.
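
In other words, the workaround keeps the job at two GPUs: Docker maps physical devices 1 and 2 into the container, and `CUDA_VISIBLE_DEVICES="0,1"` then addresses them by their container-local ordinals. A quick sanity check inside the container (a sketch, assuming nothing about the wenet scripts themselves):

```python
import os

import torch

# Inside the container, --gpus '"device=1,2"' exposes two physical GPUs,
# which CUDA renumbers as ordinals 0 and 1; CUDA_VISIBLE_DEVICES must use
# these container-local ordinals, not the host device IDs.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs =", torch.cuda.device_count())  # expected: 2
```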

shanguanma avatar Aug 17 '21 11:08 shanguanma

Thank you! I met the same error. This solution works for me with the environment below:

PyTorch: 1.9.0+cu111
GPU: 3080Ti * 2
CUDA: 11.1
OS: Ubuntu 18.04
Driver: 460.91.03

cnrpman avatar Aug 27 '21 06:08 cnrpman

Any progress?

ksellesk avatar Aug 30 '21 03:08 ksellesk


The fix seems to behave oddly when loading checkpoints. I attached my details in https://github.com/wenet-e2e/wenet/discussions/586#discussioncomment-1345310
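
For anyone hitting the checkpoint issue, one DDP gotcha worth ruling out is loading a checkpoint without a `map_location`, which makes every rank deserialize the tensors onto cuda:0. A hedged sketch (the path `exp/model.pt` and the `nn.Linear` module are placeholders, not wenet's actual code):

```python
import torch
import torch.nn as nn

local_rank = 0  # normally taken from the launcher's LOCAL_RANK
model = nn.Linear(80, 256)  # placeholder for the real model

# Map tensors to this rank's GPU (or CPU) instead of the default cuda:0,
# so multiple DDP workers don't all deserialize onto the same device.
state = torch.load("exp/model.pt",
                   map_location=f"cuda:{local_rank}"
                   if torch.cuda.is_available() else "cpu")
model.load_state_dict(state)
```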

cnrpman avatar Sep 17 '21 02:09 cnrpman