
When reproducing x2 super-resolution on the DF2K dataset with multi-GPU training, I get "UserWarning: Grad strides do not match bucket view strides." Has anyone run into this? Does it affect the final results?

Open VMC-Lab-Chen opened this issue 2 years ago • 5 comments

Hi everyone, do you also see the following UserWarning during multi-GPU training? Does it affect the final results? (I am training on 2x RTX 4090.)

/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [180, 6, 1, 1], strides() = [6, 1, 6, 6]
bucket_view.sizes() = [180, 6, 1, 1], strides() = [6, 1, 1, 1]
(Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

VMC-Lab-Chen avatar Nov 01 '23 05:11 VMC-Lab-Chen
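For what it's worth, the two stride tuples in this particular warning differ only in the size-1 dimensions, which never contribute to an element's memory offset, so both layouts address the same data. A minimal check in plain PyTorch (variable names are illustrative, not from the HAT codebase):

```python
import torch

# Rebuild the two layouts from the warning on top of the same storage.
# Strides of size-1 dimensions never affect where an element lives, so
# these two views address exactly the same elements.
storage = torch.randn(180 * 6)
grad_view = storage.as_strided((180, 6, 1, 1), (6, 1, 6, 6))    # grad strides from the warning
bucket_view = storage.as_strided((180, 6, 1, 1), (6, 1, 1, 1))  # bucket_view strides

print(torch.equal(grad_view, bucket_view))  # True: identical contents
print(grad_view.is_contiguous())            # True: size-1 dims are skipped by the check
```

DDP's reducer compares the stride tuples literally, so it warns anyway; the worst case is an extra copy into the gradient bucket, which matches the "not an error, but may impair performance" wording.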

@Superfish666 It should not affect the results. It is a PyTorch version issue.

chxy95 avatar Nov 01 '23 07:11 chxy95

@Superfish666 It should not affect the results. It is a PyTorch version issue.

OK, thank you!

VMC-Lab-Chen avatar Nov 05 '23 12:11 VMC-Lab-Chen

@chxy95 @Superfish666 Could you share your environment configuration for multi-GPU training, e.g. the CUDA, PyTorch, Python, and transformer versions? Thanks.

qazwsx042 avatar Nov 27 '23 01:11 qazwsx042

Are there any updates here? I get worse performance after fine-tuning on my own data, and I think this warning may be the cause. I have changed the GPU count to 1 and the batch size to 1, and added .contiguous() to the lines that use permute and transpose. What else should I try?

abdullahbas avatar May 19 '24 13:05 abdullahbas

For anyone with the same problem: add .contiguous() after every transpose and permute call that is not already followed by .contiguous(), then reinstall HAT with "pip install -e .". That's it, the warning is gone.

abdullahbas avatar May 19 '24 13:05 abdullahbas
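For reference, the change described above follows this pattern. This is a minimal sketch, not code from the HAT repository; the module and shapes are made up for illustration:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy module showing the permute/transpose pattern discussed above."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, tokens)
        # Before: x = x.permute(0, 2, 1)            -> non-contiguous view
        # After:  chain .contiguous() so downstream ops, and the gradients
        #         they produce, see a dense row-major layout.
        x = x.permute(0, 2, 1).contiguous()
        x = self.proj(x)
        x = x.transpose(1, 2).contiguous()
        return x
```

Calling .contiguous() on a tensor that is already contiguous returns it unchanged, so appending it after every permute/transpose is cheap where the layout is already dense. Reinstalling with "pip install -e ." ensures Python imports the edited source tree rather than a previously installed copy.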