
torch._C._LinAlgError: linalg.svd: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 15).

Open · CharisWg opened this issue 1 year ago · 1 comment

C:\Users\LocalAdmin\anaconda3\envs\lightlyyolo\python.exe D:\Charis\SSL-yolo8\lightly-master\examples\pytorch\mmcr_yolo.py
WARNING ⚠️ no model scale passed. Assuming scale='n'.
class_name is: MMCR
save_path is: D:\Charis\SSL-yolo8\lightly-master\runs\MMCR
Starting Training
epoch: 00, loss: -2415920191337764664519950336.00000
after training
tensor([ -0.7926, -2.2815, -0.7858, -14.8213, -16.7507], device='cuda:0')
tensor([ -0.7926, -2.2815, -0.7858, -14.8213, -16.7507], device='cuda:0')
tensor([ -0.7926, -2.2815, -0.7858, -14.8213, -16.7507], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
after saving training + has backbone.load_state_dict
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
save full_path is: D:\Charis\SSL-yolo8\lightly-master\runs\MMCR\MMCR_coca_alldcm_MMCRTransform.pth
Saving model for MMCR_coca_alldcm_MMCRTransform.pth at Epoch 1
Finding optimal model params. Loss is dropping from -2415920191337764664519950336.0000 to -2415920191337764664519950336.0000
D:\Charis\SSL-yolo8\lightly-master\lightly\loss\mmcr_loss.py:60: UserWarning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0, 1, 2, 3, 4, and other 123 batches failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\linalg\BatchLinearAlgebraLib.cpp:703.)
  _, S_z, _ = svd(z)
Traceback (most recent call last):
  File "D:\Charis\SSL-yolo8\lightly-master\examples\pytorch\mmcr_yolo.py", line 158, in <module>
    loss = criterion(z_o, z_m)
  File "C:\Users\LocalAdmin\anaconda3\envs\lightlyyolo\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\LocalAdmin\anaconda3\envs\lightlyyolo\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Charis\SSL-yolo8\lightly-master\lightly\loss\mmcr_loss.py", line 60, in forward
    _, S_z, _ = svd(z)
torch._C._LinAlgError: linalg.svd: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 15).

Process finished with exit code 1

CharisWg · Jul 24 '24 11:07

Hi, sorry for the late reply. It looks like the magnitude of your loss is far too large (-2415920191337764664519950336.00000), and it does not change between epochs. Try decreasing the learning rate, or check your gradient values and clip them if necessary.
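
For reference, here is a minimal sketch of what checking and clipping gradients could look like in a plain PyTorch loop. It is not the code from `mmcr_yolo.py`: the `training_step` helper, the `max_grad_norm` value, the `nn.Linear` model, and the `nn.MSELoss` criterion are all placeholders chosen for illustration, and the small learning rate is just an example of a more conservative setting.

```python
import math

import torch
from torch import nn


def training_step(model, criterion, optimizer, view_a, view_b, max_grad_norm=1.0):
    """Run one optimisation step with a finite-loss check and gradient clipping.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    Returns the loss value and the total gradient norm measured before clipping.
    """
    optimizer.zero_grad()
    loss = criterion(model(view_a), model(view_b))

    # Stop early instead of feeding inf/NaN values into later steps (e.g. an SVD).
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")

    loss.backward()

    # clip_grad_norm_ rescales the gradients in place and returns the total
    # norm as it was before clipping, which is useful to log.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

    optimizer.step()
    return loss.item(), float(grad_norm)


# Placeholder setup: a tiny model and loss standing in for the real backbone and MMCR loss.
torch.manual_seed(0)
model = nn.Linear(8, 4)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # deliberately small learning rate

# Large inputs produce large gradients, so the clipping has a visible effect.
view_a = torch.randn(16, 8) * 1e3
view_b = torch.randn(16, 8) * 1e3

loss_value, grad_norm = training_step(model, criterion, optimizer, view_a, view_b)
print(f"loss={loss_value:.3f}, grad norm before clipping={grad_norm:.3f}")
```

If the logged gradient norm is orders of magnitude above `max_grad_norm` on most steps, that is a hint that the learning rate or the input scaling needs attention, not just the clipping threshold.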

guarin · Aug 16 '24 06:08