Intel oneMKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Open wza13 opened this issue 9 months ago • 7 comments

D:\Users\12719\anaconda3\python.exe D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py
 20%|██ | 20/100 [00:01<00:06, 12.66it/s, mse_loss=nan, reg_loss=nan]
Intel oneMKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel oneMKL ERROR: Parameter 6 was incorrect on entry to SGELSY.
 20%|██ | 20/100 [00:02<00:08, 9.82it/s, mse_loss=nan, reg_loss=nan]
Traceback (most recent call last):
  File "D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py", line 35, in <module>
    test_mul()
  File "D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py", line 29, in test_mul
    optimizer.step(closure)
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\optim\optimizer.py", line 459, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\optim\lbfgs.py", line 320, in step
    orig_loss = closure()
                ^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py", line 18, in closure
    y = kan(x, update_grid=(i % 20 == 0))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\src\efficient_kan\kan.py", line 272, in forward
    layer.update_grid(x)
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\src\efficient_kan\kan.py", line 210, in update_grid
    self.spline_weight.data.copy_(self.curve2coeff(x, unreduced_spline_output))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\src\efficient_kan\kan.py", line 131, in curve2coeff
    solution = torch.linalg.lstsq(
               ^^^^^^^^^^^^^^^^^^^
RuntimeError: false INTERNAL ASSERT FAILED at "C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\BatchLinearAlgebra.cpp":1538, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.

wza13 avatar May 13 '24 09:05 wza13

Hi, did you solve this problem? I get the same output when running test_simple_math.py.

LIWEIDENG0830 avatar May 13 '24 12:05 LIWEIDENG0830

Sounds like https://github.com/KindXiaoming/pykan/issues/170. Changing the driver in the code may help.

Indoxer avatar May 13 '24 17:05 Indoxer

Hi Indoxer, thanks for your kind help! It looks like the same problem as in pykan. However, I tried changing the driver in lstsq to solution = torch.linalg.lstsq(A, B, driver='gelsy').solution and running on CPU, and it does not work in my case.
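
For reference, a minimal sketch of switching the driver. According to the PyTorch documentation, 'gelsy' is already the default driver for CPU inputs, which may be why passing it explicitly changes nothing; 'gelsd' and 'gelss' are the SVD-based alternatives available on CPU. The matrices below are made-up stand-ins, not the A and B built in curve2coeff, and a different driver would not be expected to help once the inputs already contain NaN.

```python
import torch

# Made-up, well-conditioned system on CPU (assumption: shapes are arbitrary).
torch.manual_seed(0)
A = torch.randn(8, 3)
x_true = torch.randn(3, 2)
B = A @ x_true

# 'gelsy' is the CPU default; 'gelsd' and 'gelss' are SVD-based alternatives.
for driver in ("gelsy", "gelsd", "gelss"):
    solution = torch.linalg.lstsq(A, B, driver=driver).solution
    # Each driver should recover x_true up to float32 rounding error.
    print(driver, torch.allclose(solution, x_true, atol=1e-4))
```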

LIWEIDENG0830 avatar May 14 '24 03:05 LIWEIDENG0830

This is because the learning rate in that example is too high (lr = 1), so B turns to NaN during training. Lowering it may fix the problem.
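
A minimal sketch of that suggestion, assuming a stand-in linear model rather than the KAN used in test_simple_math.py. The value lr=0.1 is an arbitrary smaller choice for illustration, not a recommended setting.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)   # stand-in model (assumption), not the KAN
x = torch.rand(64, 2)
y = x[:, :1] * x[:, 1:]         # multiplication target, as in test_mul

# lr=1 is the value reported in the thread; 0.1 is an arbitrary smaller one.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

def closure():
    # LBFGS re-evaluates the loss several times per step via this closure.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

losses = []
for _ in range(5):
    # For LBFGS, step() returns the loss from the first closure evaluation.
    losses.append(optimizer.step(closure).item())
print(losses)
```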

Xu-backup avatar May 14 '24 04:05 Xu-backup

Okkkk. Thanks Xu. It works!

LIWEIDENG0830 avatar May 14 '24 04:05 LIWEIDENG0830

The above error happened when updating the grid, so how is it related to B exploding?

boxaio avatar May 14 '24 13:05 boxaio

I have not actually found why it happens. But I see B = y.transpose(0, 1) in the code, and y turns to NaN first, so maybe something is being divided by a number close to 0. With a high lr you can easily end up with abnormal parameters.
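
One way to confirm this is to check the inputs before they reach lstsq. The helper below is hypothetical (safe_lstsq is not part of efficient-kan or PyTorch) and only sketches the idea: fail early with a readable message instead of letting NaN reach the LAPACK backend.

```python
import torch

def safe_lstsq(A, B):
    """Hypothetical helper: refuse to call lstsq on non-finite inputs."""
    if not (torch.isfinite(A).all() and torch.isfinite(B).all()):
        raise ValueError("non-finite values in lstsq inputs; check lr and loss")
    return torch.linalg.lstsq(A, B).solution

A = torch.eye(3)
B_bad = torch.tensor([[1.0], [float("nan")], [3.0]])  # made-up NaN input

try:
    safe_lstsq(A, B_bad)
    message = ""
except ValueError as err:
    # The guard fires before the backend library is ever called.
    message = str(err)
print(message)
```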

Xu-backup avatar May 14 '24 14:05 Xu-backup