
M1 runtime fails with "AssertionError: Torch not compiled with CUDA enabled"

Open rmrfxyz opened this issue 9 months ago • 14 comments

Hi! Thanks a lot for the awesome paper and implementation!

I can't get it to run on my M1 machine. I built PyTorch from source with the CUDA options disabled. I tried setting device = "cpu" and poked around randomly, but I always get the same error when trying to run the examples:

AssertionError                            Traceback (most recent call last)
Cell In[1], line 6
      2 import torch
      4 # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
----> 6 model = KAN(width=[2,3,2,1], device='cpu')
      8 x = torch.normal(0,1,size=(100,2))

File ~/conda/envs/pykan-env/lib/python3.9/site-packages/kan/, in KAN.__init__(self, width, grid, k, noise_scale, noise_scale_base, base_fun, symbolic_enabled, bias_trainable, grid_eps, grid_range, sp_trainable, sb_trainable, device, seed)
    137 for l in range(self.depth):
    138     # splines
    139     scale_base = 1 / np.sqrt(width[l]) + (torch.randn(width[l] * width[l + 1], ) * 2 - 1) * noise_scale_base
--> 140     sp_batch = KANLayer(in_dim=width[l], out_dim=width[l + 1], num=grid, k=k, noise_scale=noise_scale, scale_base=scale_base, scale_sp=1., base_fun=base_fun, grid_eps=grid_eps, grid_range=grid_range, sp_trainable=sp_trainable,
    141                         sb_trainable=sb_trainable, device=device)
    142     self.act_fun.append(sp_batch)
    144     # bias

File ~/conda/envs/pykan-env/lib/python3.9/site-packages/kan/, in KANLayer.__init__(self, in_dim, out_dim, num, k, noise_scale, scale_base, scale_sp, base_fun, grid_eps, grid_range, sp_trainable, sb_trainable, device)
    124     self.scale_base = torch.nn.Parameter(torch.ones(size, device=device) * scale_base).requires_grad_(sb_trainable)  # make scale trainable
    125 else:
--> 126     self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable)
    127 self.scale_sp = torch.nn.Parameter(torch.ones(size, device=device) * scale_sp).requires_grad_(sp_trainable)  # make scale trainable
    128 self.base_fun = base_fun
    286     raise AssertionError(
    287         "libcudart functions unavailable. It looks like you have a broken build?"
    288     )

AssertionError: Torch not compiled with CUDA enabled

What am I missing 🤔

rmrfxyz avatar May 07 '24 07:05 rmrfxyz

You should put torch==2.3.0+cu121 (or whichever CUDA version you need) into the requirements.

AlessandroFlati avatar May 07 '24 08:05 AlessandroFlati

Actually, latest master version is bugged, without @KindXiaoming

AlessandroFlati avatar May 07 '24 08:05 AlessandroFlati

But pytorch is built locally and not installed through requirements.txt, as that fails on M1 since there is no CUDA available. So I built it from source and installed it in conda env separately.

I find it confusing that the error says "torch NOT compiled with CUDA", since I have to explicitly disable those options before building - otherwise it fails to install.

So I'm thinking maybe the failure is in the pytorch build, not in pykan... Maybe? I'll try to fiddle with the makefile, maybe I'm overlooking something there.
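
For anyone landing here with the same message: PyTorch raises this assertion whenever code asks for a CUDA device on a build without CUDA support, so the message alone does not show that the build is broken. A quick way to see what a given build supports is sketched below. `describe_build` is a hypothetical helper name, and the `getattr` guard is only there on the assumption that older PyTorch releases may not ship `torch.backends.mps`.

```python
import torch

def describe_build():
    """Report which accelerator backends this PyTorch build can use."""
    mps = getattr(torch.backends, "mps", None)  # may be absent on older releases
    return {
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "mps_available": bool(mps is not None and mps.is_available()),
    }

info = describe_build()
print(info)
```

If `cuda_available` is False, any `.cuda()` call reached at runtime will fail with the assertion above, whatever `device` argument was passed in.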

rmrfxyz avatar May 07 '24 08:05 rmrfxyz

Hi! Yesterday I was able to run it on an M1 Max chip with the following versions (in an Anaconda environment):

Name         Version  Build   Channel
torch        2.3.0    pypi_0  pypi
torchaudio   2.3.0    pypi_0  pypi
torchvision  0.18.0   pypi_0  pypi

It is extremely slow, even compared with the CPU version on Windows. Idk if it makes any difference, but I do not move the model to any device through torch; I use just these lines:

kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=2, k=3, seed=0)
kan_model.train(my_ds, opt="LBFGS", steps=2, lamb=0.01, lamb_entropy=10.)

gonzalalGFM avatar May 07 '24 08:05 gonzalalGFM

As you can see, the problem lies in the line self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable), which, in an earlier (bad) attempt at letting people use CUDA, forces the parameter onto CUDA. You can edit that line yourself if you just want to use the CPU, but we should really just wait for the PR to be accepted.
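
A minimal sketch of what a device-agnostic version of that line could look like, assuming the only change needed is to replace the hard-coded `.cuda()` with `.to(device)`. `make_scale_base` is a hypothetical helper for illustration, not part of pykan, and this is not necessarily how the pending PR does it.

```python
import math
import torch

def make_scale_base(scale_base, device="cpu", trainable=True):
    """Hypothetical helper: build the parameter on `device` instead of calling .cuda()."""
    values = torch.as_tensor(scale_base, dtype=torch.float32)
    return torch.nn.Parameter(values.to(device)).requires_grad_(trainable)

# Same shape of computation as in the traceback, with made-up sizes.
width_in, width_out, noise_scale_base = 2, 3, 0.1
scale_base = 1 / math.sqrt(width_in) + (torch.randn(width_in * width_out) * 2 - 1) * noise_scale_base

param = make_scale_base(scale_base, device="cpu")
print(param.device, param.dtype, tuple(param.shape))
```

With this shape of code, passing `device="cpu"` never touches the CUDA runtime, so it should also run on builds compiled without CUDA.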

AlessandroFlati avatar May 07 '24 08:05 AlessandroFlati

Also, I'm unable to run any KAN model on the GPU. I send both the dataset and the model to the device (cuda), but it keeps giving me this error:

device = torch.device("cuda")
dataset = {}
dataset["train_input"] = torch.from_numpy(np.array(X_train))
dataset["test_input"] = torch.from_numpy(np.array(X_test))
dataset["train_label"] = torch.from_numpy(np.array(Y_train))
dataset["test_label"] = torch.from_numpy(np.array(Y_test))
for key, value in dataset.items():
    dataset[key] = dataset[key].to(device)
kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=3, k=3, seed=0, device = device)
kan_model.train(dataset, opt="LBFGS", steps=50, lamb=0, lamb_entropy=0, device = device)

Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
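
One way to narrow down an error like this is to list where every tensor actually lives. The sketch below uses a plain `torch.nn.Linear` as a stand-in model rather than a KAN, and `devices_in_use` is a hypothetical helper. It only sees registered parameters and buffers; a tensor kept as a plain attribute would not show up and would not follow `.to(device)` either, which is one possible source of a stray CPU tensor.

```python
import torch

def devices_in_use(module, tensors=()):
    """Collect the device of every parameter, buffer, and extra tensor."""
    found = {}
    for name, p in module.named_parameters():
        found[f"param:{name}"] = str(p.device)
    for name, b in module.named_buffers():
        found[f"buffer:{name}"] = str(b.device)
    for i, t in enumerate(tensors):
        found[f"tensor:{i}"] = str(t.device)
    return found

layer = torch.nn.Linear(2, 1)   # stand-in for a model; not a KAN
batch = torch.zeros(4, 2)       # stand-in for a dataset tensor
report = devices_in_use(layer, [batch])
print(report)
print("mixed devices:", len(set(report.values())) > 1)
```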

gonzalalGFM avatar May 07 '24 08:05 gonzalalGFM

@AlessandroFlati I tried changing that to mps, but it didn't work (didn't expect it to...). Idk, this is pretty far out of my comfort zone, tbh. Alright, glad to hear a PR is in the pipeline, I'll wait for that. Thanks!

@gonzalalGFM Cheers! Maybe I'll give it a try until the PR gets merged.

rmrfxyz avatar May 07 '24 08:05 rmrfxyz

You should actually change it to cpu, not to mps.

AlessandroFlati avatar May 07 '24 08:05 AlessandroFlati

Also, I'm unable to run any KAN model on the GPU. I send both the dataset and the model to the device (cuda), but it keeps giving me this error:

device = torch.device("cuda")
dataset = {}
dataset["train_input"] = torch.from_numpy(np.array(X_train))
dataset["test_input"] = torch.from_numpy(np.array(X_test))
dataset["train_label"] = torch.from_numpy(np.array(Y_train))
dataset["test_label"] = torch.from_numpy(np.array(Y_test))
for key, value in dataset.items():
    dataset[key] = dataset[key].to(device)
kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=3, k=3, seed=0, device = device)
kan_model.train(dataset, opt="LBFGS", steps=50, lamb=0, lamb_entropy=0, device = device)

Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Also fixed by the PR.

AlessandroFlati avatar May 07 '24 08:05 AlessandroFlati

As you can see, the problem lies in the line self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable), which, in an earlier (bad) attempt at letting people use CUDA, forces the parameter onto CUDA. You can edit that line yourself if you just want to use the CPU, but we should really just wait for the PR to be accepted.

I tried to edit that line using the code from your fork (device = torch.device('cpu')), but I still get AssertionError: Torch not compiled with CUDA enabled.

Justin-12138 avatar May 07 '24 09:05 Justin-12138

That's strange. Could you please create a reproducible gist/snippet so I can try to reproduce your case and extend the PR if needed? That would be very much appreciated!

AlessandroFlati avatar May 07 '24 11:05 AlessandroFlati

That's strange. Could you please create a reproducible gist/snippet so I can try to reproduce your case and extend the PR if needed? That would be very much appreciated!

Sorry, my fault. I had just copied the code in /kan from your fork, thinking you had already edited it. After editing those lines myself, it works, but I get a new error when I run the code below:

dataset = {}
train_input, train_label = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=None)
test_input, test_label = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=None)

dataset['train_input'] = torch.from_numpy(train_input)
dataset['test_input'] = torch.from_numpy(test_input)
dataset['train_label'] = torch.from_numpy(train_label[:, None])
dataset['test_label'] = torch.from_numpy(test_label[:, None])
device = torch.device('cpu')
X = dataset['train_input']
y = dataset['train_label']

plt.scatter(X[:, 0], X[:, 1], c=y[:, 0])

model = KAN(width=[2, 1], grid=3, k=3, device=device)

def train_acc():
    return torch.mean((torch.round(model(dataset['train_input'])[:, 0]) == dataset['train_label'][:, 0]).float())

def test_acc():
    return torch.mean((torch.round(model(dataset['test_input'])[:, 0]) == dataset['test_label'][:, 0]).float())

results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
print(results['train_acc'][-1], results['test_acc'][-1])

got errors like this:

description:   0%|                                                           | 0/20 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\JUSTIN200\Desktop\pykan\example\", line 32, in <module>
    results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\", line 899, in train
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\", line 244, in update_grid_from_samples
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\", line 312, in forward
    x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\nn\modules\", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\nn\modules\", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\", line 175, in forward
    y = coef2curve(x_eval=x, grid=self.grid[self.weight_sharing], coef=self.coef[self.weight_sharing], k=self.k, device=self.device)  # shape (size, batch)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\", line 100, in coef2curve
    y_eval = torch.einsum('ij,ijk->ik', coef, B_batch(x_eval, grid, k, device=device))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\", line 380, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: expected scalar type Double but found Float
OS: Windows 11
torch version: 2.2.2

Justin-12138 avatar May 07 '24 12:05 Justin-12138

I think, as the RuntimeError says, this is a dtype mismatch: either you do not need the cast to float through .float(), or you should cast to double instead.

AlessandroFlati avatar May 07 '24 12:05 AlessandroFlati

I think, as the RuntimeError says, this is a dtype mismatch: either you do not need the cast to float through .float(), or you should cast to double instead.



dataset['train_input'] = torch.from_numpy(train_input).float()
dataset['test_input'] = torch.from_numpy(test_input).float()
dataset['train_label'] = torch.from_numpy(train_label[:, None]).float()
dataset['test_label'] = torch.from_numpy(test_label[:, None]).float()

It works for me

train loss: 1.58e-01 | test loss: 1.62e-01 | reg: 1.94e+00 : 100%|██| 20/20 [00:01<00:00, 16.32it/s]
1.0 0.996999979019165
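
A short sketch of why the `.float()` cast helps, under the assumption that the model's parameters use PyTorch's default float32: `torch.from_numpy` keeps NumPy's dtype, and NumPy float arrays default to float64, so the inputs arrive as Double while the weights are Float. The `torch.nn.Linear` layer below is only a stand-in for a model and is not a KAN.

```python
import numpy as np
import torch

points = np.random.rand(5, 2)                 # NumPy float arrays default to float64
as_double = torch.from_numpy(points)          # dtype is preserved: torch.float64
as_float = torch.from_numpy(points).float()   # explicit cast to torch.float32

weights = torch.nn.Linear(2, 1).weight        # parameters default to torch.float32
print(as_double.dtype, as_float.dtype, weights.dtype)
```

Casting the dataset tensors once, right after `torch.from_numpy`, keeps inputs and parameters in the same dtype for the rest of the run.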

Justin-12138 avatar May 07 '24 13:05 Justin-12138