This will solve CPU-only, CUDA-only and any mix of them.

Open AlessandroFlati opened this issue 9 months ago • 3 comments

This fixes the CUDA problem that appears after fix_symbolic, the CUDA problem in initialize_from_another_model, and the related CPU problem (already mentioned in this PR) that forced the use of CUDA.

AlessandroFlati avatar May 06 '24 17:05 AlessandroFlati

@KindXiaoming This should close many issues related to using CUDA. For it to work properly, I recommend updating requirements.txt to the following:

matplotlib==3.6.2
numpy==1.26.4
scikit-learn==1.4.2
setuptools==69.5.1
sympy==1.11.1
torch==2.2.2
tqdm==4.66.2

Please let me know whether you want me to open another PR or would rather handle this yourself.
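To see whether an environment already matches these pins, a small check along the following lines may help. This is only a sketch: the helper names `parse_pins` and `find_mismatches` are made up for illustration and are not part of pykan. It uses the standard-library `importlib.metadata` and treats a local build suffix such as `+cu121` as matching the plain version, which is an assumption about what counts as "the same" version.

```python
from importlib import metadata

# Pins copied from the list in the comment above.
PINS = """\
matplotlib==3.6.2
numpy==1.26.4
scikit-learn==1.4.2
setuptools==69.5.1
sympy==1.11.1
torch==2.2.2
tqdm==4.66.2
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed)} for packages that differ.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Assumption: ignore local build tags such as '2.2.2+cu121'.
        base = installed.split("+")[0] if installed else None
        if base != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(parse_pins(PINS)).items():
        print(f"{name}: pinned {wanted}, installed {installed or 'not installed'}")
```

An empty result from `find_mismatches` means every pinned package is installed at the pinned version.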

AlessandroFlati avatar May 06 '24 17:05 AlessandroFlati

There's another device missing in https://github.com/KindXiaoming/pykan/blob/master/kan/KAN.py#L205. I've addressed it in my fork at https://github.com/Jim137/pykan/tree/develop. Would you be open to merging my changes and submitting a pull request together?

Jim137 avatar May 06 '24 17:05 Jim137

Good point, I added it.

AlessandroFlati avatar May 06 '24 18:05 AlessandroFlati

I don't know why, but when I use MPS (Apple Silicon) the loss is NaN.

model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device.type);
train loss: nan | test loss: nan | reg: nan : 100%|█████████████████| 20/20 [00:03<00:00,  5.11it/s]

brainer3220 avatar May 07 '24 06:05 brainer3220

@brainer3220 I'm afraid I can't help much with MPS, but this nonetheless seems to be a common issue with MPS in PyTorch (see https://github.com/pytorch/pytorch/issues/112834, for example).

AlessandroFlati avatar May 07 '24 06:05 AlessandroFlati

I am trying to run the given KAN example in Colab with @AlessandroFlati's AlessandroFlati:develop implementation: image

I am still getting the above error. I used the following requirements:

matplotlib==3.6.2
numpy==1.26.4
scikit-learn==1.4.2
setuptools==69.5.1
sympy==1.11.1
torch==2.2.1
tqdm==4.66.2

When I try to run on CPU instead, it says no NVIDIA drivers selected.

Any help to resolve this is appreciated. Thanks!

rajdeepbanerjee-git avatar May 07 '24 09:05 rajdeepbanerjee-git

First, you need to initialize a torch.device like this:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Then pass it to all constructors as device=device.

Finally, you will need to put the dataset tensors on the device like this:

dataset['train_input'] = dataset['train_input'].to(device)
dataset['train_label'] = dataset['train_label'].to(device)
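The two lines above move only the training tensors. If the dataset dict holds more entries (for example `test_input` and `test_label`), a loop over all keys avoids leaving one behind on the CPU. The sketch below is an illustration, not pykan code: `move_dataset` is a hypothetical helper, and it relies only on each value having a `.to(device)` method, as PyTorch tensors do.

```python
def move_dataset(dataset, device):
    """Return a new dict with every tensor-like value moved to `device`.

    Values without a .to() method (for example plain strings) are copied as-is.
    The input dict itself is not modified.
    """
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in dataset.items()
    }


# Typical use with PyTorch, reusing the device line from the comment above:
#   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#   dataset = move_dataset(dataset, device)
```

Returning a new dict rather than mutating in place keeps the original CPU copy available, at the cost of holding both copies in memory.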

SimoSbara avatar May 07 '24 10:05 SimoSbara

Thanks, now I am able to run on colab GPU. But the CPU problem persists.

rajdeepbanerjee-git avatar May 07 '24 10:05 rajdeepbanerjee-git

This pull request solves it. You can try modifying pykan as in these commits:

https://github.com/KindXiaoming/pykan/pull/98/commits/d606bd88bd76f867ef1e2e0780d68fb4f378ce65
https://github.com/KindXiaoming/pykan/pull/98/commits/c857dd65b737ce5f1845555416ddef8ba7865ff8

I had the same problem (#75).

SimoSbara avatar May 07 '24 11:05 SimoSbara

Hi @AlessandroFlati, I would appreciate it if you made another PR for me! Thanks in advance :)

KindXiaoming avatar May 07 '24 12:05 KindXiaoming

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device(type='cuda')

print(torch.cuda.is_available())
True

model.to(device)

dataset['train_input'] = dataset['train_input'].to(device)
dataset['train_label'] = dataset['train_label'].to(device)

but there is still a problem

--> 170 x = torch.einsum('ij,k->ikj', x, torch.ones(self.out_dim, device=self.device)).reshape(batch, self.size).permute(1, 0)
    171 preacts = x.permute(1, 0).clone().reshape(batch, self.out_dim, self.in_dim)
    172 base = self.base_fun(x).permute(1, 0) # shape (batch, size)

File E:\anaconda\envs\4torch2\lib\site-packages\torch\functional.py:380, in einsum(*args)
    375 return einsum(equation, *_operands)
    377 if len(operands) <= 2 or not opt_einsum.enabled:
    378 # the path for contracting 0 or 1 time(s) is already optimized
    379 # or the user has disabled using opt_einsum
--> 380 return _VF.einsum(equation, operands) # type: ignore[attr-defined]
    382 path = None
    383 if opt_einsum.is_available():

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

alpaca202204 avatar May 08 '24 13:05 alpaca202204

You shouldn't just call model.to(device); instead, create both the model and the dataset with the device=device argument. Besides, your dataset is missing the test_input and test_label keys.
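A quick pre-flight check can catch both problems before training starts. The following is a minimal sketch, not part of pykan: `check_dataset` is a hypothetical helper, the four key names are the ones mentioned in this thread, and the check only assumes that each entry exposes a `.device` attribute with a `.type` string, as PyTorch tensors do.

```python
REQUIRED_KEYS = ("train_input", "train_label", "test_input", "test_label")


def check_dataset(dataset, expected_type):
    """Return a list of problems; an empty list means the dataset looks consistent.

    `expected_type` is a device type string such as 'cuda' or 'cpu'.
    Comparing on the type avoids a false mismatch between 'cuda' and 'cuda:0'.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if key not in dataset:
            problems.append(f"missing key: {key}")
            continue
        device = getattr(dataset[key], "device", None)
        actual = getattr(device, "type", device)  # torch.device -> 'cuda' / 'cpu'
        if actual != expected_type:
            problems.append(f"{key} is on {actual}, expected {expected_type}")
    return problems


# Typical use with PyTorch:
#   for problem in check_dataset(dataset, device.type):
#       print(problem)
```

This only inspects the dataset. Tensors created inside the model on a different device, like the one in the traceback above, still have to be fixed in the library code itself.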

AlessandroFlati avatar May 08 '24 14:05 AlessandroFlati