
Segmentation fault

Open ahabedsoltan opened this issue 1 year ago • 6 comments

Hello,

I've been using FALKON, and it functions well on a GPU with a small number of centers. However, when I increase the number of centers to around 64,000, I encounter a "Segmentation fault" error, causing the program to terminate.

Here's the sequence of events during the run:

falkon starts...
MainProcess.MainThread::[Calcuating Preconditioner of size 128000]
Preconditioner will run on 1 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 80.324s
--MainProcess.MainThread::[Cholesky 1]
Using parallel POTRF
--MainProcess.MainThread::[Cholesky 1] complete in 47.152s
--MainProcess.MainThread::[Copy triangular]
--MainProcess.MainThread::[Copy triangular] complete in 17.284s
--MainProcess.MainThread::[LAUUM(CUDA)]
--MainProcess.MainThread::[LAUUM(CUDA)] complete in 56.486s
--MainProcess.MainThread::[Cholesky 2]
Segmentation fault

I'm curious about why this issue arises with a large number of centers. Previously, I've successfully used FALKON with up to 256,000 centers. It seems that with the current updated version, there are issues at this scale. Your assistance in resolving this matter would be greatly appreciated.

ahabedsoltan avatar Dec 24 '23 01:12 ahabedsoltan

We have tried setting the following options, yet the segfault persists:

never_store_kernel=True
chol_force_ooc=True
no_single_kernel=False

parthe avatar Dec 28 '23 06:12 parthe

Here is a minimal working example that reproduces the error reported by @ahabedsoltan:

import falkon, torch

# n training points, N test points, M centers, d features, bw kernel bandwidth
n, N, M, d, bw = 200_000, 1000, 64_000, 1, 1.

accufun = lambda yt, yh: 100 * (yt.argmax(dim=1) == yh.argmax(dim=1)).sum() / yh.shape[0]

options = falkon.FalkonOptions(debug=True,
    never_store_kernel=True,
    chol_force_ooc=True,
    no_single_kernel=False)
kernel_fn_flk = falkon.kernels.LaplacianKernel(sigma=bw, opt=options)
model = falkon.Falkon(kernel=kernel_fn_flk, penalty=1e-6, M=M, options=options,
                      error_every=1, error_fn=accufun, maxiter=1)

X = torch.randn(n, d)
Y = torch.randn(n, d)
x = torch.randn(N, d)
y = torch.randn(N, d)
model.fit(X, Y, Xts=x, Yts=y)
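
For a sense of scale, here is a rough sketch of how much memory a single dense M x M matrix takes at the sizes mentioned in this thread. The helper name `kernel_matrix_bytes` is made up for illustration and is not part of falkon, and float32 is assumed because `torch.randn` returns float32 by default. It says nothing about what caused the crash, only how large the matrices involved are.

```python
def kernel_matrix_bytes(num_centers: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to hold one dense num_centers x num_centers matrix.

    Hypothetical helper for illustration only; 4 bytes per element
    corresponds to float32, 8 to float64.
    """
    return num_centers * num_centers * bytes_per_element


# Sizes that appear in this thread: the reproduction (64,000), the log
# (128,000), and the size that reportedly worked before (256,000).
for m in (64_000, 128_000, 256_000):
    gib = kernel_matrix_bytes(m) / 2**30
    print(f"M={m:>7,}: {gib:8.1f} GiB in float32")
```

Pass `bytes_per_element=8` to get the float64 figures instead.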

parthe avatar Dec 29 '23 06:12 parthe

Hi! I think it was a bug in a small helper function; it should be fixed on master now. Are you comfortable trying it out from master, or would you prefer that I release a new version?

Giodiro avatar Jan 01 '24 15:01 Giodiro

Thank you. Could you please create a pre-built wheel for it? Each time I try to install it using the command 'pip install git+https://github.com/falkonml/falkon.git', the installation fails.

ahabedsoltan avatar Jan 01 '24 20:01 ahabedsoltan

Reinstalling falkon as follows solved the issue. @Giodiro Thanks for the quick bug-fix!

pip uninstall falkon
pip install --no-build-isolation git+https://github.com/FalkonML/falkon.git
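
With `--no-build-isolation`, pip builds the package in the current environment instead of a fresh isolated one, so build-time dependencies (presumably PyTorch in falkon's case) need to be installed already. A small sketch for checking what is installed after the reinstall, using only the standard library; the helper name `installed_version` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent.

    Hypothetical helper; it reads package metadata only and does not
    import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("falkon", "torch"):
    print(name, installed_version(name) or "not installed")
```

Since this reads metadata only, it confirms the package is present but not that its compiled extensions load; rerunning the reproduction script is still the real check.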

parthe avatar Jan 02 '24 07:01 parthe

Thank you, it resolved the issue.

ahabedsoltan avatar Jan 08 '24 22:01 ahabedsoltan