VkFFT 1.3.5
Dear @vincefn,
I have made significant changes to the register assignment logic of VkFFT, which should improve the generated kernels' instruction usage, shared-memory transfers, and occupancy, so I am going to make a new release soon. There have also been some changes to how VkFFT stores user-provided data inside the application, so pyvkfft has to be slightly updated (the buffers are now void* const*, and VkFFT now keeps its own copy of all pointer contents provided by the user during initialization).
My local pyvkfft test suite seems to be running without errors (it is still in progress, though).
The updated files are: vkfft_cuda.txt vkfft_opencl.txt
Best regards, Dmitrii
Forget my previous message - actually you were just missing par.commandQueue = &q; in fft() and ifft(); without that I get a segmentation fault when testing on my MacBook.
@DTolm - some errors found for 1D DST1 with cuda: on an A40, the transform with size 5617 is incorrect. On an H100, the sizes 5617, 7129, 7547, 8126, 9751 and 10081 fail.
e.g. see http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-26-h100cu/pyvkfft-test.html
There are other errors (failure to compile) on H100 using OpenCL for R2C/DCT2/DCT3 double precision 2D and 3D transforms of size 180: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-26-h100cl/pyvkfft-test.html
@DTolm there are also a number of failures/errors on non-radix (Bluestein/Rader) transforms which you can see e.g. in the H100 cuda testsuite: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-26-h100cu/pyvkfft-test.html
@vincefn issues should be fixed now, but I could only test some of the big systems on 3070 for now - for the bigger shared memory size of H100 I only checked that the printed kernels no longer do the same error as before.
OK, I'll run the new tests - this seems to have also fixed the n=2988 DST1 transforms on the AMD gfx900 card..
Tests mostly look good so far. One strange thing is some timeouts on the H100 with cuda, for specific sizes: n=4915 for 1D DST1 (single precision, in and out of place), and n=51450 for 1D R2C (single precision, inplace). That suggests something goes wrong when generating/compiling the kernel:
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-27-h100cu/pyvkfft-test.html
Among other isolated errors:
- on the A100/cuda, the 1D DCT1 transforms of sizes 2558 and 2559 (double precision) fail with a timeout
- on the H100/cuda, it is the sizes 5118 and 5119 (also double precision) which fail with a timeout.
Since those sizes are related by a factor of 2 due to the mapping of the DCT1, I'm assuming these errors have the same cause.
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-27-a100cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-27-h100cu/pyvkfft-test.html
In addition to those and the ones in the previous message, there are a few other timeouts which may be worth investigating (not counting the GTX 1080 or gfx900 - these are small cards which may suffer from the parallel tests):
- A100/opencl C2C 1D n=10 (single, inplace)
- A100/cuda DST1 2D n=641 (single, inplace)
- H100/cuda DCT4 1D n=6258, 6362, 6546 (double, out-of-place)
It may be necessary to run those tests with --serial to see the actual error, if any.
Hm, I tried these systems sequentially on gh200 (it also has an H100 GPU) and on 3070 and they seem to pass. Maybe compilation takes more time for some of these kernels now and it hangs? This doesn't explain what happens with the n=10 failure though. I am not sure how to proceed with this.
The A100 DCT1 n=2558 and 2559 and H100 R2C n=5118 and 5119 cases should be reproducible - they're too closely related. (I forgot to write it in the message above: on the H100 it's not the DCT1 which fails but the R2C, hence the x2 size I mentioned.)
Of course it can be an issue with the toolkit - I'm running with 12.2, I could also test with 12.3 (I have not installed later ones yet).
On the good-news side, the A40 cuda test is pristine so far, and on opencl there's just one timeout.
The timeout only triggers after 30 s, which should normally be much more than needed.
Let's see how this ends, I can retry the tests individually to see if there's any output, and if newer toolkits help.
I re-ran manually some tests which failed on the A40 and A100 (not H100, we only have 1 node with 2 GPUs, still running the test suite). Here's what I got:
A100 opencl => all tests which failed (timeout) pass once serialised :-)
A100/opencl/C2C/1D/inplace/single/lut n=10 timeout
pyvkfft-test --systematic --backend pyopencl --gpu a100 --max-nb-tests 0 --serial --ndim 1 --range 2 15 --radix --inplace --lut --norm 1 --range-mb 0 4100 OK
A100/opencl/DST2/2D/outofplace/single n=3452 timeout
pyvkfft-test --systematic --backend pyopencl --gpu a100 --max-nb-tests 0 --ndim 2 --range 3440 3455 --dst 2 --bluestein --norm 1 --range-mb 0 4100 --serial OK
A100/opencl/R2C/2D/outofplace/double n=18 timeout
pyvkfft-test --systematic --backend pyopencl --gpu a100 --max-nb-tests 0 --serial --ndim 2 --range 14 20 --r2c --radix --double --norm 0 --range-mb 0 4100 OK
A100/opencl/DCT1/2D/outofplace/single n=2976 timeout
pyvkfft-test --systematic --backend pyopencl --gpu a100 --max-nb-tests 0 --serial --ndim 2 --range 2970 2980 --dct 1 --bluestein --norm 1 --range-mb 0 4100 OK
A100 cuda => multiple segfaults
A100/cuda/DST1/2D/out/single/lut n=641 timeout
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 2 --range 635 645 --dst 1 --radix --lut --norm 1 --range-mb 0 4100 Segmentation fault at n=641 - no segfault without LUT!
A100/cuda/DST1/2D/inplace/single/lut n=641 timeout
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 2 --range 635 645 --dst 1 --radix --inplace --lut --norm 1 --range-mb 0 4100 Segmentation fault at n=641 - no segfault without LUT!
A100/cuda/DCT1/1D/out/double n=2558, 2559 timeout
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 1 --range 2550 2565 --dct 1 --bluestein --double --norm 1 --range-mb 0 4100 Segfault at 2558 (same starting directly at 2559)
A100/cuda/DCT1/2D/out/double n=2558, 2559 timeout
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 2 --range 2550 2565 --dct 1 --bluestein --double --norm 1 --range-mb 0 4100 Segfault at 2558 (same starting directly at 2559)
A100/cuda/DST1/1D/out/double n=2556, 2557, 2558 timeout [probably the same errors with ndim=2D, same system]
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 1 --range 2550 2560 --dst 1 --bluestein --double --norm 1 --range-mb 0 4100 Segfault at 2556, 2557, 2558 - but 2559, 2560 pass
A100/cuda/DCT2/1D/out/double n=5107, 5111, 5113, 5114 timeout (and I gave up) NOTE: seems to be the same error for DCT3, DCT4, DST2, DST3, DST4
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 1 --range 5100 5120 --dct 2 --bluestein --double --norm 1 --range-mb 0 4100 Segfault at 5107 (same starting directly at 5111, 5113, 5114)
A40 OpenCL => DCT1 2D accuracy issue
A40/opencl/R2C/2D/double n=3766 timeout
pyvkfft-test --systematic --backend pyopencl --gpu a40 --max-nb-tests 0 --serial --ndim 2 --range 3760 3770 --r2c --bluestein --double --norm 1 --range-mb 0 4100 OK
A40/opencl/DCT1/2D/double n=2901, 3191 accuracy error => same accuracy issue on the H100? Both in- and out-of-place
pyvkfft-test --systematic --backend pyopencl --gpu a40 --max-nb-tests 0 --serial --ndim 2 --range 2898 2905 --dct 1 --bluestein --double --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend pyopencl --gpu a40 --max-nb-tests 0 --serial --ndim 2 --range 3188 3195 --dct 1 --bluestein --double --norm 1 --range-mb 0 4100
Both give an accuracy error (looks like a single incorrect value)
A40/opencl/DCT4/1D/double n=8618, 9290 timeout
pyvkfft-test --systematic --backend pyopencl --gpu a40 --max-nb-tests 0 --serial --ndim 1 --range 8614 8620 --dct 4 --bluestein --double --norm 1 --range-mb 0 4100 OK
pyvkfft-test --systematic --backend pyopencl --gpu a40 --max-nb-tests 0 --serial --ndim 1 --range 9285 9295 --dct 4 --bluestein --double --norm 1 --range-mb 0 4100 OK
Bottom line
Regarding the segmentation faults - not much can be done except re-testing with another toolkit. The A100 segfault at n=641 with LUT but not without is interesting, but again it's a segfault, so not much we can do.
The accuracy issues on DCT1/2D/double n=2901, 3191 (seemingly identical on the H100; the A100 still has 3 tests to go before reaching that one) can probably be investigated?
Wow, thank you so much for this report! I will investigate the failures today. The segmentation faults should be reproducible and fixable. Regarding the timeouts, I am not sure what can cause them.
I will also need to tweak the parameters for Apple Silicon performance a little bit, but now it is possible to make the next-level kernel configurator. It can work like this: the user specifies the actual decomposition of the number, for example 5152=14x15x23. Then the user specifies the minimal number of registers per thread. It can be any number, but it makes sense that it is not lower than the smallest divisor in the decomposition. For example, if we choose 14, VkFFT will generate code for 345 threads doing radix-14, then 322/345 will do radix-15, then 210/345 will do radix-23 (max registers per thread is 23). If we choose 23, VkFFT will generate code for 210 threads doing radix-23, then 173/210 will do 2x radix-14 (and thread 173 will do only one), then 161 threads will do 2x radix-15 (max registers per thread in this case is 30). It gets a bit more tricky when combining this with the Rader transform, but the general idea should be clear. In the last update I tried to come up with a general rule on how to split numbers in the best way possible for a particular GPU, but it is a really hard multi-parameter task, depending on the total number of divisors, the max number of registers used, the ratio between max and min number of registers, the number of active warps, and the number of FFTs done per SM. This can also be a foundation for code inlining in user kernels.
To clarify - when running the test suite, there are two types of 'timeouts':
- real timeouts, when the kernel takes too long to prepare (or the transform takes too long, but compilation is more likely); these pass when testing with --serial (like the A40 OpenCL tests)
- segfaults: since I'm using parallel processes for testing, a segfault is just perceived as a test that never returns, as sketched below. It is probably an issue when compiling (I can't differentiate between compilation and execution of the kernel).
Ah, the kernel configurator sounds nice for low-level tweaking!
Did you also continue working on kernel callbacks? That could be very useful for many applications.
So the issue with 2901 and 3191 was a weird one - there was a round-up of a floating-point number (2901/(double)10 * 100) which resulted in 29011, and that off-by-one was causing out-of-bounds accesses by ~10 elements. I will need to check if this is related to the sizes causing the timeouts.
The kernel callbacks are planned for the future, similar to the inlining, but I am not sure when.
I started to test the new code on the H100 (actually it also fails on the A40) with a test that had previously failed, and I get this error (C2C 1D non-radix float32; the failure is there both with and without LUT):
command:
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 9700 9710 --bluestein --lut --norm 1 --range-mb 0 4100
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) ... Starting 10 tests...
cupy C2C (9700) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.8e-07 ninf=3.2e-07 < 8.0e-06 (0.041) 1 iFFT: n2=2.6e-07 ninf=3.1e-07 < 8.0e-06 (0.039) 1 buf= 0 OK
cupy C2C (9701) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.9e-07 ninf=3.7e-07 < 8.0e-06 (0.047) 1 iFFT: n2=3.1e-07 ninf=3.9e-07 < 8.0e-06 (0.048) 1 buf= 0 OK
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9703),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9704),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9705),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9706),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9707),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9708),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9709),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9710),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
Finished 10 tests in 00h 00m 04s
======================================================================
ERROR: test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='cupy', shape=(np.int64(9703),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pyvkfft/test/test_fft.py", line 1116, in test_systematic
res = test_accuracy_kwargs(v)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pyvkfft/accuracy.py", line 589, in test_accuracy_kwargs
return test_accuracy(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pyvkfft/accuracy.py", line 526, in test_accuracy
n2i, nii = l2(d, d1_gpu.get()), li(d, d1_gpu.get())
^^^^^^^^^^^^
File "cupy/_core/core.pyx", line 1771, in cupy._core.core._ndarray_base.get
File "cupy/_core/core.pyx", line 1858, in cupy._core.core._ndarray_base.get
File "cupy/cuda/memory.pyx", line 586, in cupy.cuda.memory.MemoryPointer.copy_to_host_async
File "cupy_backends/cuda/api/runtime.pyx", line 606, in cupy_backends.cuda.api.runtime.memcpyAsync
File "cupy_backends/cuda/api/runtime.pyx", line 146, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorIllegalInstruction: an illegal instruction was encountered
The following tests then fail with an allocation failure, but that is probably just because CUDA is left in a faulty state.
Note that compared to the previous tests it's still an improvement, as the first failure for that test is now at n=9703, whereas before it was failing at 9458.
I now have access to the GH200 again and ran pyvkfft there with pycuda (I had some trouble installing cupy there):
pyvkfft-test --systematic --backend pycuda --gpu gh200 --max-nb-tests 0 --serial --ndim 1 --range 9700 9710 --bluestein --lut --norm 1 --range-mb 0 4100
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) ... Starting 10 tests...
pycuda C2C (9700) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.7e-07 ninf=3.5e-07 < 8.0e-06 (0.044) 1 iFFT: n2=2.6e-07 ninf=2.7e-07 < 8.0e-06 (0.034) 1 buf= 0 OK
pycuda C2C (9701) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.9e-07 ninf=3.0e-07 < 8.0e-06 (0.038) 1 iFFT: n2=3.1e-07 ninf=3.5e-07 < 8.0e-06 (0.044) 1 buf= 0 OK
pycuda C2C (9703) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.5e-07 ninf=3.4e-07 < 8.0e-06 (0.043) 1 iFFT: n2=3.5e-07 ninf=3.4e-07 < 8.0e-06 (0.043) 1 buf= 0 OK
pycuda C2C (9704) axes= None ndim= 1 B 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.7e-07 ninf=4.1e-07 < 8.0e-06 (0.052) 1 iFFT: n2=3.9e-07 ninf=4.3e-07 < 8.0e-06 (0.053) 1 buf= 0 OK
pycuda C2C (9705) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.9e-07 ninf=4.1e-07 < 8.0e-06 (0.051) 1 iFFT: n2=3.9e-07 ninf=4.3e-07 < 8.0e-06 (0.054) 1 buf= 0 OK
pycuda C2C (9706) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.3e-07 ninf=3.5e-07 < 8.0e-06 (0.044) 1 iFFT: n2=3.3e-07 ninf=3.4e-07 < 8.0e-06 (0.042) 1 buf= 0 OK
pycuda C2C (9707) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.6e-07 ninf=3.6e-07 < 8.0e-06 (0.045) 1 iFFT: n2=3.7e-07 ninf=3.6e-07 < 8.0e-06 (0.045) 1 buf= 0 OK
pycuda C2C (9708) axes= None ndim= 1 B 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.8e-07 ninf=4.4e-07 < 8.0e-06 (0.056) 1 iFFT: n2=3.7e-07 ninf=3.9e-07 < 8.0e-06 (0.049) 1 buf= 0 OK
pycuda C2C (9709) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.9e-07 ninf=3.1e-07 < 8.0e-06 (0.039) 1 iFFT: n2=2.8e-07 ninf=2.9e-07 < 8.0e-06 (0.036) 1 buf= 0 OK
pycuda C2C (9710) axes= None ndim= 1 B 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.6e-07 ninf=4.3e-07 < 8.0e-06 (0.053) 1 iFFT: n2=3.6e-07 ninf=4.0e-07 < 8.0e-06 (0.050) 1 buf= 0 OK
Finished 10 tests in 00h 00m 21s
ok
----------------------------------------------------------------------
Ran 1 test in 21.721s
I believe these failures are unrelated to the 2901 and 3191 issues though.
I double-checked that I did not make any error: it's failing on the A40 (I use the 12.3.1 toolkit there), also with pycuda (which should be equivalent to cupy anyway):
favre@gpu4-04:~/dev/pyvkfft$ pyvkfft-test --systematic --backend pycuda --gpu a40 --max-nb-tests 0 --serial --ndim 1 --range 9700 9710 --bluestein --lut --norm 1 --range-mb 0 4100
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) ... Starting 10 tests...
pycuda C2C (9700) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.8e-07 ninf=3.0e-07 < 8.0e-06 (0.038) 1 iFFT: n2=2.6e-07 ninf=2.6e-07 < 8.0e-06 (0.033) 1 buf= 0 OK
pycuda C2C (9701) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.9e-07 ninf=2.7e-07 < 8.0e-06 (0.034) 1 iFFT: n2=3.1e-07 ninf=2.9e-07 < 8.0e-06 (0.036) 1 buf= 0 OK
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9703),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9704),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9705),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9706),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9707),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9708),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9709),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9710),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False) ... ERROR
Finished 10 tests in 00h 00m 05s
======================================================================
ERROR: test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) (backend='pycuda', shape=(np.int64(9703),), ndim=1, dtype=dtype('float32'), norm=1, use_lut=True, inplace=False, r2c=False, dct=False, dst=False, fstride=False)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pyvkfft/test/test_fft.py", line 1116, in test_systematic
res = test_accuracy_kwargs(v)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pyvkfft/accuracy.py", line 589, in test_accuracy_kwargs
return test_accuracy(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pyvkfft/accuracy.py", line 413, in test_accuracy
n2, ni = l2(d, d1_gpu.get()), li(d, d1_gpu.get())
^^^^^^^^^^^^
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pycuda/gpuarray.py", line 391, in get
_memcpy_discontig(ary, self, async_=async_, stream=stream)
File "/home/esrf/favre/.conda/envs/pyvkfft-test/lib/python3.12/site-packages/pycuda/gpuarray.py", line 1586, in _memcpy_discontig
drv.memcpy_dtoh(dst, src.gpudata)
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
favre@gpu4-04:~/dev/pyvkfft$ pyvkfft-info
pyvkfft version: 2024.2b0 [git: 2024.1.4-3-g8e604e5-dirty]
VkFFT version: 1.3.5 [git: v1.3.4-21-g539be29]
CUDA support: True
CUDA driver version: 12.2.0
CUDA runtime version: 12.3.0
CUDA compile version: 12.3.0
pycuda available: True , version=2024.1.2
cupy available: True , version=13.2.0
#CUDA devices: 1 (pycuda)
0: NVIDIA A40
OpenCL support: True
PyOpenCL version: 2024.2.7
OpenCL platform and devices (GPU only):
platform: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 3.0 CUDA 12.2.148
#GPU devices: 4:
NVIDIA A40
Version: OpenCL 3.0 CUDA
Driver version: 535.183.01
float64 support: True
float16 support: False
platform: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 1.2 LINUX
#GPU devices: 0:
Now that's interesting: I went back and re-used the 12.2.2 toolkit (the driver is 12.2), and this time the tests passed on the A40:
favre@gpu4-04:~$ pyvkfft-info
pyvkfft version: 2024.2b0 [git: 2024.1.4-3-g8e604e5-dirty]
VkFFT version: 1.3.5 [git: v1.3.4-21-g539be29]
CUDA support: True
CUDA driver version: 12.2.0
CUDA runtime version: 12.2.0
CUDA compile version: 12.2.0
pycuda available: True , version=2024.1.2
cupy available: True , version=13.2.0
#CUDA devices: 1 (pycuda)
0: NVIDIA A40
OpenCL support: True
PyOpenCL version: 2024.2.7
OpenCL platform and devices (GPU only):
platform: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 3.0 CUDA 12.2.148
#GPU devices: 4:
NVIDIA A40
Version: OpenCL 3.0 CUDA
Driver version: 535.183.01
float64 support: True
float16 support: False
platform: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 1.2 LINUX
#GPU devices: 0:
favre@gpu4-04:~$ pyvkfft-test --systematic --backend pycuda --gpu a40 --max-nb-tests 0 --serial --ndim 1 --range 9700 9710 --bluestein --lut --norm 1 --range-mb 0 4100
test_systematic (pyvkfft.test.test_fft.TestFFTSystematic.test_systematic) ... Starting 10 tests...
pycuda C2C (9700) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.8e-07 ninf=3.7e-07 < 8.0e-06 (0.046) 1 iFFT: n2=2.6e-07 ninf=3.4e-07 < 8.0e-06 (0.043) 1 buf= 0 OK
pycuda C2C (9701) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.9e-07 ninf=3.2e-07 < 8.0e-06 (0.040) 1 iFFT: n2=3.1e-07 ninf=3.4e-07 < 8.0e-06 (0.043) 1 buf= 0 OK
pycuda C2C (9703) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.6e-07 ninf=4.0e-07 < 8.0e-06 (0.050) 1 iFFT: n2=3.6e-07 ninf=4.1e-07 < 8.0e-06 (0.052) 1 buf= 0 OK
pycuda C2C (9704) axes= None ndim= 1 B 2 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.7e-07 ninf=3.6e-07 < 8.0e-06 (0.045) 1 iFFT: n2=3.9e-07 ninf=4.0e-07 < 8.0e-06 (0.050) 1 buf=151.9kB OK
pycuda C2C (9705) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=4.0e-07 ninf=4.3e-07 < 8.0e-06 (0.054) 1 iFFT: n2=4.1e-07 ninf=4.4e-07 < 8.0e-06 (0.054) 1 buf= 0 OK
pycuda C2C (9706) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.5e-07 ninf=3.8e-07 < 8.0e-06 (0.048) 1 iFFT: n2=3.5e-07 ninf=3.5e-07 < 8.0e-06 (0.044) 1 buf= 0 OK
pycuda C2C (9707) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.9e-07 ninf=4.4e-07 < 8.0e-06 (0.055) 1 iFFT: n2=3.9e-07 ninf=4.7e-07 < 8.0e-06 (0.059) 1 buf= 0 OK
pycuda C2C (9708) axes= None ndim= 1 B 2 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.8e-07 ninf=4.6e-07 < 8.0e-06 (0.058) 1 iFFT: n2=3.7e-07 ninf=3.9e-07 < 8.0e-06 (0.048) 1 buf=151.9kB OK
pycuda C2C (9709) axes= None ndim= 1 R 1 complex64 lut=True inplace=0 norm= 1 C FFT: n2=2.9e-07 ninf=2.9e-07 < 8.0e-06 (0.037) 1 iFFT: n2=2.9e-07 ninf=3.1e-07 < 8.0e-06 (0.039) 1 buf= 0 OK
pycuda C2C (9710) axes= None ndim= 1 B 2 complex64 lut=True inplace=0 norm= 1 C FFT: n2=3.7e-07 ninf=3.9e-07 < 8.0e-06 (0.048) 1 iFFT: n2=3.7e-07 ninf=3.2e-07 < 8.0e-06 (0.041) 1 buf=151.9kB OK
Finished 10 tests in 00h 00m 18s
(I actually ran all the tests from 9400 to 9800 to see)
This is strange because we're supposed to have minor-version compatibility, but it seems that using the 12.3 toolkit introduced an issue.
Well, time to relaunch the testsuite...
Just tried the 3070 with CUDA 11.4, no issues. The GH200 was on 12.4.
A few errors (timeouts, so I don't know the exact error, but they are reproducible) still appear:
A100/cuda/DST1/2D/single/lut n=641, in and out-of-place
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --nproc 16 --ndim 2 --range 2 4500 --dst 1 --radix --lut --norm 1 --range-mb 0 4100
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-30-a100cu/pyvkfft-test.html
H100/cuda/DST1/single/lut n=4915 in and out-of-place
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --nproc 16 --ndim 1 --range 2 100000 --dst 1 --radix --lut --norm 1 --range-mb 0 4100
H100/cuda/R2C/single/lut n=51450 (norm=0 and 1)
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --nproc 16 --ndim 1 --range 2 100000 --r2c --radix --inplace --lut --norm 1 --range-mb 0 4100
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-30-h100cu/pyvkfft-test.html
Hi, looking at further tests on the H100/cuda, and running them individually, I can confirm all the errors so far are segmentation faults (R2C, DCT4, DST1).
Here are the individual command-lines for re-testing:
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 4910 4920 --dst 1 --radix --lut --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 4910 4920 --dst 1 --radix --inplace --lut --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 51000 51500 --r2c --radix --inplace --lut --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 51000 51500 --r2c --radix --inplace --lut --norm 0 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 5115 5120 --r2c --bluestein --double --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 6250 6260 --dct 4 --bluestein --double --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 7130 7140 --r2c --bluestein --inplace --double --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu h100 --max-nb-tests 0 --serial --ndim 1 --range 6255 6260 --dct 4 --bluestein --inplace --double --norm 1 --range-mb 0 4100
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-09-30-h100cu/pyvkfft-test.html
These are the only significant errors apart from the A100 2D DST1 n=641 reported above (but the A100 tests are on hold, as the nodes are busy with higher-priority jobs).
Hello, all these tests passed on the GH200 with CUDA 12.4 (I don't have access to the previous CUDA version on this machine). I will need to check the generated kernels, but these system sizes seem to be single-upload using all available shared memory, so the generated kernels are among the most complex VkFFT can produce.
Hi - good to know they pass on the GH200 - that should mean we can assume a toolkit or driver problem. I may test later with a newer toolkit and the same driver.
On the A100/cuda (in addition to the A100/cuda/DST1/2D/single/lut n=641 reported above), some tests have continued and show the same failures on DCT1 (1D and 2D, double precision) for n=2558, 2559, and on DCT2 1D double precision for n=5107, 5111, 5113, 5114:
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 1 --range 2550 2560 --dct 1 --bluestein --double --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 2 --range 2550 2560 --dct 1 --bluestein --double --norm 1 --range-mb 0 4100
pyvkfft-test --systematic --backend cupy --gpu a100 --max-nb-tests 0 --serial --ndim 1 --range 5100 5120 --dct 2 --bluestein --double --norm 1 --range-mb 0 4100
On the A40/cuda, some failures with DST2 1D (same on DST3):
pyvkfft-test --systematic --backend cupy --gpu a40 --max-nb-tests 0 --serial --ndim 1 --range 6140 6160 --dst 2 --bluestein --lut --norm 1 --range-mb 0 4100
I tried the 3070 (which should be similar to the A40 but with fewer SMs) with CUDA 11.4, 12.3 and 12.6, with drivers 535, 545 and 560, and it passed the DST2 tests on all of them. I don't know what else could be the cause of the failures right now.
I re-tested the different failing systems on rented (cloud) instances with CUDA 12.4 (from the driver), and all of them passed for the A100/H100/A40.
The M1, GTX 1080 and V100 are pristine so far, but those tests will take a few more days to finish.
I have not looked at the convolution part, which had a number of failing systems (https://github.com/vincefn/pyvkfft/issues/33), but I guess it's better to wrap up 1.3.5 first.
So this looks good!
Dear @DTolm, can you make the release, or do you need to make further changes (I'd rather avoid having to re-run all the tests, TBH)?
Hello @vincefn, sorry for the extremely late reply, I am now in the process of finishing my doctorate, so all my time is spent on that and not on polishing the release. The new prime splitter still needs some polishing performance-wise (there are a few situations where it behaves worse than 1.3.4), but it should not produce any errors, as its generality was already verified by the big test run. I hope to find time to do so in December. Once again, thanks for all the testing!
Dear @DTolm, I was wondering if there was any update regarding this new version - I'm running into issues for a new release of pyvkfft, so I'll need to revert some changes made in preparation for VkFFT 1.3.5, but it would be good to have a plan for a new version with the new features?
Following up on https://github.com/vincefn/pyvkfft/issues/39#issuecomment-2514009590 regarding VkFFT 1.3.5 release
I have updated the pyvkfft devel branch to VkFFT e8e6a39 (latest develop), and the test suite is now running.
Results will appear in http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft/ (2025-07-25*) over the next few days.