GPU implementation has high relative error for provided examples
Hi, I think the GPU implementation computes incorrect results.
I built everything with CUDA enabled, and the programs in the examples directory report relative errors high enough that the output is unusable. For comparison, I also built everything for the CPU with CUDA disabled, and those results have relative errors close to 0. Can you fix the CUDA computation errors?
Examples: Here are the results from the CUDA-enabled GPU builds:
fmmtl/examples$ ./error_laplace
FMM in 0.216471 secs
FMM in 0.0459567 secs
FMM in 0.0447383 secs
Computing direct matvec...
Direct in 0.922225 secs
Vector relative error: 0.623028
Average relative error: 0.444972
Maximum relative error: 1.20022
fmmtl/examples$ ./error_biot
FMM in 0.252116 secs
FMM in 0.0779735 secs
FMM in 0.0770103 secs
Computing direct matvec...
Direct in 1.33767 secs
Vector relative error: 0.684924
Average relative error: 0.489844
Maximum relative error: 7.83254
fmmtl/examples$ ./error_barycentric
FMM in 0.152405 secs
FMM in 0.00364061 secs
FMM in 0.00367243 secs
Computing direct matvec...
Direct in 0.163656 secs
Vector relative error: 0.999659
Average relative error: 3.72861
Maximum relative error: 13593
fmmtl/examples$ ./error_img
Initializing source and N = 1048576 targets...
Building the kernel matrix...
Performing the kernel matrix-vector mult...
Computing direct kernel matrix-vector mult...
Computing the errors...
Min log error: -16
Max log error: 0
On the other hand, kNN appears to be correct (0 wrong counts of 1000), which suggests this may be just some kind of scale error:
fmmtl/examples$ ./kNN
Construct: 0.000110769
Traverse: 0.00758732
Computing direct...
Direct: 0.0059448
Wrong counts: 0 of 1000
((0.00202416, 907), (0.00437583, 298), (0.00447187, 507), (0.00450509, 178), (0.00470155, 833))
In contrast, here are the errors from the CPU versions. I built these with make clean && make -j34 error_biot error_laplace error_img NO_CUDA=1, after adding -fPIC to CXXFLAGS. I couldn't compile error_barycentric, but I don't need it.
fmmtl/examples$ ./error_laplace
FMM in 0.0921008 secs
FMM in 0.0676527 secs
FMM in 0.067525 secs
Computing direct matvec...
Direct in 0.895511 secs
Vector relative error: 3.36163e-05
Average relative error: 2.86249e-05
Maximum relative error: 0.000357396
fmmtl/examples$ ./error_biot
FMM in 0.135726 secs
FMM in 0.0971907 secs
FMM in 0.104078 secs
Computing direct matvec...
Direct in 1.3523 secs
Vector relative error: 4.42274e-05
Average relative error: 5.30808e-05
Maximum relative error: 0.00247941
fmmtl/examples$ ./error_img
Initializing source and N = 1048576 targets...
Building the kernel matrix...
Performing the kernel matrix-vector mult...
Computing direct kernel matrix-vector mult...
Computing the errors...
Min log error: -16
Max log error: -1.90081
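For readers comparing the numbers above, here is a minimal sketch of how the three reported figures are commonly defined. This assumes the usual definitions (2-norm ratio, mean per-element ratio, maximum per-element ratio); the examples may compute them differently, and `ErrorSummary` and `relative_errors` are hypothetical names, not fmmtl code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of fmmtl: summarizes how far `approx`
// is from `exact` using three common relative-error measures.
struct ErrorSummary {
  double vector_rel;   // ||approx - exact||_2 / ||exact||_2
  double average_rel;  // mean over i of |approx_i - exact_i| / |exact_i|
  double maximum_rel;  // max over i of the same per-element ratio
};

ErrorSummary relative_errors(const std::vector<double>& exact,
                             const std::vector<double>& approx) {
  // Assumes both vectors are non-empty, equally sized, and that
  // exact[i] != 0 for every i.
  assert(exact.size() == approx.size());
  double diff2 = 0, norm2 = 0, sum_rel = 0, max_rel = 0;
  for (std::size_t i = 0; i < exact.size(); ++i) {
    const double d = std::abs(approx[i] - exact[i]);
    const double rel = d / std::abs(exact[i]);
    diff2 += d * d;               // accumulates ||approx - exact||^2
    norm2 += exact[i] * exact[i]; // accumulates ||exact||^2
    sum_rel += rel;
    max_rel = std::max(max_rel, rel);
  }
  return {std::sqrt(diff2 / norm2), sum_rel / exact.size(), max_rel};
}
```

Under these assumed definitions, a vector relative error of 0.62 means the norm of the difference is about 62% of the norm of the exact result, whereas the CPU value of 3.4e-05 is in the range one would expect from a truncated expansion.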
I'll include more about the unit tests in the next reply.
Here is some info about my OS and hardware:
OS: Ubuntu 18.04
Compiled with Boost 1.65, g++-4.8, and nvcc V10.0.130.
GPU: NVIDIA GeForce GTX 1080, driver 440.82, CUDA driver 10.2 / runtime 10.0, compute capability 6.1.
I also tried replacing all doubles with floats in the codebase (and changing some of the epsilon tolerances accordingly), and tested error_biot. The CPU version matched the correct results above, and the GPU version produced the same erroneous results.
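The float experiment fits the idea that this is not a precision problem. As a rough illustration (a hypothetical sketch, not fmmtl code), accumulating the same 1/r-style terms in float and in double gives a relative difference that is orders of magnitude smaller than the 0.68 reported by the GPU build of error_biot:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration, not fmmtl code: accumulate the same
// 1/r-style terms in float and in double and return the relative
// difference between the two sums.
double float_vs_double_rel_error(int n) {
  assert(n > 0);
  float  sum_f = 0.0f;
  double sum_d = 0.0;
  for (int i = 1; i <= n; ++i) {
    sum_f += 1.0f / static_cast<float>(i);  // single-precision accumulation
    sum_d += 1.0  / static_cast<double>(i); // double-precision reference
  }
  return std::abs(static_cast<double>(sum_f) - sum_d) / sum_d;
}
```

For n = 10000 the returned value stays well below 1e-2, so single-precision rounding by itself would not be expected to produce errors of order 1.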
If you're interested, here's the result from my deviceQuery:
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 10.2 / 10.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8118 MBytes (8512602112 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 79 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Please look into this and correct the GPU implementation!
Thanks, Ante
The CPU and GPU unit tests have very similar results, with the exception of correctness and dual_correctness: the CPU version is wrong on 0 of 10000, while the GPU version is wrong on 10000 of 10000.
Unit tests (GPU version):
compiles: success
correctness: wrong on 10000 of 10000; most errors range from 0.1 to 0.4.
dual_correctness: same results as correctness (wrong on 10000 of 10000), with similar errors.
multi_level:
fmmtl/unit_tests$ ./multi_level
WARNING: Expansion does not have a correct M2T!
WARNING: Expansion does not have a correct M2T!
rexact = 0.589133 -0.200385 -0.200385 -0.200385
rm2t1 = 0 0 0 0
[-0.589133 0.200385 0.200385 0.200385]
rm2t2 = 0 0 0 0
[-0.589133 0.200385 0.200385 0.200385]
rfmm = 0.589147 -0.200343 -0.200343 -0.200343
[1.44709e-05 4.28269e-05 4.28269e-05 4.28269e-05]
single_level:
fmmtl/unit_tests$ ./single_level
WARNING: Expansion does not have a correct M2T!
DIST: (0.8, 0.8, 0.8) : 1.38564
rexact = 0.589133 -0.200385 -0.200385 -0.200385
rm2t = 0 0 0 0
[-0.589133 0.200385 0.200385 0.200385]
rfmm = 0.589056 -0.201209 -0.201209 -0.201209
[-7.64378e-05 -0.00082331 -0.00082331 -0.00082331]
test_balltree: looks fine
test_bbfmm:
fmmtl/unit_tests$ ./test_bbfmm
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
has_init_multipole: 1
has_init_local: 1
has_S2M: 1
has_scalar_S2M: 0
has_vector_S2M: 1
has_S2L: 0
has_scalar_S2L: 0
has_vector_S2L: 0
has_M2M: 1
has_M2L: 1
has_L2L: 1
has_M2T: 0
has_scalar_M2T: 0
has_vector_M2T: 0
has_L2T: 1
has_scalar_L2T: 0
has_vector_L2T: 1
has_dynamic_MAC: 0
FMM in 0.158716 secs
FMM in 0.0063701 secs
FMM in 0.0063367 secs
Computing direct matvec...
Direct in 0.226268 secs
Vector relative error: 5.69574e-01
Average relative error: 3.63698e+00
Maximum relative error: 4.98108e+03
test_direct: fine
test_expansion: Lots of issues. Here is the output for 16LaplaceSpherical, for example:
16LaplaceSpherical:
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
has_init_multipole: 1
has_init_local: 1
has_S2M: 1
has_scalar_S2M: 1
has_vector_S2M: 0
has_S2L: 0
has_scalar_S2L: 0
has_vector_S2L: 0
has_M2M: 1
has_M2L: 1
has_L2L: 1
has_M2T: 0
has_scalar_M2T: 0
has_vector_M2T: 0
has_L2T: 1
has_scalar_L2T: 1
has_vector_L2T: 0
has_dynamic_MAC: 0
test_gpu:
fmmtl/unit_tests$ ./test_gpu
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: no kernel image is available for execution on the device
Aborted (core dumped)
test_kdtree: looks reasonable
test_kernel: Just pasted a few:
10BiotSavart:
0 0 0
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
14RosenheadMoore:
0 0 0
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
test_ndtree: looks reasonable
test_s2t:
fmmtl/unit_tests$ ./test_s2t
CPU-GPU:
Vector relative error: 1
Average relative error: 1
Maximum relative error: 1
CPU-GPU Blocked:
Vector relative error: 1
Average relative error: 1
Maximum relative error: 1
test_vec:
fmmtl/unit_tests$ ./test_vec
Is POD: 0
Is trivial: 0
Is standard layout: 1
0 0 0
0 8 20
21.5407
8.24
4.38972
3.14
1 2.1 3.14 2
version:
fmmtl/unit_tests$ ./version
Using Thrust v1.9
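A note on the test_gpu failure above: "no kernel image is available for execution on the device" is the CUDA runtime error that usually indicates the binary contains no device code compiled for the GPU's architecture, which here is compute capability 6.1. Whether this also explains the wrong numerical results in the other tests is not established. The sketch below shows the nvcc -gencode flag that targets a given capability; `gencode_flag` is a hypothetical helper, and where such a flag would go in the fmmtl build files is an assumption left to the reader.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of fmmtl: builds the nvcc -gencode flag
// for a given compute capability, e.g. (6, 1) for a GTX 1080.
std::string gencode_flag(int major, int minor) {
  // Compute capabilities are written as a major number and a single-digit
  // minor number, so 6.1 becomes compute_61 / sm_61.
  assert(major > 0 && minor >= 0 && minor < 10);
  const std::string cc = std::to_string(major) + std::to_string(minor);
  return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}
```

For the card reported here, gencode_flag(6, 1) yields -gencode arch=compute_61,code=sm_61, which could be compared against the architecture flags the CUDA build actually passes to nvcc.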