Self-test failures on Intel GPU
I'm using the current mfakto (revision 149eedbca284bf0ffef0d8668e98c08dce0ba3ab) on a system with Intel N95 CPU.
For the default build with the default configuration, I'm getting approximately 50% failure rate in the self-test that mfakto runs when invoked without any arguments. The exact number of failing tests varies, it could be 12 failures an 18 successes of the other way around. But the failing kernels always start with cl_barrett32_, for instance
ERROR: self-test failed for M51332417 (cl_barrett32_76_gs)
ERROR: self-test failed for M50896831 (cl_barrett32_76_gs)
ERROR: self-test failed for M51232133 (cl_barrett32_76_gs)
ERROR: self-test failed for M50752613 (cl_barrett32_76_gs)
ERROR: self-test failed for M51507913 (cl_barrett32_76_gs)
ERROR: self-test failed for M51916901 (cl_barrett32_76_gs)
ERROR: self-test failed for M51308501 (cl_barrett32_76_gs)
ERROR: self-test failed for M51671491 (cl_barrett32_76_gs)
ERROR: self-test failed for M48629519 (cl_barrett32_87_gs)
ERROR: self-test failed for M51752893 (cl_barrett32_87_gs)
ERROR: self-test failed for M51760133 (cl_barrett32_87_gs)
ERROR: self-test failed for M51090757 (cl_barrett32_87_gs)
ERROR: self-test failed for M53065231 (cl_barrett32_92_gs)
ERROR: self-test failed for M55069117 (cl_barrett32_76_gs)
ERROR: self-test failed for M45448679 (cl_barrett32_87_gs)
I tried changing options in mfakto.ini. It turns out that changing VectorSize=2 to VectorSize=1 changes the self-test errors. Now there are always 3 errors, always for the same exponents, and they all come from the cl_barrett15_69_gs kernel.
ERROR: self-test failed for M1031831 (cl_barrett15_69_gs)
ERROR: self-test failed for M3321929777 (cl_barrett15_69_gs)
ERROR: self-test failed for M3321930841 (cl_barrett15_69_gs)
I wonder if it would be possible to use VectorSize=2 in cl_barrett15_69_gs and VectorSize=1 elsewhere for the Intel GPUs.
I'm running Fedora Workstation 40 with up-to-date packages. I saw similar issues on Windows and or other systems with Linux.
I'm attaching more detailed logs:
dmesg | grep i915
dmesg-i915.txt
sudo lspci -v
lspci-v.txt
mfakto with VectorSize=2
mfakto-log-default.txt
mfakto with VectorSize=1
mfakto-log-vectorsize1.txt
clinfo
clinfo.txt
rpm -qa | grep -i opencl
opencl-packages.txt
The latest changes by @brubsby should fix the self-test failures. I consistently had four self-tests fail on Intel integrated GPUs on macOS, and Tyler's fork does have this issue.
The latest changes by @brubsby should fix the self-test failures. I consistently had four self-tests fail on Intel integrated GPUs on macOS, and Tyler's fork does have this issue.
I have never had 4 tests fail. It's either 3 test with VectorSize=1 or at least 9 tests with VectorSize=2.
If you mean #39, it makes no difference for me. I had to make a minor fixup to make it compile: https://github.com/brubsby/mfakto/pull/1
Do you mind to post information about the 4 failures you are getting? Do you know which of the commits by @brubsby is fixing them?
Well, this is strange... I just checked out the latest code without Tyler's changes, and the self-test failures on macOS are no longer reproducible. I even checked out a commit from six months ago, and all 30 self-tests still pass.
I suspect this is due to a change in the OpenCL drivers as I did upgrade my MacBook Pro from macOS Mojave to Ventura not long ago. However, these are the four self-tests that consistently failed:
ERROR: self-test failed for M51760133 (cl_barrett32_87_gs)
no factor found
ERROR: self-test failed for M51090757 (cl_barrett32_87_gs)
no factor found
ERROR: self-test failed for M50989481 (cl_barrett32_87_gs)
no factor found
ERROR: self-test failed for M50856937 (cl_barrett32_87_gs)
no factor found
The errors look like what I'm getting with VectorSize=2, but only for cl_barrett32_87_gs. Do you mind to try VectorSize=1?
Speaking of strange things, I believe mfakto was working on the same Intel N95 system when running Windows 10. Then I installed Intel video drivers and the self-test stopped passing. Then I removed the Intel drivers, but the tests kept failing. Of course, it's possible that something else changed at the same time.
I can also reproduce the same behavior on an older system running Fedora 40 with i7 CPU and 00:02.0 VGA compatible controller: Intel Corporation HD Graphics 5500 (rev 09).
Also, if I set SieveOnGPU=0 I get exactly the same errors, but the kernels don't have _gs at the end. So the issue is unlikely to be sieve related.
Using VectorSize=1 resulted in 12 failures on my Intel integrated GPU on macOS Ventura.
ERROR: self-test failed for M1031831 (cl_barrett15_69_gs)
no factor found
ERROR: self-test failed for M51332417 (cl_barrett15_69_gs)
no factor found
ERROR: self-test failed for M50896831 (cl_barrett15_71_gs)
no factor found
ERROR: self-test failed for M50979079 (cl_barrett15_73_gs)
no factor found
ERROR: self-test failed for M51232133 (cl_barrett15_73_gs)
no factor found
ERROR: self-test failed for M50830523 (cl_barrett15_73_gs)
no factor found
ERROR: self-test failed for M50752613 (cl_barrett15_73_gs)
no factor found
ERROR: self-test failed for M51507913 (cl_barrett15_73_gs)
no factor found
ERROR: self-test failed for M51916901 (cl_barrett15_74_gs)
no factor found
ERROR: self-test failed for M3321929777 (cl_barrett15_69_gs)
no factor found
ERROR: self-test failed for M3321930841 (cl_barrett15_69_gs)
no factor found
ERROR: self-test failed for M55069117 (cl_barrett15_69_gs)
no factor found
However, mfakto does warn this is a known issue on AMD devices.
Yes, I can trigger the warning by setting GPUType=GCN:
WARNING: VectorSize=1 is known to fail on AMD GPUs and drivers. If the selftest fails, please increase VectorSize to 2 at least. See http://community.amd.com/thread/167571 for latest news about this issue.
Apparently, Intel is affected as well. Unfortunately, the link is broken, even though I'm registered on the AMD Community site. It would be nice to update the link, but I don't know what to look for.
So basically we have two issues here:
- barrett15 kernels don't work with
VectorSize=1on both AMD and Intel GPUs - barrett32 kernels only work reliably on Intel GPUs with
VectorSize=1(but some users are lucky for obsure reasons)
If either issue is resolved, Intel GPUs can be used with mfakto. Obviously, it would be better to allow a greater VectorSize.
Another approach would be to use different VectorSize for different kernels, but that won't address the performance issue. It might be tricky to implement, as a lot of common code uses VECTOR_SIZE.
I was able to "fix" the issue by commenting out all failing kernels for the INTEL GPUs. The self-test started passing once I commented out every MUL32:
diff --git a/src/mfaktc.c b/src/mfaktc.c
index bbc40ad..72ee3e8 100644
--- a/src/mfaktc.c
+++ b/src/mfaktc.c
@@ -446,14 +446,14 @@ GPUKernels find_fastest_kernel(mystuff_t *mystuff, cl_uint do_test)
{
/* GPU_INTEL, (HD4600, v=4) */
MG62, // "cl_mg_62" (?)
- BARRETT76_MUL32, // "cl_barrett32_76" (26.04 M/s)
- BARRETT77_MUL32, // "cl_barrett32_77" (25.41 M/s)
- BARRETT87_MUL32, // "cl_barrett32_87" (22.13 M/s)
- BARRETT88_MUL32, // "cl_barrett32_88" (21.47 M/s)
- BARRETT79_MUL32, // "cl_barrett32_79" (20.91 M/s)
+ //BARRETT76_MUL32, // "cl_barrett32_76" (26.04 M/s)
+ //BARRETT77_MUL32, // "cl_barrett32_77" (25.41 M/s)
+ //BARRETT87_MUL32, // "cl_barrett32_87" (22.13 M/s)
+ //BARRETT88_MUL32, // "cl_barrett32_88" (21.47 M/s)
+ //BARRETT79_MUL32, // "cl_barrett32_79" (20.91 M/s)
BARRETT69_MUL15, // "cl_barrett15_69" (19.31 M/s)
BARRETT70_MUL15, // "cl_barrett15_70" (19.30 M/s)
- BARRETT92_MUL32, // "cl_barrett32_92" (18.41 M/s)
+ //BARRETT92_MUL32, // "cl_barrett32_92" (18.41 M/s)
BARRETT71_MUL15, // "cl_barrett15_71" (18.22 M/s)
// BARRETT70_MUL24, // "cl_barrett24_70" (17.12 M/s)
BARRETT73_MUL15, // "cl_barrett15_73" (16.31 M/s)
I tried rechecking a recently found factor: mfakto -tf 970894207 75 76
mfakto would fail after several minutes with repeating messages:
Error -5 (Out of resources): Enqueuing kernel(clEnqueueNDRangeKernel) CalcBitToClear
Error -5 (Out of resources): Enqueuing kernel (clEnqueueNDRangeKernel) SegSieve
Error -5 (Out of resources): Enqueuing kernel(clEnqueueNDRangeKernel)
Then I set SieveOnGPU=0. The test was running for a longer time, but reported ETA ~ 11 hours. I didn't finish it.
Using GPU kernel "cl_barrett15_82_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Aug 29 15:11 | 192 4.6% | 42.569 10h49m | 16.66 200000 64.16%
Another approach is to use VectorSize=1 and disable MUL15 kernels up to and including cl_barrett15_74. If that case, I got ETA about 6.5 hours for the same exponent with SieveOnGPU=1. I didn't see the Out of resources error.
Do we have any theories as to why these vectorsize failures are occurring?
- barrett15 kernels don't work with
VectorSize=1on both AMD and Intel GPUs
This one is on my radar, but I haven't had much time to take a deep dive. Any guesses to point me in the right direction would help.
As for the intel gpu only failures, I have an intel igpu, but for some reason I was thinking I wasn't affected, though maybe I can reproduce it.
Well, this is strange... I just checked out the latest code without Tyler's changes, and the self-test failures on macOS are no longer reproducible.
One possible reason is that you might have set your GPU type to one that doesn't use double precision operations, and then used that .ini on the older commit? I don't have a theory for how an openCL upgrade would've fixed it though.
I cloned the code into a new folder, so the INI file is definitely not from an old revision.
I tried Mesa instead of the Intel Compute Runtime, and the failures are gone! That seems to imply that they are in software that runs on the CPU, not in the Intel GPU itself.
I'm on Fedora 40, so I installed the mesa-libOpenCL-24.1.7-1.fc40.x86_64 package. It has two OpenCL platforms, Clover and Rusticl. I could only figure out how to enable Rusticl, it's done by setting RUSTICL_ENABLE=iris in the environment. Surprisingly, all 30 self-tests pass with the default mfakto.ini:
$ RUSTICL_ENABLE=iris ./mfakto -d 11
mfakto 0.16-beta.1 (64-bit build)
Runtime options
INI file mfakto.ini
Verbosity 1
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 24 Kib
GPUSieveSize 96 Mib
FlushInterval 0
WorkFile worktodo.txt
ResultsFile results.txt
WARNING: Cannot read JSONResultsFile from INI file, using default (results.json.txt)
JSONResultsFile results.json.txt
WARNING: Cannot read LogFile from INI file, using default (mfakto.log)
LogFile mfakto.log
Checkpoints enabled
CheckpointDelay 300 s
Stages enabled
StopAfterFactor class
PrintMode compact
WARNING: Cannot read Logging from INI file, set to 0 by default
Logging disabled
V5UserID none
ComputerID none
TimeStampInResults yes
VectorSize 2
GPUType AUTO
SmallExp no
UseBinfile mfakto_Kernels.elf
Compile-time options
Select device - WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
Get device info:
OpenCL device info
name Mesa Intel(R) Graphics (ADL-N) (Intel)
device (driver) version OpenCL 3.0 (24.1.7)
maximum threads per block 1024
maximum threads per grid 1073741824
number of multiprocessors 1 (1 compute elements)
clock rate 400 MHz
Automatic parameters
threads per grid 0
optimizing kernels for INTEL
Loading binary kernel file mfakto_Kernels.elf
Compiling kernels.
GPUSievePrimes (adjusted) 81206
GPUsieve minimum exponent 1037054
Started a simple self-test ...
######### test case 30/30 (M45448679[81-82]) ###########
Self-test statistics
number of tests 30
successful tests 30
self-test PASSED!
Can't open workfile worktodo.txt
ERROR: get_next_assignment(): can't open "worktodo.txt"
Moreover, ./mfakto -st passes.
However, the 15-bit barrett kernels fail with VectorSize=1 as before. Maybe there is something fundamentally wrong with their code.
I tried some actual factoring with Rusticl. Unfortunately, a few minutes into the run, messages appear that the GPU has been reset, then mfakto starts running at an unrealistic speed (>200000 GHz-d/d), finishes in seconds and reports no factor even where there is a factor.
The issue with 15-bit barrett kernels and VectorSize=1 is fixed in #46. It's unrelated to the hardware, I can reproduce it in pocl.
The issue with 32-bit barrett kernels is specific to Intel.
My progress so far can be seen in the intel-fix-wip1 branch in my fork: https://github.com/proski/mfakto/commits/intel-fix-wip1/
I limited selftests to one test (M50896831) and set SieveOnGPU=0. That makes the failure reproducible every time. The remaining settings are default.
I changed mod_simple_even_96_and_check_big_factor96 to return whether it has found a factor.
With rusticl, the factor is found for f.y=26:40b94688:5f612197:
factor: found=1, tid=1492488, a.x=c9:1c2b6094:f0c9939a, f.x=26:40b94464:e303f40f a.y=bf:439e60a9:dce5a7f4, f.y=26:40b94688:5f612197
With Intel OpenCL, the same value of f.y is seen, but a.y is apparently corrupted. Everything else is the same, even the tid.
factor: found=0, tid=1492488, a.x=c9:1c2b6094:f0c9939a, f.x=26:40b94464:e303f40f a.y=e6ee5f33:a99eb0e8:e2a03850, f.y=26:40b94688:5f612197
ERROR: self-test failed for M50896831 (cl_barrett32_76)
I plan to add more tracing above that point to see where the data gets corrupted.
I see that calling shl_96 from check_barrett32_76 introduces some randomness with Intel OpenCL. The result of the shift is wrong and unpredictable. I tried inlining shl_96, replacing amd_bitalign with shifts and using temporary variables. It appears that (a->d2 << 1) | (a->d1 >> 31) is not properly understood by the compiler. Using temporary variables make the calculation deterministic, but the factor is still not found.
The fix (WIP) is in https://github.com/primesearch/mfakto/pull/15 There is indeed some issue with vector shifts in the Intel OpenCL implementation. Closing this issue in favor of the PR.