mfakto Self-test failures on Intel GPU

I'm using the current mfakto (revision 149eedbca284bf0ffef0d8668e98c08dce0ba3ab) on a system with Intel N95 CPU.

For the default build with the default configuration, I'm getting approximately 50% failure rate in the self-test that mfakto runs when invoked without any arguments. The exact number of failing tests varies, it could be 12 failures an 18 successes of the other way around. But the failing kernels always start with cl_barrett32_, for instance

ERROR: self-test failed for M51332417 (cl_barrett32_76_gs)
ERROR: self-test failed for M50896831 (cl_barrett32_76_gs)
ERROR: self-test failed for M51232133 (cl_barrett32_76_gs)
ERROR: self-test failed for M50752613 (cl_barrett32_76_gs)
ERROR: self-test failed for M51507913 (cl_barrett32_76_gs)
ERROR: self-test failed for M51916901 (cl_barrett32_76_gs)
ERROR: self-test failed for M51308501 (cl_barrett32_76_gs)
ERROR: self-test failed for M51671491 (cl_barrett32_76_gs)
ERROR: self-test failed for M48629519 (cl_barrett32_87_gs)
ERROR: self-test failed for M51752893 (cl_barrett32_87_gs)
ERROR: self-test failed for M51760133 (cl_barrett32_87_gs)
ERROR: self-test failed for M51090757 (cl_barrett32_87_gs)
ERROR: self-test failed for M53065231 (cl_barrett32_92_gs)
ERROR: self-test failed for M55069117 (cl_barrett32_76_gs)
ERROR: self-test failed for M45448679 (cl_barrett32_87_gs)

I tried changing options in mfakto.ini. It turns out that changing VectorSize=2 to VectorSize=1 changes the self-test errors. Now there are always 3 errors, always for the same exponents, and they all come from the cl_barrett15_69_gs kernel.

ERROR: self-test failed for M1031831 (cl_barrett15_69_gs)
ERROR: self-test failed for M3321929777 (cl_barrett15_69_gs)
ERROR: self-test failed for M3321930841 (cl_barrett15_69_gs)

I wonder if it would be possible to use VectorSize=2 in cl_barrett15_69_gs and VectorSize=1 elsewhere for the Intel GPUs.

I'm running Fedora Workstation 40 with up-to-date packages. I saw similar issues on Windows and or other systems with Linux.

I'm attaching more detailed logs:

dmesg | grep i915 dmesg-i915.txt

sudo lspci -v lspci-v.txt

mfakto with VectorSize=2 mfakto-log-default.txt

mfakto with VectorSize=1 mfakto-log-vectorsize1.txt

clinfo clinfo.txt

rpm -qa | grep -i opencl opencl-packages.txt

Aug 27 '24 06:08 proski

The latest changes by @brubsby should fix the self-test failures. I consistently had four self-tests fail on Intel integrated GPUs on macOS, and Tyler's fork does have this issue.

Aug 27 '24 23:08 ixfd64

The latest changes by @brubsby should fix the self-test failures. I consistently had four self-tests fail on Intel integrated GPUs on macOS, and Tyler's fork does have this issue.

I have never had 4 tests fail. It's either 3 test with VectorSize=1 or at least 9 tests with VectorSize=2.

If you mean #39, it makes no difference for me. I had to make a minor fixup to make it compile: https://github.com/brubsby/mfakto/pull/1

Do you mind to post information about the 4 failures you are getting? Do you know which of the commits by @brubsby is fixing them?

Aug 28 '24 01:08 proski

Well, this is strange... I just checked out the latest code without Tyler's changes, and the self-test failures on macOS are no longer reproducible. I even checked out a commit from six months ago, and all 30 self-tests still pass.

I suspect this is due to a change in the OpenCL drivers as I did upgrade my MacBook Pro from macOS Mojave to Ventura not long ago. However, these are the four self-tests that consistently failed:

ERROR: self-test failed for M51760133 (cl_barrett32_87_gs)
no factor found
ERROR: self-test failed for M51090757 (cl_barrett32_87_gs)
no factor found
ERROR: self-test failed for M50989481 (cl_barrett32_87_gs)
no factor found
ERROR: self-test failed for M50856937 (cl_barrett32_87_gs)
no factor found

Aug 28 '24 02:08 ixfd64

The errors look like what I'm getting with VectorSize=2, but only for cl_barrett32_87_gs. Do you mind to try VectorSize=1?

Speaking of strange things, I believe mfakto was working on the same Intel N95 system when running Windows 10. Then I installed Intel video drivers and the self-test stopped passing. Then I removed the Intel drivers, but the tests kept failing. Of course, it's possible that something else changed at the same time.

I can also reproduce the same behavior on an older system running Fedora 40 with i7 CPU and 00:02.0 VGA compatible controller: Intel Corporation HD Graphics 5500 (rev 09).

Also, if I set SieveOnGPU=0 I get exactly the same errors, but the kernels don't have _gs at the end. So the issue is unlikely to be sieve related.

Aug 28 '24 03:08 proski

Using VectorSize=1 resulted in 12 failures on my Intel integrated GPU on macOS Ventura.

ERROR: self-test failed for M1031831 (cl_barrett15_69_gs)     
  no factor found
ERROR: self-test failed for M51332417 (cl_barrett15_69_gs)     
  no factor found
ERROR: self-test failed for M50896831 (cl_barrett15_71_gs)     
  no factor found
ERROR: self-test failed for M50979079 (cl_barrett15_73_gs)     
  no factor found
ERROR: self-test failed for M51232133 (cl_barrett15_73_gs)     
  no factor found
ERROR: self-test failed for M50830523 (cl_barrett15_73_gs)     
  no factor found
ERROR: self-test failed for M50752613 (cl_barrett15_73_gs)     
  no factor found
ERROR: self-test failed for M51507913 (cl_barrett15_73_gs)     
  no factor found
ERROR: self-test failed for M51916901 (cl_barrett15_74_gs)     
  no factor found
ERROR: self-test failed for M3321929777 (cl_barrett15_69_gs)     
  no factor found
ERROR: self-test failed for M3321930841 (cl_barrett15_69_gs)     
  no factor found
ERROR: self-test failed for M55069117 (cl_barrett15_69_gs)     
  no factor found

However, mfakto does warn this is a known issue on AMD devices.

Aug 28 '24 04:08 ixfd64

Yes, I can trigger the warning by setting GPUType=GCN:

WARNING: VectorSize=1 is known to fail on AMD GPUs and drivers. If the selftest fails, please increase VectorSize to 2 at least. See http://community.amd.com/thread/167571 for latest news about this issue.

Apparently, Intel is affected as well. Unfortunately, the link is broken, even though I'm registered on the AMD Community site. It would be nice to update the link, but I don't know what to look for.

So basically we have two issues here:

barrett15 kernels don't work with VectorSize=1 on both AMD and Intel GPUs
barrett32 kernels only work reliably on Intel GPUs with VectorSize=1 (but some users are lucky for obsure reasons)

If either issue is resolved, Intel GPUs can be used with mfakto. Obviously, it would be better to allow a greater VectorSize.

Another approach would be to use different VectorSize for different kernels, but that won't address the performance issue. It might be tricky to implement, as a lot of common code uses VECTOR_SIZE.

Aug 28 '24 07:08 proski

I was able to "fix" the issue by commenting out all failing kernels for the INTEL GPUs. The self-test started passing once I commented out every MUL32:

diff --git a/src/mfaktc.c b/src/mfaktc.c
index bbc40ad..72ee3e8 100644
--- a/src/mfaktc.c
+++ b/src/mfaktc.c
@@ -446,14 +446,14 @@ GPUKernels find_fastest_kernel(mystuff_t *mystuff, cl_uint do_test)
     {
 /*  GPU_INTEL, (HD4600, v=4) */
       MG62,             // "cl_mg_62"        (?)
-      BARRETT76_MUL32,  // "cl_barrett32_76" (26.04 M/s)
-      BARRETT77_MUL32,  // "cl_barrett32_77" (25.41 M/s)
-      BARRETT87_MUL32,  // "cl_barrett32_87" (22.13 M/s)
-      BARRETT88_MUL32,  // "cl_barrett32_88" (21.47 M/s)
-      BARRETT79_MUL32,  // "cl_barrett32_79" (20.91 M/s)
+      //BARRETT76_MUL32,  // "cl_barrett32_76" (26.04 M/s)
+      //BARRETT77_MUL32,  // "cl_barrett32_77" (25.41 M/s)
+      //BARRETT87_MUL32,  // "cl_barrett32_87" (22.13 M/s)
+      //BARRETT88_MUL32,  // "cl_barrett32_88" (21.47 M/s)
+      //BARRETT79_MUL32,  // "cl_barrett32_79" (20.91 M/s)
       BARRETT69_MUL15,  // "cl_barrett15_69" (19.31 M/s)
       BARRETT70_MUL15,  // "cl_barrett15_70" (19.30 M/s)
-      BARRETT92_MUL32,  // "cl_barrett32_92" (18.41 M/s)
+      //BARRETT92_MUL32,  // "cl_barrett32_92" (18.41 M/s)
       BARRETT71_MUL15,  // "cl_barrett15_71" (18.22 M/s)
 //      BARRETT70_MUL24,  // "cl_barrett24_70" (17.12 M/s)
       BARRETT73_MUL15,  // "cl_barrett15_73" (16.31 M/s)

I tried rechecking a recently found factor: mfakto -tf 970894207 75 76

mfakto would fail after several minutes with repeating messages:

Error -5 (Out of resources): Enqueuing kernel(clEnqueueNDRangeKernel) CalcBitToClear
Error -5 (Out of resources): Enqueuing kernel (clEnqueueNDRangeKernel) SegSieve
Error -5 (Out of resources): Enqueuing kernel(clEnqueueNDRangeKernel)

Then I set SieveOnGPU=0. The test was running for a longer time, but reported ETA ~ 11 hours. I didn't finish it.

Using GPU kernel "cl_barrett15_82_2"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Aug 29 15:11 |  192   4.6% | 42.569  10h49m |     16.66   200000   64.16%

Another approach is to use VectorSize=1 and disable MUL15 kernels up to and including cl_barrett15_74. If that case, I got ETA about 6.5 hours for the same exponent with SieveOnGPU=1. I didn't see the Out of resources error.

Aug 29 '24 22:08 proski

Do we have any theories as to why these vectorsize failures are occurring?

barrett15 kernels don't work with VectorSize=1 on both AMD and Intel GPUs

This one is on my radar, but I haven't had much time to take a deep dive. Any guesses to point me in the right direction would help.

As for the intel gpu only failures, I have an intel igpu, but for some reason I was thinking I wasn't affected, though maybe I can reproduce it.

Well, this is strange... I just checked out the latest code without Tyler's changes, and the self-test failures on macOS are no longer reproducible.

One possible reason is that you might have set your GPU type to one that doesn't use double precision operations, and then used that .ini on the older commit? I don't have a theory for how an openCL upgrade would've fixed it though.

Aug 30 '24 13:08 brubsby

I cloned the code into a new folder, so the INI file is definitely not from an old revision.

Aug 30 '24 17:08 ixfd64

I tried Mesa instead of the Intel Compute Runtime, and the failures are gone! That seems to imply that they are in software that runs on the CPU, not in the Intel GPU itself.

I'm on Fedora 40, so I installed the mesa-libOpenCL-24.1.7-1.fc40.x86_64 package. It has two OpenCL platforms, Clover and Rusticl. I could only figure out how to enable Rusticl, it's done by setting RUSTICL_ENABLE=iris in the environment. Surprisingly, all 30 self-tests pass with the default mfakto.ini:

$ RUSTICL_ENABLE=iris ./mfakto -d 11
mfakto 0.16-beta.1 (64-bit build)


Runtime options
  INI file                  mfakto.ini
  Verbosity                 1
  SieveOnGPU                yes
  MoreClasses               yes
  GPUSievePrimes            81157
  GPUSieveProcessSize       24 Kib
  GPUSieveSize              96 Mib
  FlushInterval             0
  WorkFile                  worktodo.txt
  ResultsFile               results.txt
WARNING: Cannot read JSONResultsFile from INI file, using default (results.json.txt)
  JSONResultsFile           results.json.txt
WARNING: Cannot read LogFile from INI file, using default (mfakto.log)
  LogFile                   mfakto.log
  Checkpoints               enabled
  CheckpointDelay           300 s
  Stages                    enabled
  StopAfterFactor           class
  PrintMode                 compact
WARNING: Cannot read Logging from INI file, set to 0 by default
  Logging                   disabled
  V5UserID                  none
  ComputerID                none
  TimeStampInResults        yes
  VectorSize                2
  GPUType                   AUTO
  SmallExp                  no
  UseBinfile                mfakto_Kernels.elf
Compile-time options

Select device - WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
Get device info:

OpenCL device info
  name                      Mesa Intel(R) Graphics (ADL-N) (Intel)
  device (driver) version   OpenCL 3.0  (24.1.7)
  maximum threads per block 1024
  maximum threads per grid  1073741824
  number of multiprocessors 1 (1 compute elements)
  clock rate                400 MHz

Automatic parameters
  threads per grid          0
  optimizing kernels for    INTEL

Loading binary kernel file mfakto_Kernels.elf
Compiling kernels.
  GPUSievePrimes (adjusted) 81206
  GPUsieve minimum exponent 1037054
Started a simple self-test ...
######### test case 30/30 (M45448679[81-82]) ###########
Self-test statistics
  number of tests           30
  successful tests          30

self-test PASSED!

Can't open workfile worktodo.txt
ERROR: get_next_assignment(): can't open "worktodo.txt"

Moreover, ./mfakto -st passes.

However, the 15-bit barrett kernels fail with VectorSize=1 as before. Maybe there is something fundamentally wrong with their code.

I tried some actual factoring with Rusticl. Unfortunately, a few minutes into the run, messages appear that the GPU has been reset, then mfakto starts running at an unrealistic speed (>200000 GHz-d/d), finishes in seconds and reports no factor even where there is a factor.

Aug 31 '24 20:08 proski

The issue with 15-bit barrett kernels and VectorSize=1 is fixed in #46. It's unrelated to the hardware, I can reproduce it in pocl.

The issue with 32-bit barrett kernels is specific to Intel.

Sep 26 '24 08:09 proski

My progress so far can be seen in the intel-fix-wip1 branch in my fork: https://github.com/proski/mfakto/commits/intel-fix-wip1/

I limited selftests to one test (M50896831) and set SieveOnGPU=0. That makes the failure reproducible every time. The remaining settings are default.

I changed mod_simple_even_96_and_check_big_factor96 to return whether it has found a factor.

With rusticl, the factor is found for f.y=26:40b94688:5f612197:

factor: found=1, tid=1492488, a.x=c9:1c2b6094:f0c9939a, f.x=26:40b94464:e303f40f a.y=bf:439e60a9:dce5a7f4, f.y=26:40b94688:5f612197

With Intel OpenCL, the same value of f.y is seen, but a.y is apparently corrupted. Everything else is the same, even the tid.

factor: found=0, tid=1492488, a.x=c9:1c2b6094:f0c9939a, f.x=26:40b94464:e303f40f a.y=e6ee5f33:a99eb0e8:e2a03850, f.y=26:40b94688:5f612197
ERROR: self-test failed for M50896831 (cl_barrett32_76)

I plan to add more tracing above that point to see where the data gets corrupted.

Oct 02 '24 08:10 proski

I see that calling shl_96 from check_barrett32_76 introduces some randomness with Intel OpenCL. The result of the shift is wrong and unpredictable. I tried inlining shl_96, replacing amd_bitalign with shifts and using temporary variables. It appears that (a->d2 << 1) | (a->d1 >> 31) is not properly understood by the compiler. Using temporary variables make the calculation deterministic, but the factor is still not found.

Oct 03 '24 07:10 proski

The fix (WIP) is in https://github.com/primesearch/mfakto/pull/15 There is indeed some issue with vector shifts in the Intel OpenCL implementation. Closing this issue in favor of the PR.

Dec 26 '24 07:12 proski