
[BUG]: CL exceptions in multi-threaded setup

Open mlemanczyk opened this issue 5 months ago • 1 comments

Describe the bug

Hello, I'd like to ask for help resolving a multi-threading issue. It can be reproduced with the source code at https://github.com/mlemanczyk/even-perfect-numbers-scanner/tree/codex/analyze-gc-instances-in-checkdivisors-method.

You can use the CSV test file as input to EvenPerfectBitScanner. It is inside the sorted-primes.zip file in the root of the repository.

Just run it with the parameters: .\EvenPerfectBitScanner.exe --increment=add --filter-p=./sorted-primes.csv --use-orders=false --prime=31 --max-prime=140000000 --mersenne=bydivisor --divisor-cycles-device=cpu --mersenne-device=cpu --order-device=gpu --primes-device=gpu --divisor-cycles-batch=131072 --gpu-prime-batch=1024 --threads=10240 --gpu-prime-threads=20480 --write-batch-size=1 --bydivisor-deltas-device=gpu --bydivisor-montgomery-device=cpu --block-size=6 --test

Depending on your hardware, you may need to adjust the number of rolling accelerators in PerfectNumberConstants.cs

I'm running it with thousands of threads, e.g. 16_384 or more. I'd like to share the accelerators between threads, giving each thread its own input/output device buffers and its own stream. But whenever I use multiple streams with one accelerator, sooner or later I run into a memory access violation, or into a CL exception during a copy to the device or a kernel launch. I've tried adding locks everywhere, especially around device memory allocations, but nothing seems to help. I'm unsure whether this is caused by ILGPU, the AMD drivers, or my own code.

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:
   at ILGPU.Runtime.OpenCL.CLAPI_0.clEnqueueFillBuffer_Import(IntPtr, IntPtr, Void*, IntPtr, IntPtr, IntPtr, Int32, IntPtr*, IntPtr*)
   at ILGPU.Runtime.OpenCL.CLAPI_0.clEnqueueFillBuffer(IntPtr, IntPtr, Void*, IntPtr, IntPtr, IntPtr, Int32, IntPtr*, IntPtr*)
   at ILGPU.Runtime.OpenCL.CLAPI.FillBuffer[[System.Byte, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](ILGPU.Runtime.AcceleratorStream, IntPtr, Byte, IntPtr, IntPtr)
   at ILGPU.Runtime.OpenCL.CLMemoryBuffer.CLMemSet[[System.Byte, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](ILGPU.Runtime.OpenCL.CLStream, Byte, ILGPU.ArrayView`1<Byte> ByRef)
   at ILGPU.Runtime.OpenCL.CLMemoryBuffer.MemSet(ILGPU.Runtime.AcceleratorStream, Byte, ILGPU.ArrayView`1<Byte> ByRef)
   at ILGPU.Runtime.MemoryBuffer.MemSet(ILGPU.Runtime.AcceleratorStream, Byte, Int64, Int64)
   at ILGPU.Runtime.ArrayViewExtensions.MemSet[[ILGPU.ArrayView`1[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], ILGPU, Version=1.5.3.0, Culture=neutral, PublicKeyToken=null]](ILGPU.ArrayView`1<Int32>, ILGPU.Runtime.AcceleratorStream, Byte)
   at ILGPU.Runtime.ArrayViewExtensions.MemSetToZero[[ILGPU.ArrayView`1[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], ILGPU, Version=1.5.3.0, Culture=neutral, PublicKeyToken=null]](ILGPU.ArrayView`1<Int32>, ILGPU.Runtime.AcceleratorStream)
   at ILGPU.Runtime.ArrayViewExtensions.MemSetToZero[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](ILGPU.Runtime.ArrayView1D`2<Int32,Dense>, ILGPU.Runtime.AcceleratorStream)
   at PerfectNumbers.Core.HeuristicCombinedPrimeTester.HeuristicTrialDivisionGpuDetectsDivisor(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, UInt64, Byte)
   at PerfectNumbers.Core.HeuristicCombinedPrimeTester.IsPrimeGpu(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, UInt64, Byte)
Finished processing 12611
Processing 12613
   at PerfectNumbers.Core.HeuristicCombinedPrimeTester.IsPrimeGpu(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64)
   at PerfectNumbers.Core.PrimeOrderCalculator.PartialFactor(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, PrimeOrderSearchConfig ByRef)
   at PerfectNumbers.Core.PrimeOrderCalculator.CalculateInternal(UInt64, System.Nullable`1<UInt64>, PerfectNumbers.Core.MontgomeryDivisorData ByRef, PrimeOrderSearchConfig ByRef)
   at PerfectNumbers.Core.PrimeOrderCalculator.Calculate(UInt64, System.Nullable`1<UInt64>, PerfectNumbers.Core.MontgomeryDivisorData ByRef, PrimeOrderSearchConfig ByRef, PrimeOrderHeuristicDevice)
   at PerfectNumbers.Core.MersenneDivisorCycles.TryCalculateCycleLengthForExponentCpu(PerfectNumbers.Core.Gpu.Accelerators.PrimeOrderCalculatorAccelerator, UInt64, UInt64, PerfectNumbers.Core.MontgomeryDivisorData ByRef, UInt64 ByRef, Boolean ByRef)
   at PerfectNumbers.Core.Cpu.MersenneNumberDivisorByDivisorCpuTester.CheckDivisors64(UInt64, UInt64, UInt64, UInt64, UInt16, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Boolean ByRef)
   at PerfectNumbers.Core.Cpu.MersenneNumberDivisorByDivisorCpuTester.CheckDivisors(UInt64, UInt64, Boolean ByRef)
   at PerfectNumbers.Core.Cpu.MersenneNumberDivisorByDivisorCpuTester.IsPrime(UInt64, Boolean ByRef)
   at PerfectNumbers.Core.MersenneNumberDivisorByDivisorTester+<>c__DisplayClass0_0.<Run>g__ProcessPrime|0(UInt64)
   at PerfectNumbers.Core.MersenneNumberDivisorByDivisorTester+<>c__DisplayClass0_2.<Run>b__1()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
   at System.Threading.Tasks.Task.ExecuteEntry()
   at PerfectNumbers.Core.UnboundedTaskScheduler.ExecuteTask(System.Threading.Tasks.Task)
   at PerfectNumbers.Core.TaskThreadPool.WorkerLoop()

May I ask for your help and input on this?

Environment

  • ILGPU version: 1.5.3
  • .NET version: .NET 8
  • Operating system: Windows 11
  • Hardware (if GPU-related): AMD Ryzen 7 notebook with integrated graphics, 20 GB RAM.

Steps to reproduce

  1. Compile the solution
  2. Unzip sorted-primes.zip into the bin folder of the EvenPerfectBitScanner application
  3. Run EvenPerfectBitScanner with the following parameters: .\EvenPerfectBitScanner.exe --increment=add --filter-p=./sorted-primes.csv --use-orders=false --prime=31 --max-prime=140000000 --mersenne=bydivisor --divisor-cycles-device=cpu --mersenne-device=cpu --order-device=gpu --primes-device=gpu --divisor-cycles-batch=131072 --gpu-prime-batch=1024 --threads=10240 --gpu-prime-threads=20480 --write-batch-size=1 --bydivisor-deltas-device=gpu --bydivisor-montgomery-device=cpu --block-size=6 --test

Expected behavior

Multiple threads can use multiple streams on shared accelerators (one stream per thread, many threads per accelerator) without exceptions.

Additional context

No response

mlemanczyk, Nov 19 '25 16:11

Hi @mlemanczyk.

Your project is very large. Are you able to provide a simple example that reproduces the issue?

Alternatively, try running on the CPU accelerator, to see if it triggers any assertions.

MoFtZ, Dec 10 '25 02:12

I'll try to replicate it with a small sample. I'm under the strong impression that kernels are not thread-safe, in the sense that you need one kernel per thread. Even putting locks around the kernel launches didn't resolve the issue. At the same time, it happens not only during kernel launches but also when copying data between the CPU and the device. That's really the only issue preventing me from running more calculations on the GPU. I'm also using multiple accelerators to mitigate the issue, but I can't create more than 298 accelerators, and the fewer accelerators I use, the more likely it is to happen. It could be some simple mistake on my side.

mlemanczyk, Dec 12 '25 00:12
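
The reduced reproduction requested in the thread could start from something like the sketch below. It is only an illustration, assuming ILGPU 1.5.x: several tasks share one accelerator, each with its own stream and buffer, and each repeatedly calls `MemSetToZero` (the call at the top of the reported stack trace), launches a kernel, and copies the result back. The class name, `FillKernel`, and the constants `Length`, `Workers`, and `Iterations` are hypothetical and not taken from the project. Whether ILGPU guarantees that this access pattern is thread-safe is exactly the open question, so a crash here would narrow the problem down rather than prove where the fault lies.

```csharp
using System;
using System.Threading.Tasks;
using ILGPU;
using ILGPU.Runtime;

public static class SharedAcceleratorRepro
{
    // Hypothetical sizes; raise them to increase contention.
    const int Length = 1024;
    const int Workers = 64;
    const int Iterations = 1000;

    // Hypothetical kernel: writes each element's own index into the buffer.
    static void FillKernel(Index1D index, ArrayView<int> data)
    {
        data[index] = index;
    }

    public static void Main(string[] args)
    {
        // Pass "cpu" to run on the CPU accelerator instead of the GPU.
        bool preferCpu = args.Length > 0 && args[0] == "cpu";

        using var context = Context.CreateDefault();
        using var accelerator =
            context.GetPreferredDevice(preferCpu).CreateAccelerator(context);
        Console.WriteLine($"Running on {accelerator.Name}");

        // Load the kernel once; the explicit-stream variant takes the
        // stream as its first argument.
        var kernel = accelerator
            .LoadAutoGroupedKernel<Index1D, ArrayView<int>>(FillKernel);

        var tasks = new Task[Workers];
        for (int w = 0; w < Workers; w++)
        {
            tasks[w] = Task.Factory.StartNew(() =>
            {
                // One stream and one buffer per worker, shared accelerator.
                using var stream = accelerator.CreateStream();
                using var buffer = accelerator.Allocate1D<int>(Length);

                for (int i = 0; i < Iterations; i++)
                {
                    buffer.View.MemSetToZero(stream);
                    kernel(stream, Length, buffer.View);
                    int[] result = buffer.GetAsArray1D(stream);
                    stream.Synchronize();

                    if (result[Length - 1] != Length - 1)
                        throw new InvalidOperationException(
                            "Unexpected kernel result");
                }
            }, TaskCreationOptions.LongRunning);
        }

        Task.WaitAll(tasks);
        Console.WriteLine("Completed without exceptions");
    }
}
```

Running the same binary once with `cpu` and once without would show whether the failure depends on the OpenCL backend, which follows the suggestion to try the CPU accelerator first.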