FLAMEGPU2
MultiThreadDeviceTest temperamental failures
On my laptop's 980M, a Debug build crashes the test run within the MultiThreadDeviceTest suite. This happens because a cudaErrorLaunchTimeout
CUDA error is thrown from a non-main thread (and hence can't be caught).
e.g.
[ RUN ] MultiThreadDeviceTest.SameModelSeperateThread_Environment
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUC:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
DAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
CC:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
C:/Users/Robadob/fgpu2/tests/test_cases/detail/test_multi_thread_device.cu(271): error: Value of: has_error_1
Actual: true
Expected: false
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/exception/FLAMEGPUDeviceException.cu(19): cudaErrorLaunchTimeout the launch timed out and was terminated
Rerunning the suite in isolation caused this test to pass, but SameModelMultiDevice_Environment
crashed instead.
The tests were written using std::thread
rather than an ensemble, so they can't easily be changed to fail rather than crash (using the ensemble error levels).
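One possible way to turn an exception that escapes a worker thread into an ordinary test failure, without moving to ensembles, is to catch it inside the thread and hand it back to the main thread as a std::exception_ptr. The sketch below uses only the standard library; the helper names run_and_capture and message_of are hypothetical, and a std::runtime_error would stand in for the CUDA exception. It is not how the existing tests are written, and it would not help if the error is raised somewhere that cannot be wrapped in a try block.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs `work` on its own std::thread and returns any
// exception it threw, instead of letting the exception escape the thread
// (an uncaught exception in a std::thread calls std::terminate).
std::exception_ptr run_and_capture(const std::function<void()> &work) {
    std::exception_ptr captured = nullptr;
    std::thread worker([&captured, &work]() {
        try {
            work();
        } catch (...) {
            captured = std::current_exception();
        }
    });
    // join() synchronises with the worker, so reading `captured` after it is safe.
    worker.join();
    return captured;
}

// Hypothetical helper: extracts the message from a captured exception,
// or returns an empty string if nothing was thrown.
std::string message_of(const std::exception_ptr &e) {
    if (!e) return "";
    try {
        std::rethrow_exception(e);
    } catch (const std::exception &ex) {
        return ex.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

In a test, the main thread could then assert that the returned pointer is null and print message_of(...) on failure, so a timeout shows up as a failed expectation rather than a crash.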
The first test runs the same simulation of 10k agents 3 times in separate std::threads. The models include an agent function that reads an environment property 10k times. A 980M has 16 SMs, each capable of hosting 2048 resident threads, but occupancy is likely below 100%, and as the kernels come from separate launches, SMs might not share blocks between them.
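As a rough back-of-envelope check of the figures above, the sketch below compares the threads demanded by three concurrent 10k-agent launches against the resident-thread capacity of the device. It assumes one GPU thread per agent, and the function names are hypothetical; real occupancy depends on block size, registers and shared memory, which this ignores.

```cpp
#include <cassert>

// Upper bound on simultaneously resident threads for a device
// (SM count multiplied by the per-SM resident thread limit).
constexpr unsigned int resident_thread_capacity(unsigned int sm_count,
                                                unsigned int threads_per_sm) {
    return sm_count * threads_per_sm;
}

// True if `demanded` threads fit on the device when only
// `occupancy_percent` (0-100) of the theoretical capacity is achieved.
constexpr bool fits_at_occupancy(unsigned int demanded,
                                 unsigned int capacity,
                                 unsigned int occupancy_percent) {
    return static_cast<unsigned long long>(demanded) * 100ull <=
           static_cast<unsigned long long>(capacity) * occupancy_percent;
}

// Figures taken from the description above.
constexpr unsigned int kCapacity = resident_thread_capacity(16, 2048);  // 32768
constexpr unsigned int kDemand = 3 * 10000;  // three simulations of 10k agents
```

With these numbers the three launches only just fit at 100% occupancy (30000 of 32768 threads) and no longer fit at, say, 50%, in which case some blocks must wait for others to finish and the wall-clock time of the launches grows accordingly.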
This is most likely a WDDM timeout issue for older/smaller devices with debug builds, so scaling the test down would likely fix it for other Maxwell-era / smaller devices (which we don't regularly run code on anymore).
The quickest fix will (probably) be for users (Rob) to increase the WDDM timeout to a non-default value. A cleaner fix would be to reduce the problem size on smaller devices so the test runs within the default WDDM timeout, by turning down the population size and/or the amount of work done within the kernel. This will, however, reduce the runtime and the likelihood of overlap on larger, more modern devices (in a release build). Reliably testing races is not easy.
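A minimal sketch of the "reduce the problem size on smaller devices" option, assuming the population is scaled linearly with SM count relative to some reference device. The function scaled_population, the reference SM count and the minimum population are all hypothetical values chosen for illustration; in the tests the SM count could come from the multiProcessorCount field of cudaDeviceProp, which is not queried here so that the snippet stays plain C++.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: scales the test population down on devices with fewer
// SMs than a reference device, never exceeding the reference population and
// never dropping below `min_population`.
unsigned int scaled_population(unsigned int sm_count,
                               unsigned int reference_sm_count,
                               unsigned int reference_population,
                               unsigned int min_population) {
    if (reference_sm_count == 0) return min_population;
    // 64-bit intermediate avoids overflow in the multiplication.
    const unsigned long long scaled =
        static_cast<unsigned long long>(reference_population) * sm_count /
        reference_sm_count;
    const unsigned long long capped =
        std::min<unsigned long long>(scaled, reference_population);
    return static_cast<unsigned int>(
        std::max<unsigned long long>(capped, min_population));
}
```

For example, with a hypothetical 80-SM reference device and the 10k population used by the test, a 16-SM device would run 2000 agents, while larger devices keep the full 10k and so keep their chance of overlapping kernels.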