FLAMEGPU2 MultiThreadDeviceTest temperamental failures

Open • Robadob opened this issue 2 years ago • 1 comment

On my laptop's 980M, a Debug build crashes the test suite partway through this group of tests (MultiThreadDeviceTest). The crash occurs because a cudaErrorLaunchTimeout CUDA error is thrown from a non-main thread (and hence can't be caught).

e.g.

[ RUN      ] MultiThreadDeviceTest.SameModelSeperateThread_Environment
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUC:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
DAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
CC:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/simulation/CUDASimulation.cu(950): cudaErrorLaunchTimeout the launch timed out and was terminated
C:/Users/Robadob/fgpu2/tests/test_cases/detail/test_multi_thread_device.cu(271): error: Value of: has_error_1
  Actual: true
Expected: false
C:\Users\Robadob\fgpu2\include\flamegpu/simulation/detail/CUDAErrorChecking.cuh(28): CUDA Error: C:/Users/Robadob/fgpu2/src/flamegpu/exception/FLAMEGPUDeviceException.cu(19): cudaErrorLaunchTimeout the launch timed out and was terminated

Rerunning the suite in isolation let this test pass, but SameModelMultiDevice_Environment crashed instead.

The tests were written using std::thread rather than an ensemble, so they can't easily be changed to fail rather than crash (using the ensemble error levels).
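For reference, one common C++ pattern for turning an exception in a worker thread into a reportable failure is to catch it inside the thread and hand it back as a std::exception_ptr. The sketch below is generic and not taken from the FLAMEGPU2 test code; `runInThread` and `describeError` are hypothetical names. Whether it would help here is an assumption: it only applies if the error surfaces as a C++ exception inside the worker, and after a launch timeout later CUDA calls in the same process may keep failing.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs `work` on a new std::thread and returns any
// exception it threw. An exception that escapes a std::thread's entry
// function calls std::terminate, so it is caught inside the worker and
// handed back to the caller as a std::exception_ptr instead.
std::exception_ptr runInThread(const std::function<void()> &work) {
    std::exception_ptr error = nullptr;
    std::thread worker([&work, &error]() {
        try {
            work();
        } catch (...) {
            error = std::current_exception();
        }
    });
    worker.join();  // join() synchronises, so reading `error` afterwards is safe
    return error;
}

// Hypothetical helper: converts a captured exception into a message that the
// main thread can report. Returns an empty string if nothing was thrown.
std::string describeError(const std::exception_ptr &error) {
    if (error == nullptr) {
        return "";
    }
    try {
        std::rethrow_exception(error);
    } catch (const std::exception &e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

A test could then call `runInThread` once per simulation and assert on the main thread that `describeError` returns an empty string, so a failure is recorded as a normal test failure.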

Jan 03 '23 15:01 Robadob

The first test runs the same simulation of 10k agents 3 times in separate std::threads. The models include an agent function that reads an environment property 10k times. A 980M has 16 SMs, each capable of hosting 2048 resident threads, but occupancy is likely below 100%, and as the kernels come from separate launches, SMs might not share blocks.
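To put rough numbers on this, the arithmetic can be written out as below. It is a back-of-the-envelope sketch using only the figures quoted in this comment, and it assumes one GPU thread per agent and that every agent performs the 10k environment reads; neither assumption is confirmed in this thread.

```cpp
#include <cassert>

// Figures quoted in the comment above.
constexpr int kSmCount = 16;                    // SMs quoted for the 980M
constexpr int kMaxResidentThreadsPerSm = 2048;  // resident threads per SM
constexpr int kAgentsPerSimulation = 10000;     // 10k agents
constexpr int kConcurrentSimulations = 3;       // 3 std::threads
constexpr int kReadsPerAgent = 10000;           // 10k environment reads

// Upper bound on simultaneously resident threads at 100% occupancy.
constexpr int maxResidentThreads() {
    return kSmCount * kMaxResidentThreadsPerSm;  // 16 * 2048 = 32768
}

// Threads requested if all three simulations overlap
// (assumption: one thread per agent).
constexpr int requestedThreads() {
    return kAgentsPerSimulation * kConcurrentSimulations;  // 30000
}

// Total environment reads across all three simulations
// (assumption: each agent performs every read).
constexpr long long totalEnvironmentReads() {
    return static_cast<long long>(requestedThreads()) * kReadsPerAgent;
}

static_assert(maxResidentThreads() == 32768, "16 SMs * 2048 threads");
static_assert(requestedThreads() == 30000, "3 simulations * 10k agents");
static_assert(totalEnvironmentReads() == 300000000LL, "30k agents * 10k reads");
```

Under these assumptions, 30,000 threads would only be fully resident at roughly 92% occupancy (30000 / 32768), so any lower occupancy means the three launches cannot all be resident at once.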

This is most likely a WDDM timeout issue for older/smaller devices + Debug builds, so scaling the test down would likely fix it for other Maxwell-era / smaller devices (which we don't regularly run code on anymore).

The quickest fix will (probably) be for users (Rob) to increase the WDDM timeout to a non-default value. A cleaner fix would be to reduce the problem size on smaller devices so that it runs within the default WDDM timeout, by turning down the population size and/or the amount of work done within the kernel. This will, however, reduce the runtime and the likelihood of overlap on larger, more modern devices (in a Release build). Reliably testing races is not easy.
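As a rough illustration of the "reduce the problem size on smaller devices" option, the population could be scaled by SM count. This is a hypothetical sketch, not code from the test suite: `scaledPopulation` is an invented name, and the reference SM count and minimum population are arbitrary tuning values. In a real test the device SM count could come from `cudaDeviceProp::multiProcessorCount`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: scales the reference population by the ratio of the
// device's SM count to a reference device's SM count. The result is never
// larger than the reference population, so larger devices keep the full
// workload, and (taking precedence) is at least minPopulation.
unsigned int scaledPopulation(unsigned int referencePopulation,
                              int deviceSmCount,
                              int referenceSmCount,
                              unsigned int minPopulation) {
    if (deviceSmCount <= 0 || referenceSmCount <= 0) {
        return minPopulation;  // invalid input: fall back to the smallest size
    }
    // Use 64-bit arithmetic to avoid overflow in the multiplication.
    const unsigned long long scaled =
        static_cast<unsigned long long>(referencePopulation) *
        static_cast<unsigned long long>(deviceSmCount) /
        static_cast<unsigned long long>(referenceSmCount);
    const unsigned long long capped =
        std::min<unsigned long long>(scaled, referencePopulation);
    return std::max(static_cast<unsigned int>(capped), minPopulation);
}
```

For example, with a reference of 10,000 agents tuned on an assumed 80-SM device, `scaledPopulation(10000, 16, 80, 1024)` gives 2,000 agents for a 16-SM device, while any device with 80 or more SMs keeps the full 10,000, so the likelihood of overlap there is unchanged.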

Jan 03 '23 15:01 ptheywood
