Unhealthy nodes may force the host to switch to CLANG; CUDA=1 does not enforce the backend
Immediately after a restart, the host may switch from the GPU/CUDA device to CLANG, and nothing can be done except shutting down the other peers one by one to find which one is causing the issue. On the unhealthy host, the issue is detected by basic health checks:
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183
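The manual check above could be automated on each node before it joins the network. The following is only a minimal sketch, not existing project code: the names `run_nvidia_smi` and `gpu_is_healthy` are hypothetical, and it assumes that a non-zero exit code or the "Failed to initialize NVML" message is enough to mark the GPU as unhealthy.

```python
import shutil
import subprocess
from typing import Callable, Tuple


def run_nvidia_smi() -> Tuple[int, str]:
    """Run nvidia-smi and return (exit code, combined stdout and stderr)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Treat a missing binary as a failed probe.
        return 127, "nvidia-smi not found"
    try:
        proc = subprocess.run([exe], capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        # A hung driver can make nvidia-smi block, so a timeout counts as failure.
        return 124, "nvidia-smi timed out"
    return proc.returncode, proc.stdout + proc.stderr


def gpu_is_healthy(probe: Callable[[], Tuple[int, str]] = run_nvidia_smi) -> bool:
    """Return False if the probe fails or reports an NVML initialization error."""
    code, output = probe()
    if code != 0:
        return False
    # Assumption: this message (seen in the report above) always means a broken GPU stack.
    return "Failed to initialize NVML" not in output


if __name__ == "__main__":
    print("GPU healthy:", gpu_is_healthy())
```

A node started with CUDA=1 could call `gpu_is_healthy()` at startup and refuse to join, rather than silently falling back to CLANG, when it returns False. The probe is passed in as a parameter so the check can be exercised without a GPU.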
This condition broke the entire network. We probably need additional checks that filter peers by some criteria and reject incompatible ones.
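One possible shape for such a filter is sketched below. Everything in it is an assumption for illustration: `PeerInfo`, its fields, and `filter_peers` are hypothetical names, not part of any existing API, and the criteria (the active backend must match the requested one, and a CUDA peer must pass its GPU health check) are just one candidate rule set.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PeerInfo:
    """Hypothetical self-report a peer would advertise when joining."""
    peer_id: str
    requested_backend: str  # e.g. "CUDA" when the node was started with CUDA=1
    active_backend: str     # backend the node actually ended up using
    gpu_healthy: bool       # result of a local health check such as nvidia-smi


def is_acceptable(peer: PeerInfo) -> bool:
    """Reject peers that silently fell back or report a broken GPU."""
    if peer.active_backend != peer.requested_backend:
        # The node asked for one backend but is running another (e.g. CUDA -> CLANG).
        return False
    if peer.active_backend == "CUDA" and not peer.gpu_healthy:
        return False
    return True


def filter_peers(peers: Iterable[PeerInfo]) -> List[PeerInfo]:
    """Keep only the peers that pass the checks above."""
    return [p for p in peers if is_acceptable(p)]


if __name__ == "__main__":
    peers = [
        PeerInfo("a", "CUDA", "CUDA", True),
        PeerInfo("b", "CUDA", "CLANG", False),  # unhealthy node that fell back
        PeerInfo("c", "CLANG", "CLANG", False),
    ]
    print([p.peer_id for p in filter_peers(peers)])
```

Note that this sketch relies on peers reporting their own state honestly, so it would catch an accidentally unhealthy node but not a deliberately misbehaving one.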
CLANG and CUDA peers that use tinygrad are compatible.
The case: on one node, due to some glitch, the GPU stopped responding (e.g. a driver issue, a GPU hardware issue, etc.). This leaves the entire network unable to proceed until the node is found and manually turned off.
In other words: one can spoof an incompatible peer and make the network inoperable.
Should be fixed in 1.0 barring CUDA support.