exo
Unhealthy nodes may force a host to switch to CLANG; setting CUDA=1 does not enforce the expected behavior
Immediately after a restart, a host may switch from the GPU/CUDA device to CLANG, and there is nothing that can be done except shutting down the other peers to find out which one is causing the issue. On the unhealthy host, the issue is detected by a basic health check:
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183
This condition broke the entire network. Additional checks are probably needed to filter peers by some criteria and reject incompatible ones.
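As a rough illustration of the kind of check meant here, the sketch below runs `nvidia-smi` and treats the "Driver/library version mismatch" message shown above as a sign of an unhealthy node. It is not part of exo: the function names `classify_nvidia_smi` and `gpu_is_healthy` are hypothetical, and how a peer would act on the result (refuse to join, or be rejected by others) is left open.

```python
import shutil
import subprocess
from typing import Tuple

# Substring taken from the nvidia-smi error quoted in this report.
NVML_MISMATCH = "Driver/library version mismatch"


def classify_nvidia_smi(returncode: int, output: str) -> Tuple[bool, str]:
    """Hypothetical helper: turn nvidia-smi's exit code and output into (healthy, reason)."""
    if NVML_MISMATCH in output:
        return False, "NVML driver/library version mismatch"
    if returncode != 0:
        return False, f"nvidia-smi exited with code {returncode}"
    return True, "ok"


def gpu_is_healthy(timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Hypothetical helper: run nvidia-smi locally and classify the result."""
    if shutil.which("nvidia-smi") is None:
        return False, "nvidia-smi not found on PATH"
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"nvidia-smi could not be run: {exc}"
    # The error may land on stdout or stderr, so inspect both.
    return classify_nvidia_smi(proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    healthy, reason = gpu_is_healthy()
    print(f"healthy={healthy} reason={reason}")
```

A node could run such a check before announcing itself and stay out of the network when it fails, rather than silently falling back to CLANG.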