exo Unhealthy nodes may force host switch to CLANG. CUDA=1 will not enforce behavior

Open FFAMax opened this issue 4 months ago • 2 comments

Immediately after a restart, a host may switch from the GPU/CUDA device to CLANG, and nothing can be done except shutting down the other peers to find which one is causing the issue. On the unhealthy host, the issue is detected by basic health checks:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183

This condition broke the entire network. Additional checks are probably needed to filter peers by some criteria and reject incompatible ones.
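
A rough sketch of what such a check could look like on each node, assuming the only signal available is the text and exit code of nvidia-smi. The function names and status strings here are hypothetical and not part of exo; how a peer would advertise or act on the result is left open.

```python
import shutil
import subprocess


def classify_nvidia_smi(returncode: int, output: str) -> str:
    """Classify an nvidia-smi result (hypothetical status strings).

    The mismatch message is checked first so it is reported explicitly,
    regardless of the exit code.
    """
    if "Driver/library version mismatch" in output:
        return "version_mismatch"
    if returncode != 0 or "Failed to initialize NVML" in output:
        return "unhealthy"
    return "healthy"


def local_gpu_status(timeout_s: float = 10.0) -> str:
    """Run nvidia-smi on this host and classify what it reports."""
    if shutil.which("nvidia-smi") is None:
        # No NVIDIA tooling on this host at all.
        return "no_nvidia_smi"
    try:
        proc = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable nvidia-smi is treated as unhealthy.
        return "unhealthy"
    return classify_nvidia_smi(proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    print(local_gpu_status())
```

A node could run this before joining and refuse to advertise itself as a CUDA peer unless the status is "healthy"; other peers could likewise reject any peer that reports something else.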

Oct 28 '24 06:10 FFAMax
