
Unhealthy nodes may force host switch to CLANG. CUDA=1 will not enforce behavior

Open FFAMax opened this issue 1 year ago • 2 comments

Immediately after a restart, a host may switch from the GPU/CUDA device to CLANG, and nothing can be done except shutting down the other peers one by one to find which one is causing the issue. On the unhealthy host, the problem is detected by basic health checks:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183

This condition broke the entire network. Additional checks are probably needed to filter peers by some criteria and reject incompatible ones.
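The basic health check above could be automated along these lines. This is only a minimal sketch, not part of exo: the function names `classify_nvidia_smi` and `gpu_is_healthy` are hypothetical, and it assumes that a non-zero exit code or the "Failed to initialize NVML" message from `nvidia-smi` is enough to treat the node as unhealthy.

```python
import subprocess

# Message printed by nvidia-smi in the failure shown above.
NVML_FAILURE_MARKER = "Failed to initialize NVML"


def classify_nvidia_smi(returncode: int, output: str) -> bool:
    """Return True if an nvidia-smi result looks healthy (hypothetical helper)."""
    return returncode == 0 and NVML_FAILURE_MARKER not in output


def gpu_is_healthy(timeout_s: float = 10.0) -> bool:
    """Run nvidia-smi and report whether the GPU driver stack responds.

    Assumption: a missing binary, a timeout, or any OS-level error
    counts as unhealthy.
    """
    try:
        proc = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return classify_nvidia_smi(proc.returncode, proc.stdout + proc.stderr)
```

A node could run `gpu_is_healthy()` before joining and refuse to start (or announce itself as degraded) when it returns False, instead of silently falling back to CLANG.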

FFAMax avatar Oct 28 '24 06:10 FFAMax

CLANG and CUDA peers that use tinygrad are compatible.

AlexCheema avatar Oct 29 '24 20:10 AlexCheema

> CLANG and CUDA peers that use tinygrad are compatible.

The case: on one node, due to some glitch, the GPU stopped responding (e.g. a driver issue or a GPU hardware issue). This makes the entire network unable to proceed until the node is detected and manually turned off.

In other words: someone can spoof an incompatible peer and make the network inoperable.
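One possible shape for such a peer filter is sketched below. Everything here is an assumption for illustration: `PeerReport`, its fields, and `filter_peers` are hypothetical and do not correspond to any existing exo API. The idea is that each peer reports which backend it was asked to use and which one it actually ended up on, and peers that are unhealthy or silently fell back are excluded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerReport:
    """Hypothetical self-report a peer would send when joining."""
    peer_id: str
    requested_backend: str  # e.g. "CUDA" when started with CUDA=1
    active_backend: str     # backend actually in use, e.g. "CUDA" or "CLANG"
    healthy: bool           # result of the peer's own health check


def accept_peer(report: PeerReport) -> bool:
    """Accept only healthy peers that run on the backend they requested."""
    return report.healthy and report.active_backend == report.requested_backend


def filter_peers(reports: list[PeerReport]) -> list[PeerReport]:
    """Keep the peers that pass accept_peer, preserving input order."""
    return [r for r in reports if accept_peer(r)]


reports = [
    PeerReport("node-a", "CUDA", "CUDA", True),
    PeerReport("node-b", "CUDA", "CLANG", True),   # silent fallback
    PeerReport("node-c", "CLANG", "CLANG", True),
    PeerReport("node-d", "CUDA", "CUDA", False),   # failed health check
]
accepted_ids = [r.peer_id for r in filter_peers(reports)]
```

Note that a peer intentionally running CLANG is still accepted here; only the mismatch between requested and active backend, or a failed health check, causes rejection.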

FFAMax avatar Oct 30 '24 05:10 FFAMax

Should be fixed in 1.0 barring CUDA support.

Evanev7 avatar Dec 18 '25 20:12 Evanev7