FFAMax

Results 75 comments of FFAMax

Hello, Team. Has anybody found a solution to avoid `CUDA Error 2, out of memory`? ``` loaded weights in 4041.00 ms, 8.03 GB loaded at 1.99 GB/s Error processing tensor for shard...

In my case the GPUs were not defined, so it was unable to proceed properly. Once FLOPs were defined, it was able to split according to the VRAM available on all GPUs. Example: https://github.com/exo-explore/exo/pull/393/files
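The idea of splitting a model across GPUs in proportion to each card's capacity can be sketched as below. This is a minimal illustration, not exo's actual code; the function name and the per-GPU VRAM figures are made up for the example.

```python
# Hypothetical sketch: divide num_layers across GPUs in proportion to
# each GPU's VRAM (the same idea works with FLOPs as the weight).
def split_layers(num_layers, vram_gb):
    total = sum(vram_gb)
    # Provisional share per GPU, rounded down
    counts = [int(num_layers * v / total) for v in vram_gb]
    # Hand any remaining layers to the largest GPUs first
    rest = num_layers - sum(counts)
    for i in sorted(range(len(vram_gb)), key=lambda i: -vram_gb[i])[:rest]:
        counts[i] += 1
    return counts

# e.g. a 24 GB card and two 11 GB cards sharing a 32-layer model
print(split_layers(32, [24, 11, 11]))  # [17, 8, 7]
```

Without per-device capability numbers, a scheduler has nothing to weight the split by, which matches the failure described above.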

With `DEBUG=8 TINYGRAD_DEBUG=8 DEBUG_DISCOVERY=8 exo` I got some info: ``` Broadcasting presence at (127.0.0.1) Broadcasting presence at (10.1.3.177): {"type": "discovery", "node_id": "ce0c3546-20d9-4a2c-9e96-16c6894259fa", "grpc_port": 49868, "device_capabilities": {"model": "Linux Box (NVIDIA GEFORCE GTX...

> It might be worth trying the patch in #7376 Thanks, it helped! The full command to run in my case: `SUPPORT_BF16=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python3 examples/llama3.py --download_model --shard 10 --size 8B`
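As a side note on `CUDA_VISIBLE_DEVICES`: it restricts which physical GPUs a process can see and renumbers the visible ones from 0. A small sketch of that remapping (no real GPUs needed, the device IDs here are illustrative):

```python
import os

# Pretend we only expose physical GPUs 2 and 5 to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,5"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
# Inside the process, logical device 0 maps to physical GPU 2,
# logical device 1 maps to physical GPU 5.
mapping = {logical: int(physical) for logical, physical in enumerate(visible)}
print(mapping)  # {0: 2, 1: 5}
```

So `CUDA_VISIBLE_DEVICES=0,1,...,9` in the command above simply exposes all ten GPUs in their natural order.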

Now it's failing on another issue, but that's another story :D ``` ptxas fatal : SM version specified by .target is higher than default SM version assumed Failed to generate...

Should it just translate the config file into the equivalent CLI options like --listen-port --broadcast-port --discovery-module --discovery-timeout --wait-for-peers, or is the goal to add more options, like on what interface...
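The first interpretation (config file as a thin layer over the existing flags) could look like the sketch below. This is a hypothetical illustration only: the config key names and file format are assumptions, and only the flags listed above are used.

```python
# Hypothetical sketch: map config keys to the existing CLI options.
FLAG_MAP = {
    "listen_port": "--listen-port",
    "broadcast_port": "--broadcast-port",
    "discovery_module": "--discovery-module",
    "discovery_timeout": "--discovery-timeout",
    "wait_for_peers": "--wait-for-peers",
}

def config_to_args(config):
    """Translate a parsed config dict into an argv-style flag list."""
    args = []
    for key, flag in FLAG_MAP.items():
        if key in config:
            args += [flag, str(config[key])]
    return args

print(config_to_args({"listen_port": 49868, "discovery_timeout": 30}))
# ['--listen-port', '49868', '--discovery-timeout', '30']
```

The second interpretation (config as a superset, e.g. binding to a specific interface) would need new options that have no CLI equivalent yet.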

@lipere123 would you mind pushing your changes to a cloned repo, so I can clone them and try/contribute?

> Can you double check the FP16 numbers here? Those look a little too low. They are usually halfway between the 8 and 32. For example take GTX 1080 Ti...
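The "halfway" heuristic quoted above can be stated as a one-liner: estimate the FP16 throughput as the midpooint of the INT8 and FP32 figures. A sketch with purely illustrative TFLOPS numbers (not real GTX 1080 Ti specs):

```python
# Sketch of the quoted rule of thumb: FP16 throughput estimated as the
# midpoint of the INT8 and FP32 figures. Numbers below are illustrative.
def estimate_fp16(int8_tflops, fp32_tflops):
    return (int8_tflops + fp32_tflops) / 2

print(estimate_fp16(44.0, 11.0))  # 27.5
```

Note this rule does not hold for Pascal-era GeForce cards, whose FP16 rate is a small fraction of FP32, which is what the reply below is getting at.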

It was changed to 900 due to failures on old HW like the GTX 1080. As I see it, the project is mostly focused on Apple devices, so for most people it may have no...

> Are you using tinygrad? Yes. That's a Linux machine, therefore TinygradDynamicShardInferenceEngine is picked up.