CUDA out of memory - Benjamin.
./exo-cli-3.1-70b.sh hello
which runs:
#!/bin/bash
# Send a chat completion request to the exo node's OpenAI-compatible API.
/usr/bin/curl --progress-bar --connect-timeout 1800 --max-time 1800 http://edgenode2:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7
  }'
{"detail": "Error processing prompt (see logs with DEBUG>=2): <AioRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "Unexpected <class 'RuntimeError'>: CUDA Error 2, out of memory"\n\tdebug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-10-10T05:40:20.356329187+00:00", grpc_status:2, grpc_message:"Unexpected <class \'RuntimeError\'>: CUDA Error 2, out of memory"}"\n>"}
I'm also intermittently experiencing this, see: https://github.com/exo-explore/exo/issues/235.
It seems exo is unable to properly split the model into chunks so that it can be loaded proportionally across several nodes (even llama 3.1 8B fails to split across 3x 8GB GPUs).
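As a rough sanity check of whether the 8B model should fit at all, here is a back-of-envelope estimate (a sketch only: it assumes fp16 weights at 2 bytes per parameter, an even split across nodes, and ignores KV cache, activations, and CUDA context overhead; the model size, node count, and GPU size are the values from this report):

```shell
#!/bin/bash
# Back-of-envelope memory estimate: fp16 weights only (2 bytes/param).
# Ignores KV cache, activations, and per-process CUDA context overhead,
# which can each take an additional GB or more per GPU.
PARAMS_B=8          # model size in billions of parameters (llama 3.1 8B)
NODES=3             # number of nodes in the cluster
GPU_GB=8            # memory per GPU in GB

WEIGHTS_GB=$((PARAMS_B * 2))                          # 8B params * 2 bytes = 16 GB
PER_NODE_GB=$(( (WEIGHTS_GB + NODES - 1) / NODES ))   # ceiling division across nodes

echo "weights: ${WEIGHTS_GB} GB total, ~${PER_NODE_GB} GB per node of ${GPU_GB} GB available"
```

On paper the per-node share (~6 GB) fits under 8 GB, so an OOM here points at the split not actually being proportional (or overhead eating the remaining headroom) rather than the model being fundamentally too large for the cluster.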