CUDA out of memory - Benjamin.
./exo-cli-3.1-70b.sh hello
which runs:
#!/bin/bash
# Send a chat completion request to the exo node's OpenAI-compatible API.
/usr/bin/curl --progress-bar --connect-timeout 1800 --max-time 1800 http://edgenode2:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7
  }'
{"detail": "Error processing prompt (see logs with DEBUG>=2): <AioRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "Unexpected <class 'RuntimeError'>: CUDA Error 2, out of memory"\n\tdebug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-10-10T05:40:20.356329187+00:00", grpc_status:2, grpc_message:"Unexpected <class \'RuntimeError\'>: CUDA Error 2, out of memory"}"\n>"}
I'm also intermittently experiencing this, see: https://github.com/exo-explore/exo/issues/235.
It seems exo is unable to properly split the model into chunks so that it can be loaded proportionally across several nodes (even llama 3.1 8B fails to split across 3x 8GB GPUs).
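As a rough sanity check of whether the 8B model should fit at all, here is a back-of-envelope estimate (a sketch only: it assumes fp16 weights at 2 bytes per parameter, an even split across nodes, and ignores KV cache, activations, and CUDA context overhead; the model size, node count, and GPU size are the values from this report):

```shell
#!/bin/bash
# Back-of-envelope memory estimate: fp16 weights only (2 bytes/param).
# Ignores KV cache, activations, and per-process CUDA context overhead,
# which can each take an additional GB or more per GPU.
PARAMS_B=8          # model size in billions of parameters (llama 3.1 8B)
NODES=3             # number of nodes in the cluster
GPU_GB=8            # memory per GPU in GB

WEIGHTS_GB=$((PARAMS_B * 2))                          # 8B params * 2 bytes = 16 GB
PER_NODE_GB=$(( (WEIGHTS_GB + NODES - 1) / NODES ))   # ceiling division across nodes

echo "weights: ${WEIGHTS_GB} GB total, ~${PER_NODE_GB} GB per node of ${GPU_GB} GB available"
```

On paper the per-node share (~6 GB) fits under 8 GB, so an OOM here points at the split not actually being proportional (or overhead eating the remaining headroom) rather than the model being fundamentally too large for the cluster.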