
The model is loaded repeatedly

Open belog2867 opened this issue 1 year ago • 7 comments

The model was loaded twice, and the 1B llama model took up more than 10 GB of RAM. Is this normal?

新建 文本文档.txt (New Text Document.txt)

belog2867 avatar Feb 09 '25 10:02 belog2867

Linux, tinygrad

belog2867 avatar Feb 09 '25 12:02 belog2867

Same problem!

llama-3-8b and llama-3.2-1b both load the model into GPU memory twice!

On AWS g6 and g4dn generation instances with the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Ubuntu 22.04) 20250202 AMI.

yetisno avatar Feb 13 '25 14:02 yetisno

These look like two different models. The first should be unloaded from memory when the second is loaded in.

AlexCheema avatar Feb 13 '25 23:02 AlexCheema

In my experiments it happened with every model I tried on a Linux instance.

Cluster with only one node:

It loads the same model twice when the first request is sent.

Cluster with multiple nodes:

The node that receives the API request loads the same model twice, for the partition it should process, on the first request.

The other nodes load it only one time, not twice.

"Load model" here means loading the model into GPU memory.

"Load the same model twice" means loading the same model into GPU memory twice, occupying 2x the memory.

You can see it in the log when you scroll up in the TUI.

Just like in the attachment @belog2867 posted, it only happens on the node that receives the API request.

yetisno avatar Feb 14 '25 04:02 yetisno
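The per-node behavior described above would be consistent with a missing guard against re-loading an already-loaded model. A hypothetical sketch of such a guard (the names `load_model_once`, `loader`, and `_loaded` are assumptions for illustration, not exo's actual loader API):

```python
# Cache of models already resident on this node, keyed by model id.
# Purely illustrative -- exo's real shard/loader plumbing differs.
_loaded: dict[str, object] = {}

def load_model_once(model_id: str, loader) -> object:
    """Return the cached model if this id was already loaded; otherwise
    call the loader exactly once and remember the result."""
    if model_id not in _loaded:
        _loaded[model_id] = loader(model_id)  # loads weights into GPU memory
    return _loaded[model_id]
```

With a guard like this, a second request for the same model id on the receiving node would return the cached instance instead of allocating the weights a second time.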

I just started seeing this today as well (on a Jetson Orin Nano), no idea why. Trying to load the llama 3.2 1b model took up all the memory (8 GB, with about 7.4 GB available) and locked up the device each time I attempted it. After the first lockup, I watched the loading output and saw it load all the way, then just start doing it again.

I noticed the model it was trying to use started with unsloth/restOfModelName. I can't say I remember it being an unsloth model before, but that doesn't mean it wasn't; I just don't remember.

MostHated avatar Feb 16 '25 04:02 MostHated

I'm experiencing the same issue of the model being loaded twice while running tinygrad on Linux (testing at a605e23).

Below is the error log from when it runs out of memory: https://gist.github.com/oatawa1/78977d0b6835aa69f14344ca89a76b05

This commit (a174c78004e5fc62220dd4e9734e8ad4eaa75e39) is the latest one that loads the model only once.

However, after this change (54605299b801abfb13b6d6d3805d05888460363f), I started getting a different error related to mlx. But I believe the double model load may have already existed before that.

oatawa1 avatar Feb 20 '25 03:02 oatawa1

I also created a log of this occurring with the llama3.2 1b model on a Jetson Orin Nano Super 8 GB device.

If I do not have any debug flags on (e.g. neither DEBUG= nor TINYGRAD_DEBUG=), it always locks up the device once the second model load starts.

If I do have debug enabled, as in the log below, it took quite a bit longer to load, but the device didn't lock up during the second load, which I found interesting.

You can see it on line 928: https://gist.github.com/MostHated/9aa7845d95aba47baf27ab3a9080e184#file-exo-log-L928

It reaches 100% with ram used: 2.47 GB.

Then at line 1546 it hits 100% again with ram used: 4.94 GB: https://gist.github.com/MostHated/9aa7845d95aba47baf27ab3a9080e184#file-exo-log-L1546

I also noticed that after the initial model loading, there are these warnings a bit further down:

WARNING: not loading output.weight
WARNING: not loading freqs_cis
loaded weights in 5348.72 ms, 2.47 GB loaded at 0.46 GB/s

Then it starts loading the model again right after that.

Here is the full log.

https://gist.github.com/MostHated/9aa7845d95aba47baf27ab3a9080e184

MostHated avatar Feb 21 '25 15:02 MostHated
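The jump from 2.47 GB to 4.94 GB in the log above is an exact doubling, which matches the double-load hypothesis. A quick heuristic sketch for spotting that pattern in such a log (the regex and 10% tolerance are assumptions for illustration, not part of exo):

```python
import re

def detect_double_load(log_text: str) -> bool:
    """Heuristic: return True if any consecutive pair of 'ram used'
    readings in the log roughly doubles (within 10%), as seen when
    the same weights are loaded into GPU memory twice."""
    sizes = [float(m) for m in re.findall(r"ram used:\s*([\d.]+)\s*GB", log_text)]
    return any(abs(b - 2 * a) < 0.1 * a for a, b in zip(sizes, sizes[1:]))
```

Applied to the gist, the 2.47 GB and 4.94 GB readings would trigger this check.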

Should be fixed in 1.0.

Evanev7 avatar Dec 18 '25 18:12 Evanev7