
Should all models now be chunked?

Open · flatsiedatsie opened this issue on May 12, 2024 · 3 comments

I tried to load NeuralReyna, a relatively small model, but still got an out-of-memory error.

Should ALL models be chunked, even ones smaller than 2GB?

[Screenshot, 2024-05-12 11:44, showing the out-of-memory error]

Somewhat off-topic, but perhaps useful for others: I tried to do this and chunk NeuralReyna. Interestingly, it didn't want to be split into very small (100MB) parts:

```
./gguf-split --split-max-size 100M ./neuralreyna-mini-1.8b-v0.3.q5_k_m.gguf neuralreyna-mini-1.8b-v0.3.q5_k_m
error: one of splits have 0 tensors. Maybe size or tensors limit is too small
```

Even 200MB was too small. Luckily, 250MB worked.

flatsiedatsie · May 12, 2024, 09:05

If you get an OOM error from ggml, it means the browser won't give you more RAM. You're probably also loading other models or other instances of wllama at the same time.

A chunked model won't help in this case, since you've already used up all the available RAM.

ngxson · May 12, 2024, 17:05
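A minimal sketch of the suggestion above: release an earlier wllama instance before loading a new model, so the browser can reclaim its memory. The wasm paths and model URLs here are placeholders, and `exit()` is assumed to be the cleanup call; verify both against the README of the wllama version you use.

```ts
import { Wllama } from '@wllama/wllama';

// Placeholder paths to the wasm binaries; see the wllama README
// for the exact files your version ships.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS);
await wllama.loadModelFromUrl('https://example.com/model-a.gguf', {});

// Before loading a second model, free the first one so its RAM
// can be reclaimed. exit() is assumed here to be the cleanup call.
await wllama.exit();

const wllama2 = new Wllama(CONFIG_PATHS);
await wllama2.loadModelFromUrl('https://example.com/model-b.gguf', {});
```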

Also, the context length n_ctx seems to be quite big; you should decrease it to save RAM.

ngxson · May 12, 2024, 17:05
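A hedged sketch of that tip, continuing the setup above: a smaller context means a smaller KV cache and therefore less RAM. The `n_ctx` option in the load config is an assumption mirroring llama.cpp's context-size parameter, and the URL is a placeholder; check the option name against the wllama README.

```ts
// Assumed: wllama forwards n_ctx to llama.cpp via the load config.
// Halving the context roughly halves the KV cache allocation.
await wllama.loadModelFromUrl(
  'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m.gguf', // placeholder URL
  { n_ctx: 1024 },
);
```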

I noticed a significant benefit from splitting the models, mostly due to Safari's cache-size constraints: Mobile Safari has a cache limit of 300MB, while Desktop Safari's limit is less than 1GB. If the model size exceeds the limit, the user has to re-download the model after refreshing the page. Besides that, as mentioned in the README, splitting helps reduce the time required to download the model.

felladrin · May 12, 2024, 18:05
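For reference, a sketch of how the splits produced by gguf-split might then be loaded, continuing the setup above. The shard URLs are placeholders (the `-0000N-of-0000M.gguf` suffix is gguf-split's naming scheme), and passing an array of shard URLs is an assumption; depending on the wllama version, you may instead pass only the first shard. The point is that each ~250MB shard is fetched and cached as a separate entry, keeping every part under Mobile Safari's ~300MB cache limit so a page refresh doesn't force a full re-download.

```ts
// Placeholder shard URLs following gguf-split's output naming.
const shards = [
  'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m-00001-of-00004.gguf',
  'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m-00002-of-00004.gguf',
  'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m-00003-of-00004.gguf',
  'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m-00004-of-00004.gguf',
];

// Assumed: loadModelFromUrl accepts a list of shard URLs and caches
// each one separately, so no single cache entry exceeds Safari's limit.
await wllama.loadModelFromUrl(shards, { n_ctx: 1024 });
```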