PostMessage: Data cannot be cloned, out of memory

Open flatsiedatsie opened this issue 1 year ago • 23 comments

I'm trying to load Mistral 7B 32K. I've chunked the 4.3GB model and uploaded it to Hugging Face.

When the download is seemingly complete, there is a warning about being out of memory:

Screenshot 2024-05-04 at 18 36 24

It's a little odd, as I normally load bigger chunked models (Llama 8B) with WebLLM. The task manager also indicates that memory pressure is medium.

flatsiedatsie • May 04 '24 16:05

Yes, it seems the issue is due to the way multiple files are copied to the web worker: we're currently copying all shards at once, which may cause it to run out of memory. The fix would be:

  • Copy one file at a time ==> This is the easy fix
  • Even better, move the download function into the web worker so no copy is needed (at the cost of making it harder to control) ==> The harder way to fix
  • Maybe use SharedArrayBuffer when possible to avoid the copy ==> Should be easy to implement (a rough sketch of the last two ideas is shown below)
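
For illustration only, a minimal sketch of the last two ideas using standard Web APIs. This is hypothetical code, not wllama's actual implementation; `worker` and `buffer` (a downloaded shard as an ArrayBuffer) are assumed to exist:

function sendByTransfer(worker, buffer) {
  // Transfer ownership of the shard's buffer instead of cloning it.
  // No copy is made, but the buffer becomes unusable on the main thread.
  worker.postMessage({ type: 'shard', buffer }, [buffer]);
}

function sendBySharedMemory(worker, buffer) {
  // Requires cross-origin isolation (COOP/COEP headers) for SharedArrayBuffer.
  // The shard is written once into shared memory; the worker can then read it
  // without a structured clone.
  const shared = new SharedArrayBuffer(buffer.byteLength);
  new Uint8Array(shared).set(new Uint8Array(buffer));
  worker.postMessage({ type: 'shard', buffer: shared });
}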

ngxson • May 04 '24 18:05

This is becoming a bit of a showstopper, unfortunately. It seems to affect even small models that would load under llama_cpp_wasm, such as NeuralReyna :-(

If you could help fix this issue, or give some pointers on how I could attempt to do so myself, that would be greatly appreciated. At this point I don't mind if a fix is slow or sub-optimal. I just want wllama to be reliable.

flatsiedatsie • May 08 '24 08:05

I'm planning to work on this issue in the next few days. It may be more complicated than it looks, so I'll need some time to figure it out. Please be patient.

ngxson • May 08 '24 09:05

That's great news! Thank you so much!

flatsiedatsie • May 08 '24 09:05

FYI, v1.7.0 has been released. It also comes with support for progressCallback; please see the "advanced" example:

https://github.com/ngxson/wllama/blob/d1ceeb6d38a076045262aeb18a36f7572aaa90d6/examples/advanced/index.html#L53-L57

This issue (out-of-memory) is hopefully fixed by #14, but I'm not 100% sure. Please try again & let me know if it works.

Also, it's now recommended to split the model into chunks of 256MB or 512MB. Again, see the "advanced" example:

https://github.com/ngxson/wllama/blob/d1ceeb6d38a076045262aeb18a36f7572aaa90d6/examples/advanced/index.html#L38-L45
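
For reference, loading a chunked model with a progress callback could look roughly like this. This is only a sketch modelled on the linked example: the shard URLs are placeholders, `wllama` is assumed to be an already-constructed Wllama instance, and the option/callback shape should be checked against the example for the version you use:

await wllama.loadModelFromUrl(
  [
    // hypothetical shard URLs; pass every chunk, in order
    'https://example.com/my-model-00001-of-00003.gguf',
    'https://example.com/my-model-00002-of-00003.gguf',
    'https://example.com/my-model-00003-of-00003.gguf',
  ],
  {
    n_ctx: 4096,
    // callback shape as used in the linked "advanced" example
    progressCallback: ({ loaded, total }) =>
      console.log(`Downloaded ${Math.round((loaded / total) * 100)}%`),
  }
);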

Also have a look at updated README: https://github.com/ngxson/wllama/tree/master?tab=readme-ov-file#prepare-your-model

Thank you!

ngxson • May 10 '24 09:05

The readme mentions the progress feature (very nice bonus, thank you!), but just to be sure: does this also address the memory issue? Or is the intended fix for that to make the chunks smaller?

Ah, reading it again...

Also, it's now recommended to split the model into chunks of 256MB or 512MB.

OK, I'll do that. Thank you.

flatsiedatsie • May 11 '24 07:05

I'm seeing this error after creating a chunked model of Open Buddy Mistral 7B 32k Q4_K_M with 50 x 100MB chunks:

Screenshot 2024-05-12 at 00 09 19 Screenshot 2024-05-12 at 00 13 20
		"download_url":[
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00001-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00002-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00003-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00004-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00005-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00006-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00007-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00008-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00009-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00010-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00011-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00012-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00013-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00014-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00015-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00016-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00017-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00018-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00019-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00020-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00021-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00022-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00023-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00024-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00025-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00026-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00027-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00028-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00029-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00030-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00031-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00032-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00033-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00034-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00035-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00036-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00037-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00038-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00039-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00040-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00041-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00042-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00043-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00044-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00045-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00046-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00047-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00048-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00049-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00050-of-00050.gguf",
		],

flatsiedatsie • May 11 '24 22:05

I'm seeing this error after creating a chunked model of Open Buddy Mistral 7B 32k Q4_K_M with 50 x 100MB chunks:

@flatsiedatsie FYI, I released v1.8.0, which should display a better error message (I don't know whether it fixes the mentioned issue or not). Could you try again and see what the error is? Thanks.

ngxson • May 12 '24 22:05

I will. I've been trying lots of things, actually, but unfortunately I'm still having trouble loading models that WebLLM does load.

The following screenshots aren't necessarily bugs; I've managed to solve some of them (basically by reducing the context size).

Screenshot 2024-05-13 at 00 14 11

Screenshot 2024-05-13 at 00 14 33

Screenshot 2024-05-13 at 00 15 06 Screenshot 2024-05-13 at 00 47 11

Screenshot 2024-05-13 at 00 52 58

Screenshot 2024-05-13 at 00 53 24

Screenshot 2024-05-13 at 00 53 40

flatsiedatsie • May 13 '24 11:05

I'm also still looking into your suggestion that it may be that the model is trying to load twice.

flatsiedatsie • May 13 '24 11:05

Your screenshot still shows "_wllama_decode_exception", which has already been removed in 1.8.0. Maybe your code is not using the latest version.

ngxson • May 13 '24 11:05

Correct, those are screenshots from yesterday. I'm updating it now.

flatsiedatsie • May 13 '24 13:05

OK, I've done some more testing. TL;DR: Things are running a lot smoother now! It's just the big models or big contexts that run out of memory.

But before I get into that, let me give a little context about what I'm trying to achieve. I'm trying to create a 100% browser-based online tool where people can not only chat with AI, but also use it to work on documents. For that I need two types of models:

  1. A small model with a huge context for summarization tasks.
  • Small: Danube with an 8K context is great for memory-poor mobile phones.
  • Medium: NeuralReyna is a step up, as it has a 32K context.
  • Large: Phi 3 with a 128K context.
  2. A large model with a relatively small context for more demanding tasks, like rewriting part of a document in a different tone.
  • Small: I'm not sure yet.
  • Medium: Mistral 7B with a 4K context.
  • Large: Llama 3 8B is the top of the line.

Mistral 7B with 32K context could be a good "middle of the road do-it-all" option, so I've been trying to run that with Wllama today.

I started with by using your example code to eliminate the possiblity of bugs in my project being the cause of issues. I also rebooted my laptop first (Macbook Pro with 16Gb of ram) to have as much available memory as possible. Once I found that I got the same results with the example as with my code, I mostly reverted back to my project.

  1. Qwen 0.5 (GGUF)

The only model I've been able to get to work with a 16K context. It crashes on its theoretical maximum, 32K.

Screenshot 2024-05-13 at 21 17 40

  2. NeuralReyna

In my main code I can now load NeuralReyna. However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

  3. Phi 3 - 4K

I chunked it into 250MB parts, and it loads! Nice!

  4. Phi 3 - 128K

Here I tried to directly load a 1.96GB .gguf file (Q3_K), and even that worked! This is pretty great, as Llama.cpp support for this model is right around the corner.

To be clear, I used it with a 4K context, since Llama.cpp doesn't support a bigger context yet.

  5. Mistral 7B - 32K

This model has memory issues. To make sure it wasn't my code, I tried loading the model in the advanced example too. Same result. Even setting the context to 1K doesn't help. The chunks I'm using are available here: https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked

With version 1.8, Wllama doesn't seem to raise an error though? It just states the issue in the console, so my code thinks the model has loaded OK even though it hasn't. Is there a way to get the failed state?

Screenshot 2024-05-13 at 18 56 48

In summary, only the bigger models/contexts now seem to run into issues.

  • You could argue the model is just "too big". But from using WebLLM I know that it should be possible to run it in the browser with memory to spare. Similarly, the even bigger Llama 3 8B 4K can run under WebLLM. And since Macs have unified memory, I can't blame it on WebLLM offloading it to graphics card memory, right?

  • You could argue "Well, run Mistral 7B through WebLLM then". But WebLLM only runs when WebGPU is available. It would be awesome to seamlessly switch between Wllama and WebLLM in the background, depending on WebGPU support (a quick detection sketch is shown below).
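
(For what it's worth, a minimal sketch of how that switch could be decided. navigator.gpu and requestAdapter() are the standard WebGPU detection APIs; the two loader functions are hypothetical placeholders, not real wllama or WebLLM calls:)

async function pickBackend() {
  // requestAdapter() resolves to null when no suitable GPU is available
  const adapter = 'gpu' in navigator ? await navigator.gpu.requestAdapter() : null;
  // hypothetical loader functions for the two backends
  return adapter ? loadWithWebLLM() : loadWithWllama();
}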

I still have to test what happens on devices with less memory (e.g. an 8GB MacBook Air).

Finally, I just want to say: thank you for all your work on this! It's such an exciting development. Everybody talks about bringing AI to the masses, but too few people realize the browser is the best platform to do that with. Wllama is awesome!

flatsiedatsie • May 13 '24 17:05

Just a quick check:

  • Is it reasonable to set the n_ctx and n_seq_max to the same value? In the advanced example you only seem to set n_ctx. Do you recommend doing the same?
  • Is it reasonable to have n_batch hardcoded at 1024?

flatsiedatsie • May 13 '24 20:05

Thank you for the very detailed info!

It's true that we will definitely struggle with the memory issue, because AFAIK browsers do have some limits on memory usage. Optimizing memory usage will surely be an area I'll need to invest my time into.

However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

FYI, n_ctx doesn't have to be a power of 2. It can be a multiple of 1024, for example 10 * 1024 (= 10K).

Another trick to reduce memory usage is to use the q4_0 quantization for cache_type_k, for example:

await wllama.loadModelFromUrl(MODEL, {
  n_ctx: 10 * 1024,      // any multiple of 1024 works
  cache_type_k: 'q4_0',  // quantize the KV cache keys to 4-bit to save memory
});

WebLLM offloading it to graphics card memory, right?

Yes, WebLLM offloads the model weights and KV cache to the GPU (not just Apple silicon, but also NVIDIA/AMD/Intel Arc GPUs). I couldn't find on Google what the hard limit for WebGPU memory is, so I suppose it can use all available GPU VRAM.

It would be ideal to have WebGPU support built directly into llama.cpp itself, but that's far too complicated, so for now there's not much choice left for us.

  • Is it reasonable to set the n_ctx and n_seq_max to the same value? In the advanced example you only seem to set n_ctx. Do you recommend doing the same?

n_seq_max is always 1 and should not be modified (I should remove it in the next release). The reason is that n_seq_max controls the number of sequences that can be processed in one batch. This is only useful when you have a big server that processes multiple requests at the same time (provided the server has a beefy NVIDIA GPU). In our case, there is only a single user at a time, so multi-sequence processing would decrease performance.

  • Is it reasonable to have n_batch hardcoded at 1024?

If you're not using the model for embeddings, 1024 is probably fine. However, embedding models like BERT are non-causal, meaning they need n_batch to be bigger than the sequence length.
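
For illustration, the difference could look like this (illustrative values only, not a verified recipe; the model URL constants are placeholders):

// Chat with a causal model: tokens are decoded incrementally,
// so a moderate n_batch such as 1024 is generally fine.
await wllama.loadModelFromUrl(CHAT_MODEL_URL, {
  n_ctx: 8 * 1024,
  n_batch: 1024,
});

// Embeddings with a non-causal model (e.g. BERT-style): the whole input is
// processed in a single batch, so n_batch must cover the longest input.
await wllama.loadModelFromUrl(EMBEDDING_MODEL_URL, {
  n_ctx: 2048,
  n_batch: 2048,
});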

ngxson • May 13 '24 21:05

I've got a 7B Q2_K model working! (Total file size: 2.72 GB)

I was able to use a context up to n_ctx: 9 * 1024 using cache_type_k: "q4_0".

The inference speed was around 2 tokens per second when using 6 threads.

Screenshots from console (five screenshots omitted)

I've uploaded the split-gguf here. To try it, you can use this model URL array:

Array.from({ length: 45 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-smashed-WizardLM-2-7B/resolve/main/WizardLM-2-7B.Q2_K.shard-${(i + 1).toString().padStart(5, "0")}-of-00045.gguf`)

felladrin • May 14 '24 22:05

Now I've got a 7B Q3_K_M working! (Total file size: 3.52 GB) I think the previous attempt didn't work because I was using too small a split size. I've increased it to 96 MB per chunk and it's now working.

Array.from({ length: 43 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-Mistral-7B-OpenOrca/resolve/main/Mistral-7B-OpenOrca-Q3_K_M.shard-${(i + 1).toString().padStart(5, "0")}-of-00043.gguf`)

felladrin • May 14 '24 23:05

*stops watching this space ;-)

flatsiedatsie • May 15 '24 20:05

I'm not as lucky it seems. The 7B Q3_K_M with 4K context:

Screenshot 2024-05-15 at 23 17 00

Could it be that Wllama doesn't allow swap to be used?

flatsiedatsie • May 15 '24 21:05

@flatsiedatsie, please confirm if you have set cache_type_k: "q4_0" when loading the model. It seems to be failing due to cache_type_k being f16, as per the screenshot.

felladrin • May 15 '24 21:05

@felladrin You're right! I accidentally had that commented out for some testing.

And.. it's working!!

Thank you both so much! Mistral! On CPU! In the browser! This is a game changer!

flatsiedatsie • May 16 '24 07:05

Does n_batch have an effect on memory consumption? Should I set it lower than 1024 for lower contexts? Or is 1024 generally safe?

flatsiedatsie • May 16 '24 07:05

I'm happy to see it too!

I usually leave n_batch unset. By default it is filled with the same value as n_ctx, and I haven't had memory problems because of it. But I use a low n_ctx for my case, 2048; I don't know how it affects memory when the context is larger.

felladrin • May 16 '24 07:05