gpt4all
There's Odd RAM Usage While Mixtral 8x7b Loads & It Takes Minutes
Bug Report
Prior to v2.8.0, Mixtral 8x7b Q4_K_M (26.4 GiB) loaded quickly (about 20 seconds) on my 32 GB RAM PC. But since v2.8.0 (including v3.0.0) it takes about 5x longer to load.
I monitored RAM usage (see images): instead of simply loading like before, since v2.8.0 it loads, empties, then loads again, then RAM usage stays flat for minutes while loading continues (starting when the progress bar is around 25%). On top of that, it keeps running out of RAM during inference with nothing else loaded. Previously I could have a web browser open and it never ran out of RAM during inference (2-3 GB to spare).
Steps to Reproduce
- Load Mixtral 8x7b Q4_K_M (running on CPU)
- Watch RAM usage: it maxes out, empties, and maxes out again by the time the progress bar reaches ~25%, then takes minutes to finish loading (see attached image, because it's hard to describe).
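To reproduce the measurement without staring at Task Manager, something like the following sketch can sample peak memory while a model loads. It uses the stdlib `resource` module, which is Unix-only (`ru_maxrss` is in KiB on Linux); on Windows you would need a tool like psutil instead. The model filename and the `GPT4All(...)` call in the comment are illustrative, not taken from this thread.

```python
import resource  # Unix-only; ru_maxrss is peak RSS (KiB on Linux)
import threading
import time

def monitor_peak_rss(load_fn, interval=0.25):
    """Run load_fn while a background thread samples this process's
    peak resident set size. Returns (result, samples), where samples
    is a list of (elapsed_seconds, peak_rss_kib) tuples."""
    samples = []
    done = threading.Event()
    start = time.monotonic()

    def sampler():
        while True:
            peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            samples.append((time.monotonic() - start, peak))
            if done.wait(interval):  # stop once the load finishes
                break

    t = threading.Thread(target=sampler)
    t.start()
    try:
        result = load_fn()
    finally:
        done.set()
        t.join()
    return result, samples

# Hypothetical usage with the gpt4all Python bindings:
#   model, trace = monitor_peak_rss(
#       lambda: GPT4All("mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
#                       device="cpu"))
```

A trace whose peak sits far above the final steady-state value would show the overshoot described above.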
Expected Behavior
Loads in about 20 s (or at the speed limit of your SSD), with loading complete once RAM usage reaches its peak.
Your Environment
- GPT4All version: v2.8.0 & v3.0.0
- Operating System: Windows 11 Pro
- Device: set to CPU (rather than Auto)
The first image is the initial load (prior to the progress bar even showing). The second image is part of the second load (continues for another couple minutes like this).
This is likely due to bugs that were introduced into llama.cpp and have since been fixed upstream, but we need to update to pick up those fixes. @cebtenzzre to clarify.
I would try a smaller K-quant (K-quants are faster than IQ-quants on the llama.cpp commit GPT4All is pinned to right now). I believe the amount of RAM needed for a Q4_K_M quant is too close to your total RAM, and since your OS and other apps also need RAM, some of it gets paged out to swap space, which reduces speed.
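The headroom argument above can be made concrete with a back-of-the-envelope check: a CPU-only GGUF model needs roughly its file size in RAM plus working memory, while the OS and other apps claim their own share. The overhead figures below are illustrative guesses, not measured values.

```python
def fits_in_ram(model_file_gib, total_ram_gib,
                os_overhead_gib=4.0, inference_overhead_gib=2.0):
    """Rough heuristic: does a GGUF model fit in RAM without swapping?
    Assumes the model needs ~file size + working memory (KV cache,
    compute buffers), minus what the OS and other apps already use.
    Both overhead defaults are assumptions for illustration."""
    required = model_file_gib + inference_overhead_gib
    available = total_ram_gib - os_overhead_gib
    return required <= available

print(fits_in_ram(26.4, 32))  # Q4_K_M on 32 GB -> False (cuts it too close)
print(fits_in_ram(18.9, 32))  # Q3 on 32 GB -> True (comfortable headroom)
```

By this rough estimate the Q4_K_M quant sits right at the edge on a 32 GB machine, which matches the swapping hypothesis.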
Thanks for the tip. I tried a much smaller version of Mixtral 8x7b (Q3, 18.9 GiB), so 32 GB is far more than required (~12 GB to spare).
And as seen in the following image, it still maxes out the RAM and takes a long time to load before dropping back down to the RAM actually required for an 18.9 GiB LLM.
Is it trying to load two copies or something? I have an integrated GPU that shares system RAM. Is it also trying to load the model into both system and GPU memory simultaneously, despite my selecting CPU only?
I've observed this effect too: I've been using Mixtral 8x7b Q4_K_M on a 32 GB machine, and loading times and inference were normal before v2.8.0. In v2.8.0 there was a fairly drastic slowdown when loading the model, but once it was fully loaded, inference worked fine. In my experience, the loading slowdown affects only the MoE models (e.g. MixTAO-7Bx2-MoE-v8.1).
@brankoradovanovic-mcom I'm glad it's not just me. Other MoE models seem to load more slowly, but don't show the same pattern of using far more RAM than they need and then coming back down. For example, I tried MixTAO-7Bx2-MoE-v8.1 Q5 from [ZeroWw] and the memory usage during loading was normal.
I measured the loading times of some models on my PC (8 cores, 32 GB RAM, Windows 10). All quants below are Q5_K_M, except Mixtral (Q4_K_M):
- bagel-34b-v0.2 (dense 34b model) - 5 sec
- MixTAO-7Bx2-MoE-v8.1 - 8 sec
- Beyonder-4x7B-v3 - 14 sec
- Mixtral-8x7B-Instruct-v0.1 - 42 sec (with heavy swapping)
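For anyone who wants to reproduce these numbers, a minimal timing harness might look like the following sketch. The `GPT4All(...)` call in the docstring is a hypothetical example of a loader; any zero-argument callable that loads a model works.

```python
import time

def time_loads(loaders):
    """Time each named model load once.

    `loaders` maps a model name to a zero-arg callable that loads and
    returns the model, e.g. lambda: GPT4All("bagel-34b-v0.2.Q5_K_M.gguf").
    Returns a dict of {name: load_time_seconds}."""
    timings = {}
    for name, load in loaders.items():
        start = time.monotonic()
        load()  # the model object is discarded; we only want the duration
        timings[name] = time.monotonic() - start
    return timings
```

Note that a fair comparison needs a cold file cache (or a reboot between runs), since a model already in the OS page cache loads much faster.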
So, while MoE models are somewhat slower to load, ThiloteE's analysis above is quite correct: the excessive loading time is largely due to swapping, so the solution is to either use a smaller quant or get more RAM.
Still, I wonder why this wasn't a problem at all in earlier versions...
Mixtral-8x7B-Instruct-v0.1 - 42 sec (with heavy swapping)
Mixtral-8x7B-Instruct-v0.1 Q4_K_M loads in 14 seconds on my Ryzen 5 5500U laptop (6 cores, 64 GB RAM, Linux, SSD).
I tried a clean boot with and without the page file enabled, then loaded either Mixtral 8x7b Q3 or Yi-34 Q4 (both ~18 GB).
As seen in the following images, the RAM-overshoot issue appears to be unique to Mixtral and doesn't appear to be caused by the size of the LLM relative to available RAM.
The first image is Yi (loads normally to the required amount, then stops). The second is Mixtral (maxes out RAM for a couple of minutes first).
I think this issue can be closed.
I tested Mixtral with the new Koboldcpp using the latest llama.cpp and it had the same issue (slow loading with maxed-out RAM before returning to normal RAM usage), apparently ruling out a GPT4All-specific bug.
I then read that llama.cpp patched Mixtral a while back, so I downloaded multiple Mixtral GGUF builds, and the newer build (made 26 days ago) loads normally. So the patch appears to have given previously made Mixtral GGUFs this issue, but not subsequent builds.
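When comparing an old GGUF build against a newer one, it can help to check the file-format version baked into each file. The GGUF header layout is public (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count for version 2 and later), so a minimal reader is a few lines; this is just an inspection aid, not a fix.

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header fields per the public GGUF spec
    (version >= 2): magic b'GGUF', uint32 version, uint64 tensor
    count, uint64 metadata kv count, all little-endian."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path}: not a GGUF file")
        version, = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": n_tensors, "kv_pairs": n_kv}
```

If the "bad" and "good" Mixtral files report different versions (or differ in metadata), that would point at the conversion vintage rather than the app.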
Update: In case anybody's interested, I confirmed the following Mixtral builds don't have this issue. The first link is Mixtral Instruct v0.1, and the second is Nous Hermes 2 Mixtral DPO.
https://huggingface.co/matteocavestri/Mixtral-8x7B-Instruct-v0.1-Q4_K_M-GGUF/tree/main
https://huggingface.co/mradermacher/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF
Mixtral-8x7B-Instruct-v0.1 - 42 sec (with heavy swapping)
Mixtral-8x7B-Instruct-v0.1 Q4_K_M loads in 14 seconds on my Ryzen 5 5500U laptop (6 cores, 64 GB RAM, Linux, SSD).
Finally I got the chance to try the same again on my machine, but this time with 64 GB RAM. It loads in 13 seconds now, with a rather strange memory graph, as illustrated below. It ends at 32.5 GB once the model is loaded, but the peak is far above that value. This indeed seems like an upstream bug of some sort, as described earlier. Only the MoE models seem to be affected.
On my Linux machine it goes straight to 33 GB of RAM:
Edit: in case the version matters, I think the Mixtral Instruct was from here: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
Edit 2: GPT4All only uses the CPU (no GPU).
Thanks @brankoradovanovic-mcom and @SuperUserNameMan for taking the time to test with 64 GB of RAM on both Windows and Linux.
Apparently this bug doesn't affect Linux. On Win11 I tested numerous early M8x7b builds, including TheBloke's, and they all caused a RAM overshoot during loading.
However, none of the newer M8x7b builds I tested overshoot RAM when loading on Win11, including the two I linked in my previous comment (Instruct & Nous Hermes 2).
Lastly, this holds across apps, such as Koboldcpp with the latest llama.cpp, so it appears to be a llama.cpp bug: after an update at some point, past builds of M8x7b that used to load normally started to overshoot RAM on at least Win11, while subsequent builds did not.
The new v3.1.0 is built on an updated version of llama.cpp. Can you check how that works for you?
@cosmic-snow I just tested several M8x7b builds with v3.1.0 on W11.
As with the last version, the newer M8x7b builds load properly (including the two I linked above), while the older builds (e.g. TheBloke's) overshoot RAM usage and take much longer to load before coming back down to their actual RAM usage (which wasn't previously the case).
So it appears that a llama.cpp update introduced a loading bug for all previously made M8x7b builds, one that doesn't exist in subsequently made builds.
Note: the same issue occurs with other apps (e.g. Koboldcpp), and according to @SuperUserNameMan it doesn't appear to occur on Linux, so it might be specific to Win11.
Alright, thanks for confirming.
I'm closing this issue, because the solution is simply to use newer builds of Mixtral 8x7b Instruct, or of Mixtral fine-tunes such as Nous Hermes 2.