stable-diffusion-webui
[Bug]: slow loading .safetensors when switching to a new model
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
Is this still a known issue? .safetensors files load very slowly (up to 300 seconds) when you switch models in the model dropdown. Models don't load slowly if you restart the whole webui.
Steps to reproduce the problem
Switch to another .safetensors model in model dropdown
What should have happened?
Tested on 1.3.2 and also on 1.4.0-RC (Windows WSL2)
Commit where the problem happens
v1.3.2 and above
What Python version are you running on ?
Python 3.10.x
What platforms do you use to access the UI ?
Windows
What device are you running WebUI on?
Nvidia GPUs (RTX 20 above)
What browsers do you use to access the UI ?
Google Chrome
Command Line Arguments
--listen --xformers --autolaunch --enable-insecure-extension-access --api --deepdanbooru --opt-split-attention --opt-channelslast
List of extensions
No
Console logs
Using TCMalloc: libtcmalloc.so.4
Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0]
Version: v1.3.2
Commit hash: baf6946e06249c5af9851c60171692c44ef633e0
Installing requirements
Launching Web UI with arguments: --listen --xformers --autolaunch --enable-insecure-extension-access --api --deepdanbooru --opt-split-attention --opt-channelslast
Loading weights [49ef66fc4c] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/kotosmix_v10.safetensors
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Startup time: 12.7s (import torch: 0.8s, import gradio: 0.7s, import ldm: 0.8s, other imports: 0.8s, opts onchange: 5.7s, list SD models: 0.6s, scripts list_optimizers: 0.2s, create ui: 0.3s, gradio launch: 2.6s, scripts app_started_callback: 0.1s).
Creating model from config: /home/dragon/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading VAE weights specified in settings: /home/dragon/stable-diffusion-webui/models/VAE/vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Model loaded in 21.1s (load weights from disk: 3.1s, create model: 0.6s, apply weights to model: 15.7s, apply channels_last: 0.3s, apply half(): 0.2s, load VAE: 0.3s, move model to device: 0.7s, load textual inversion embeddings: 0.2s).
Loading weights [af220b387c] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/epicrealism_newCentury.safetensors
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 193.7s (load weights from disk: 3.9s, apply weights to model: 189.2s, move model to device: 0.6s).
Additional information
No response
This is normal speed if you are reading the model from an HDD.
It's on an SSD, and .ckpt files load fast; only .safetensors load extremely slowly. The first one on startup loaded in 21 seconds and the other one in 193 seconds?
.ckpt is similar to .safetensors in speed; it only depends on the storage location.
You should test the read/write speed of the storage device.
I did even more tests. The issue only happens with .safetensors, not with .ckpt, no matter whether I use an HDD or an SSD (of course the SSD is a bit faster), but even on the HDD a .ckpt loads in just 26 seconds, not 300. And if I check the log output, most of the time goes to "apply weights to model"; that's not loading the model from the drive, is it? And if it's the drive, then why is it fast when starting the webui?
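(For anyone who wants to check where the time actually goes, a rough timing sketch like the one below separates raw disk throughput from the safetensors loading path. The checkpoint path is just one of the models from the logs above, and only the safetensors package is assumed.)

```python
# Rough comparison of raw sequential read speed vs. the safetensors loading path.
# The path below is just an example taken from the logs in this thread.
import time
import safetensors.torch

checkpoint = "/home/dragon/stable-diffusion-webui/models/Stable-diffusion/epicrealism_newCentury.safetensors"

t0 = time.perf_counter()
with open(checkpoint, "rb") as f:
    data = f.read()  # one big sequential read, no memory mapping
t1 = time.perf_counter()
print(f"raw read: {len(data) / 1e6:.0f} MB in {t1 - t0:.1f}s")

t0 = time.perf_counter()
# Default (mmap-based) path; note it benefits from the page cache warmed by the read above.
state_dict = safetensors.torch.load_file(checkpoint, device="cpu")
t1 = time.perf_counter()
print(f"load_file: {len(state_dict)} tensors in {t1 - t0:.1f}s")
```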
@Narsil Maybe you have an idea what's going on? Could this be related to some NVIDIA driver issue?
I don't know what it could be.
The first load is fast, then subsequent loads are slow.
This is odd indeed, since normally it should be the other way around (the first load actually has to read from disk, while subsequent loads should come from RAM if the model still fits in the page cache).
Scenarios I can imagine:
- On first load, the model you are loading was already loaded previously and is therefore fast, while during regular webui usage you swap between multiple models, evicting the safetensors one from the cache and leading to slower loads (300s seems insanely slow regardless)
- Bug in Windows memory mapping (we leverage torch.Storage; we do not call mmap directly in safetensors, unless you have a very old torch version which doesn't have torch.Storage). That one can be checked by replacing `safetensors.torch.load_file(filename)` with `safetensors.torch.load(open(filename, 'rb').read())`; you would then be avoiding memmap entirely (see the sketch after this list).
- Bad disk sector? Not sure how this applies at all anymore, but I remember in the old days you could have bad sectors on your disk that Windows would constantly have issues with. (It's very much not my area, but on Linux it sometimes happens that some sectors get banned entirely.) I have no idea how we could confirm or rule out that hypothesis.
- f16/f32 conversion. Another option could be that the weights are stored in f32 and load super fast, but take super long because they are applied to an f16 model (or vice versa). Either torch tries to convert one into the other, which is slow, or it has to reallocate a bunch of things to make it fit.
- CPU -> CUDA order, an nvidia thing. Another thing I would try out is loading all weights on CPU and then moving everything to CUDA: `weights = safetensors.torch.load_file(filename, device="cpu"); weights = {k: v.to("cuda:0") for k, v in weights.items()}` vs. loading directly on CUDA: `weights = safetensors.torch.load_file(filename, device="cuda:0")`. I've seen this impact performance, though never quite by a 10x factor, but if it's linked to some bad nvidia driver thing, it might be that.
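To make those suggestions concrete, here is a minimal, self-contained sketch of the loading variants worth comparing; `filename` is a placeholder and only the safetensors package is assumed.

```python
# Minimal sketch of the loading variants discussed above.
import safetensors.torch

filename = "model.safetensors"  # placeholder path

# 1) Default path: memory-mapped load.
weights = safetensors.torch.load_file(filename, device="cpu")

# 2) Non-mmap path: read the whole file into memory first, then parse the bytes.
with open(filename, "rb") as f:
    weights = safetensors.torch.load(f.read())

# 3a) Load on CPU, then move every tensor to CUDA afterwards.
weights = safetensors.torch.load_file(filename, device="cpu")
weights = {k: v.to("cuda:0") for k, v in weights.items()}

# 3b) Load directly onto CUDA.
weights = safetensors.torch.load_file(filename, device="cuda:0")
```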
If you can try a few of the things I mention here, I could try looking into it, but my Windows knowledge is rather limited, and I have never reproduced anything of the sort.
Another note: memmapping is super efficient for local disks, but if you're actually running on a mounted network partition, the memmap variant might trigger a lot more reads, which are individually slower. (Scenario 2 will help solve it.)
@Narsil
replacing
pl_sd = safetensors.torch.load_file(checkpoint_file, device=device)
by
pl_sd = safetensors.torch.load(open(checkpoint_file, 'rb').read())
fixes the issue, seems you were right regarding that mem-mapping thing.
FYI, setting export NO_TCMALLOC="True" in webui-user.sh doesn't change anything (I was just wondering because of the tcmalloc messages that appear after the change).
Startup time: 7.8s (import torch: 0.7s, import gradio: 0.7s, import ldm: 0.7s, other imports: 0.7s, opts onchange: 0.8s, list SD models: 0.5s, load scripts: 0.9s, create ui: 0.6s, gradio launch: 1.4s, scripts app_started_callback: 0.6s).
Creating model from config: /home/dragon/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading VAE weights specified in settings: /home/dragon/stable-diffusion-webui/models/VAE/vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Textual inversion embeddings loaded(16): 3nid_14, 3nid_14-light, 3nid_15, 3nid_15-light, AriGra-15000, ChlGraMtz-30000, HitomiTanaka-12500, ImaVel-11000, KirShip-14200, kodakvision, KriSte-30000, MilBobBrn-14500, SarHyl-10000, SydSwe, TayTay-15000, Zendy-28000
Model loaded in 6.1s (load weights from disk: 2.7s, create model: 0.7s, apply weights to model: 0.9s, apply channels_last: 0.3s, apply half(): 0.3s, load VAE: 0.4s, move model to device: 0.6s, load textual inversion embeddings: 0.2s).
Loading weights [f44ba7cd90] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/hassanblend1512AndPrevious_hassanblend15.safetensors
tcmalloc: large alloc 8195555328 bytes == 0x7f4b9068e000 @ 0x7f4eeac37680 0x7f4eeac58824 0x62e02b 0x67f131 0x53a9bc 0x643f75 0x686145 0x53fa96 0x5a8d85 0x628b20 0x5a897b 0x628b20 0x5a897b 0x628b20 0x5ad363 0x628b20 0x6287fa 0x5ab27b 0x628b20 0x5ad363 0x5484aa 0x5ad363 0x628b20 0x5a9a35 0x628b20 0x643f75 0x43d673 0x53a554 0x6287fa 0x5ae7ee 0x628b20
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 10.2s (load weights from disk: 9.1s, apply weights to model: 0.3s, move model to device: 0.6s).
Loading weights [f94d96ebdc] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/hassanblend1512AndPrevious_hassanblend1512.safetensors
tcmalloc: large alloc 4097941504 bytes == 0x7f4b8b3be000 @ 0x7f4eeac37680 0x7f4eeac58824 0x62e02b 0x67f131 0x53a9bc 0x643f75 0x686145 0x53fa96 0x5a8d85 0x628b20 0x5a897b 0x628b20 0x5a897b 0x628b20 0x5ad363 0x628b20 0x6287fa 0x5ab27b 0x628b20 0x5ad363 0x5484aa 0x5ad363 0x628b20 0x5a9a35 0x628b20 0x643f75 0x43d673 0x53a554 0x6287fa 0x5ae7ee 0x628b20
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 17.2s (load weights from disk: 16.3s, apply weights to model: 0.2s, move model to device: 0.6s).
Loading weights [2bd89c4fad] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/lazymixRealAmateur_v20.safetensors
tcmalloc: large alloc 2132680704 bytes == 0xeeba000 @ 0x7f4eeac37680 0x7f4eeac58824 0x62e02b 0x67f131 0x53a9bc 0x643f75 0x686145 0x53fa96 0x5a8d85 0x628b20 0x5a897b 0x628b20 0x5a897b 0x628b20 0x5ad363 0x628b20 0x6287fa 0x5ab27b 0x628b20 0x5ad363 0x5484aa 0x5ad363 0x628b20 0x5a9a35 0x628b20 0x643f75 0x43d673 0x53a554 0x6287fa 0x5ae7ee 0x628b20
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 9.5s (load weights from disk: 8.7s, apply weights to model: 0.2s, move model to device: 0.6s).
Loading weights [33c9f6dfcb] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/majicmixRealistic_v5.safetensors
tcmalloc: large alloc 2400043008 bytes == 0x7f4b8b3be000 @ 0x7f4eeac37680 0x7f4eeac58824 0x62e02b 0x67f131 0x53a9bc 0x643f75 0x686145 0x53fa96 0x5a8d85 0x628b20 0x5a897b 0x628b20 0x5a897b 0x628b20 0x5ad363 0x628b20 0x6287fa 0x5ab27b 0x628b20 0x5ad363 0x5484aa 0x5ad363 0x628b20 0x5a9a35 0x628b20 0x643f75 0x43d673 0x53a554 0x6287fa 0x5ae7ee 0x628b20
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 2.5s (load weights from disk: 1.8s, apply weights to model: 0.2s, move model to device: 0.4s).
Loading weights [17d48dc743] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/pornmasterPro_fp32V2.safetensors
tcmalloc: large alloc 4265148416 bytes == 0x7f4b8b3be000 @ 0x7f4eeac37680 0x7f4eeac58824 0x62e02b 0x67f131 0x53a9bc 0x643f75 0x686145 0x53fa96 0x5a8d85 0x628b20 0x5a897b 0x628b20 0x5a897b 0x628b20 0x5ad363 0x628b20 0x6287fa 0x5ab27b 0x628b20 0x5ad363 0x5484aa 0x5ad363 0x628b20 0x5a9a35 0x628b20 0x643f75 0x43d673 0x53a554 0x6287fa 0x5ae7ee 0x628b20
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 17.8s (load weights from disk: 16.9s, apply weights to model: 0.3s, move model to device: 0.5s).
Loading weights [99fd5c4b6f] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/seekArtMEGA_mega20.safetensors
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 9.1s (load weights from disk: 8.4s, apply weights to model: 0.2s, move model to device: 0.4s).
Loading weights [24a393500f] from /home/dragon/stable-diffusion-webui/models/Stable-diffusion/perfectWorld_v4Baked.safetensors
tcmalloc: large alloc 4265099264 bytes == 0x7f4b8b3be000 @ 0x7f4eeac37680 0x7f4eeac58824 0x62e02b 0x67f131 0x53a9bc 0x643f75 0x686145 0x53fa96 0x5a8d85 0x628b20 0x5a897b 0x628b20 0x5a897b 0x628b20 0x5ad363 0x628b20 0x6287fa 0x5ab27b 0x628b20 0x5ad363 0x5484aa 0x5ad363 0x628b20 0x5a9a35 0x628b20 0x643f75 0x43d673 0x53a554 0x6287fa 0x5ae7ee 0x628b20
Loading VAE weights specified in settings: cached vae-ft-mse-840000-ema-pruned.ckpt
Applying optimization: xformers... done.
Weights loaded in 18.0s (load weights from disk: 17.0s, apply weights to model: 0.3s, move model to device: 0.5s).
So is that something you can fix in safetensors, or do we need some option in webui to allow an alternative loading method?
FYI my webui is running on Ubuntu 20.04 inside WSL2 on Windows 11 (so it's not directly Windows, but accessing Windows drivers).
So is that something you can fix in safetensors, or do we need some option in webui to allow an alternative loading method?
Unfortunately, this might be a WSL/Windows thing, not really a safetensors (or webui) thing. It would be the same if you had a network-mounted device.
mmap (the first load method) can be pretty much instant (like 100us) to load on CPU, while the second method, load(open().read()), is going to be comparatively slow (at the very least, the OS has to allocate memory and copy the file into the process's memory).
So on devices that support it well, mmap is just always better (since it can skip giving memory to user space and directly map kernel pages, for instance). That being said, mmap can potentially issue a lot more reads if that property doesn't hold (which definitely can be the case in WSL, since Linux and Windows most likely handle it differently). There is no way to know about that from within safetensors or webui (afaik).
Reading the whole file is better in your case, because you're essentially issuing a single read, which plays better when the per-read overhead is high.
I think having a flag within webui to switch loading methods could be a thing (so users that benefit from the mmap speed can keep it, and you can opt out so you don't suffer on your platform). If there's a way to detect WSL, and WSL is the issue, maybe webui can pick a better default. In general this is not possible within safetensors, because network-mounted disks also exist, and I'm sure there's no way to detect those (since, well, the whole point of a mounted network disk is to pretend to be a regular disk); for those you just need to use the non-mmap version.
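To illustrate the kind of switch being suggested, here is a minimal sketch of an opt-out flag around the two loading paths; the function and flag names are hypothetical, not webui's actual API (the setting that webui eventually added is called disable_mmap_load_safetensors, see further down in this thread).

```python
# Hypothetical sketch of an opt-out flag for mmap-based loading; the names
# load_safetensors_state_dict and disable_mmap are illustrative only.
import safetensors.torch


def load_safetensors_state_dict(checkpoint_file, device="cpu", disable_mmap=False):
    if disable_mmap:
        # Single big read, no memory mapping; helps on WSL and network mounts.
        with open(checkpoint_file, "rb") as f:
            return safetensors.torch.load(f.read())
    # Default: memory-mapped load, usually fastest on well-behaved local disks.
    return safetensors.torch.load_file(checkpoint_file, device=device)
```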
OK, thanks, no problem. I've created a PR; hope it gets accepted.
@Sakura-Luna So can we please agree that this is a bug? I've added a PR for it: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/11260
Adding --lowram fixed slow loading for me a while ago; I removed it recently to test, and models loaded slower again, and this is without WSL too. I have 64GB of RAM and a 3080 12GB. Maybe it's some weird Windows memory-mapping issue, and having more RAM above a certain amount makes them load slower, for whatever reason.
Interesting, I also have 64GB.
The proposed PR fixes it without the need to set --lowram
So can we please agree that this is a bug? I've added a PR for it: #11260
This should be regarded as a defect introduced upstream.
There was a PR about this before, but it was closed.
@freecoderwaifu I've tried the --lowram option, but it doesn't fix my issue; it's still slow after enabling it.
These are my changes in the webui-user.bat
set COMMANDLINE_ARGS= --xformers --no-hashing --opt-channelslast --no-download-sd-model --lowram
set SAFETENSORS_FAST_GPU=1
Setting SAFETENSORS_FAST_GPU=1 is a big speedup too.
Average load speed for me:
Weights loaded in 11.4s (load weights from disk: 10.1s, apply weights to model: 0.5s, load VAE: 0.4s, move model to device: 0.4s).
Without both --lowram and set SAFETENSORS_FAST_GPU=1 it goes above 1 minute, sometimes close to 2.
It's still good that you're addressing it with your PR, since I'm not sure what else --lowram affects aside from forcing models to load into VRAM (as per the description in the wiki); I haven't had any other issues since using it, so I don't really know.
set SAFETENSORS_FAST_GPU=1
This one shouldn't have any effect anymore for versions > 0.3.0... odd.
Fixed in dev after merge of the PR (14196548c55dfe4775c96bdb939ce1a150933393)
Still not fixed for me in v1.5.1. I have to use --lowram despite having 64 GB of DDR5 memory; otherwise switching between .safetensors checkpoints takes ~70s, with 60s+ of that time spent applying weights to the model. With --lowram, switching takes ~1.5s, applying weights to the model in 0.3s. I'm using a 13900K and an RTX 4090, if that matters.
The issue seems to stem from WSL and memory mapping not playing along very well:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/11216
Can you confirm ?
I initially thought it was a WSL issue, but it's a general Windows issue, as it also happens without WSL.
Does item 2 from here https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/11216#issuecomment-1593378136 help ?
If so it's definitely a memory map issue, but what's really odd is that I'm never able to reproduce it (I'm using Windows in the cloud because I don't own any such machine anymore :( )
Does item 2 from here #11216 (comment) help ?
That's exactly what the added option does; for me it completely fixed the issue. Did you try the "Disable memmapping for loading .safetensors files" option in settings?
I was having the same issue, but it would take 500+ seconds to load a .safetensors file. Changing the option in the WebUI under Settings > System > Disable memmapping for loading ..... brought the time down to the 60-80s range. My files are on a standard HDD though, and I will likely move them to an SSD soon.
Same issue here; I load models from a network device. The speed is unstable, between 20-100 MB/s, without the --lowram option. After adding the --lowram option, the speed is about 150-200 MB/s, which I think is the normal speed it should be.
I use a 2.5G Ethernet adapter, and an iperf test shows the ethernet speed is fine. And when I copy that model directly to my local disk, the speed is about 150-200 MB/s, the same as webui with the --lowram option.
Maybe the issue still exists.
This is loading time log without lowram option.
This is loading time log with lowram option.
This is consistent with the network response.
AFAICT (it's been a while since I looked at the Python source code, because a different UI was being slow), Python's mmap is just a blind implementation of POSIX mmap, without things like "advice" (madvise) etc., calling into the Windows MapViewOfFile API. Maybe more importantly, Windows memory-mapped files support the same complicated set of ACLs that regular files do, which I could see causing problems. Windows would be able to fully optimize a file mapped for writing or for reading only (not copy-on-write) and with security set to the current user, but Python opens them without an ACL (I think; again, it's been a minute) and in copy-on-write mode, so Windows has to assume any process can write into the memory and has to transparently make a copy of the whole mess to keep the originally loaded file intact. If something starts writing into the mapped file to patch it or whatever before it has fully loaded... who knows. If someone, or the OEM they bought the machine from, turned on memory de-duplication, it would really start killing performance around then, but that should be off by default even on Server. Compression shouldn't kick in that fast.
Usually Windows software is either written to do all of this correctly or at least handle it a bit better.
Add to that the fact that "huge pages / large pages" aren't enabled by default on Windows (you'll need to google that; Microsoft explains how to enable it, since it's a user permission), so memory pages end up defaulting to a very small size and there's a lot of activity going on creating page tables for the whole mess.
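For reference, Python's mmap module does expose the access mode explicitly; the sketch below only illustrates the read-only vs. copy-on-write distinction described above and is not a claim about what torch or safetensors actually do internally.

```python
# Illustration of Python mmap access modes; purely illustrative, not what
# torch/safetensors necessarily do under the hood. The filename is a placeholder.
import mmap

with open("model.safetensors", "rb") as f:
    # Read-only mapping: the OS can share pages freely, with no copy-on-write bookkeeping.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    m.close()

with open("model.safetensors", "rb") as f:
    # Copy-on-write mapping: writes stay private to this process and the file is untouched,
    # but the OS must be prepared to duplicate any page that gets written.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    m.close()
```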
Hi guys, what is the "disable memmapping" setting, and how do I pass it to launch.py? I did not find any docs.
It does what is discussed above, improving the loading speed of .safetensors files. It's not a launch parameter; it has to be enabled in the settings.
It does what is discussed above, improving the loading speed of .safetensors files. It's not a launch parameter; it has to be enabled in the settings.
Thanks for your quick reply. I am new to sd-webui; are there any examples of the setting (what it is and how to pass it)?
for example:
@yang-zhiying Hi Zhiying, sorry for the wrong description of my question. I want to know how to do this when launching the API.
Sorry for my wrong answer. You can edit config.json and change the disable_mmap_load_safetensors setting from false to true.
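For the API / headless case, a minimal sketch of flipping that setting in config.json from a script; it assumes the default config.json in the webui root directory and should be run while webui is not running.

```python
# Flip webui's "disable memmapping for .safetensors" setting in config.json.
# Assumes the default config.json in the stable-diffusion-webui root; run while webui is stopped.
import json

with open("config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["disable_mmap_load_safetensors"] = True  # opt out of mmap-based .safetensors loading

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=4)
```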