Option to disable mmap for safetensors loading for network storage users
Hi - a few weeks ago I opened an issue about a CPU bottleneck, and I've finally found the root cause. It wasn't really a CPU bottleneck - it was the CPU frantically managing mmap over a network volume.
For network storage, the code in comfy/utils.py line 13,
sd = safetensors.torch.load_file(ckpt, device=device.type)
uses mmap, and on network volumes this is hugely inefficient - roughly a 30-50x slowdown. Loading a single SDXL safetensors file over a network volume takes 1-2 seconds with the workaround below, but 40-50 seconds with the vanilla call above.
I hacked together this:
try:
    # Read the whole file into memory and parse it, bypassing mmap.
    with open(ckpt, 'rb') as f:
        sd = safetensors.torch.load(f.read())
except Exception:
    # Fall back to the mmap-based loader for checkpoints the above can't handle.
    sd = safetensors.torch.load_file(ckpt, device=device.type)
This worked on my SDXL safetensors, while still falling back to the normal path for certain ControlNet checkpoints.
This issue has already been referenced in https://github.com/comfyanonymous/ComfyUI/issues/1992#issuecomment-1817797912
I think a way to disable mmap in the loading path above is necessary; otherwise models are extremely inefficient to load on any cloud platform that runs on K8s with network PVCs.
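For what it's worth, here is a minimal sketch of what such an opt-out could look like. The COMFYUI_DISABLE_MMAP environment variable and the load_safetensors wrapper name are placeholders I made up, not anything ComfyUI actually exposes:

import os
import safetensors.torch

def load_safetensors(ckpt, device):
    # Hypothetical opt-out: read the whole file into memory and parse it,
    # bypassing mmap. On network volumes (NFS / K8s PVCs) this avoids the
    # page-fault churn that makes mmap-backed loading 30-50x slower.
    if os.environ.get("COMFYUI_DISABLE_MMAP", "0") == "1":
        with open(ckpt, "rb") as f:
            sd = safetensors.torch.load(f.read())
        # safetensors.torch.load() returns CPU tensors; move them if needed.
        if device is not None and device.type != "cpu":
            sd = {k: v.to(device) for k, v in sd.items()}
        return sd
    # Default path: mmap-backed loading, which is fine on local disks.
    return safetensors.torch.load_file(ckpt, device=device.type)

In practice the switch would probably be a CLI flag wired through ComfyUI's existing arguments rather than an environment variable, but the split between the two load paths is the part that matters.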
Maybe this is only part of it. I'm still seeing super long loading times (perhaps from the except branch?). It's been really hard to work with, with some requests taking 50+ seconds just for loading and graph building. I'm doing as much as I can to reuse models, like setting LoRA weights to 0.001 to try to avoid offloading, but somehow it still just keeps loading the model again.
Edit: it looks like the issues I'm running into (hanging on graph building/loading, 100% CPU usage) are related to https://github.com/comfyanonymous/ComfyUI/pull/1503 for complex workflows. I'll be experimenting with this soon.
I tried the fix; it seems like it's reusing a bit less, but graph building times are still long. I definitely think there's room for improvement there, and I'll try to simplify my workflows so it has less to do.
I'd never really read the docs for safetensors until the other day, when I was trying to convert a msgpack ControlNet into anything compatible with a normal GPU without writing anything, and it looks like the reason for that mess is that they designed .safetensors around lazy-loading the model. That's probably better for gigantic models that don't need all of their parts to function. I don't think they ever considered situations involving network connections that weren't fully offloaded to something doing RDMA, which pretty much eliminates the issue of mmapping over a network entirely. Most of the ML modules for Python were only tested on server-grade compute cards, 2-socket Epyc systems, and InfiniBand interconnects, where none of these issues show up AFAICT.
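For context, this is roughly what that lazy-loading design looks like through the safetensors API: safe_open() parses only the header up front and then reads each tensor's byte range on demand, which is exactly the access pattern that hurts over a network mount (the file name here is just an example):

from safetensors import safe_open

# Lazy access: only the header is parsed when the file is opened; each
# get_tensor() call reads just that tensor's slice of the (mmap-backed) file.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)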
Also, 1-2s for SDXL UNets from a cloud provider? Is that on some compute instance, or do you have a 40 / 56 / 100Gb connection to the internet? :-) I mean, more likely it's being cached in memory when mmap isn't, but if that's a clean load I want that connection. :-)
On your system is A1111's speed any better in this regard?
From my experience it actually takes 1-2 minutes for SDXL, as I've encountered the same issue.
The nodes will also load the model, so do I need to modify them as well?