Cannot start on Ubuntu 24.04: "RuntimeError: unable to mmap ... Cannot allocate memory (12)"
I'm very new to this, and it's possible that I am missing the obvious.
Ubuntu 24.04
$ uname -a
Linux benj-pc 6.8.0-40-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 5 10:34:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ python -V
Python 3.10.14
$ lspci -vnn | grep -A 12 '\[030[02]\]' | grep -Ei "vga|3d|display|kernel"
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c7) (prog-if 00 [VGA controller])
Kernel driver in use: amdgpu
Kernel modules: amdgpu
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        10Gi       1.2Gi       206Mi        19Gi        20Gi
Swap:          8.0Gi       512Ki       8.0Gi
I followed the README instructions, and I get the following:
$ python -m flux --name flux-schnell --loop
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
Traceback (most recent call last):
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 575, in load_state_dict
return torch.load(
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1087, in load
overall_storage = torch.UntypedStorage.from_file(os.fspath(f), shared, size)
RuntimeError: unable to mmap 44541587809 bytes from file </home/benj/.cache/huggingface/hub/models--google--t5-v1_1-xxl/snapshots/3db67ab1af984cf10548a73467f0e5bca2aaaeb2/pytorch_model.bin>: Cannot allocate memory (12)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 584, in load_state_dict
if f.read(7) == "version":
File "/usr/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/benj/workspace/flux/src/flux/__main__.py", line 4, in <module>
app()
File "/home/benj/workspace/flux/src/flux/cli.py", line 250, in app
Fire(main)
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/benj/workspace/flux/src/flux/cli.py", line 158, in main
t5 = load_t5(torch_device, max_length=256 if name == "flux-schnell" else 512)
File "/home/benj/workspace/flux/src/flux/util.py", line 131, in load_t5
return HFEmbedder("google/t5-v1_1-xxl", max_length=max_length, torch_dtype=torch.bfloat16).to(device)
File "/home/benj/workspace/flux/src/flux/modules/conditioner.py", line 18, in __init__
self.hf_module: T5EncoderModel = T5EncoderModel.from_pretrained(version, **hf_kwargs)
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3738, in from_pretrained
state_dict = load_state_dict(resolved_archive_file)
File "/home/benj/workspace/flux/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 596, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/home/benj/.cache/huggingface/hub/models--google--t5-v1_1-xxl/snapshots/3db67ab1af984cf10548a73467f0e5bca2aaaeb2/pytorch_model.bin' at '/home/benj/.cache/huggingface/hub/models--google--t5-v1_1-xxl/snapshots/3db67ab1af984cf10548a73467f0e5bca2aaaeb2/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
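For context, the crash happens before anything reaches the GPU: transformers is deserializing the ~44 GB T5-XXL text-encoder checkpoint (pytorch_model.bin) into host memory, and the mmap fails. A minimal sketch that isolates just that step, with the model id and dtype taken from the traceback and everything else assumed:
import torch
from transformers import T5EncoderModel

# Loads only the T5 text encoder that the traceback above dies on.
# If this also fails with "Cannot allocate memory (12)", the bottleneck is
# host memory while mapping/deserializing the ~44 GB pytorch_model.bin.
enc = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",
    torch_dtype=torch.bfloat16,  # same dtype flux's load_t5() passes in
)
print(f"loaded {sum(p.numel() for p in enc.parameters()) / 1e9:.1f}B parameters")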
RTX 4070 12G - Same problem.
At first, something takes a long time to load and takes up 9 GB of GPU memory; then the message "Loading checkpoint" appears, followed by:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 11.73 GiB of which 52.81 MiB is free
You both do not have enough VRAM. This model, even the smaller one, is still very large; as far as I understand, you really do need something like an A100 to run it. I cannot run it on my 4070 Super either, because UVM doesn't work on NVIDIA's drivers (it should work, but it doesn't), and like you, I run out of VRAM.
So if you do not have enough VRAM, or working "shared memory" (NVIDIA's is broken: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/663), you won't be able to run it on the GPU.
I can't tell you how to get shared memory working on AMD; I have never had an AMD card. Good luck.
If you want to try it without the GPU, python -m flux --name flux-schnell --loop --device cpu will let you run it. However, as you can imagine, it is very, very slow. An integrated GPU can also be used if you have one; these usually have good shared memory support.
Unfortunately, shared memory usually works fine on Windows, so you may have better luck with that + WSL2.
The real solution is for the graphics card manufacturers to fix the shared memory implementation in their Linux kernel drivers. You can send them a strongly worded email :smiley:
I have a 3060 12GB, driver version 555.58.02, and I am running Pop!_OS. Using the provided scripts gave me the same out-of-memory error; however, I was able to run it with the following script in the same venv.
import torch
from diffusers import FluxPipeline

torch.cuda.empty_cache()

# Load in bfloat16 and offload to CPU so the pipeline fits in 12 GB of VRAM.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()  # moves submodules to the GPU one at a time; slow, but low peak VRAM

prompt = "Your Prompt Here"
out = pipe(
    prompt=prompt,
    guidance_scale=1.5,
    height=768,
    width=1360,
    num_inference_steps=7,
).images[0]
out.save("image.png")
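If you specifically want the schnell checkpoint the original report was running, the same offloading approach should work. Here is an untested sketch; the repo id and the schnell-specific settings (no guidance, few steps, 256-token prompt limit) come from the FLUX.1-schnell model card and may need adjusting:
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # same trick: low peak VRAM at the cost of speed

out = pipe(
    prompt="Your Prompt Here",
    guidance_scale=0.0,        # schnell is guidance-distilled, so guidance is off
    height=768,
    width=1360,
    num_inference_steps=4,     # schnell is a few-step model
    max_sequence_length=256,
).images[0]
out.save("image_schnell.png")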
Hope this helps.
Thanks, man. It sucks that the startup examples given in the README don't take this into account; then again, it's an effective gatekeeping method (it certainly gatekept me!)