Qwen3ForCausalLM leaks VRAM if used in multiple dataloader threads
System Info
torch==2.8.0; transformers==4.56.2 or transformers==4.57.3 (both tested)
Who can help?
@ArthurZucker @Cyrilvallez
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
Please see the repro code below. In this test case it runs out of VRAM within only a few steps; in the real use case it takes about 50 iterations. I hope it's still the same root cause, but I'm not entirely sure.
import threading

import torch
from diffusers import ZImagePipeline
from transformers import Qwen3ForCausalLM  # the pipeline's text_encoder is an instance of this class

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
)
text_encoder = pipe.text_encoder
text_encoder.to('cuda')

def run():
    tokens = torch.zeros((1, 512), device='cuda', dtype=torch.int64)
    tokens_attention_mask = torch.ones((1, 512), device='cuda')
    tokens_attention_mask[:, 200:] = 0.0
    i = 0
    while True:
        i += 1
        print(i)
        text_encoder_output = text_encoder(
            tokens,
            attention_mask=tokens_attention_mask,
            output_hidden_states=True,
            return_dict=True,
            use_cache=False,
        )

thread1 = threading.Thread(target=run)
thread2 = threading.Thread(target=run)  # <--- comment this out to see it working with no issues
thread1.start()
thread2.start()  # <--- comment this out to see it working with no issues
Expected behavior
See above; memory usage should stay stable instead of leaking VRAM.
Hmm, is this a memory leak? It feels more like running two instances simultaneously will eventually cause both of them to hit peak memory usage at the same time and go OOM. Can you try just running a single instance over and over and checking if max VRAM utilization in torch is increasing?
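For reference, a minimal sketch of that kind of check, assuming the `text_encoder` and the inputs from the repro above (gradients left enabled to match it):

```python
import torch

tokens = torch.zeros((1, 512), device='cuda', dtype=torch.int64)
attention_mask = torch.ones((1, 512), device='cuda')
attention_mask[:, 200:] = 0.0

# Run a single instance repeatedly and watch whether torch's peak memory
# statistics keep climbing across iterations.
torch.cuda.reset_peak_memory_stats()
for i in range(100):
    out = text_encoder(
        tokens,
        attention_mask=attention_mask,
        output_hidden_states=True,
        return_dict=True,
        use_cache=False,
    )
    del out  # nothing from the output is kept
    print(f"step {i}: "
          f"allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"peak={torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
```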
In my real use case it was definitely a leak: VRAM increased slowly, and by iteration ~50 I ran out of VRAM. Note that I was not keeping the output, but saving it to disk and discarding it. When I tried to make a small test case (see above) it OOMed much faster, by step 2. Maybe my reproduction is mistaken, but the issue I wanted to report is a leak.
> Can you try just running a single instance over and over and checking if max VRAM utilization in torch is increasing?
That works, both in the repro code above and in the real use case. The failure only occurs when running it in multiple threads.
Got it - it does seem like a memory leak, in that case, but it seems unlikely that there's anything we can do in Transformers to fix that; we're unlikely to be leaking memory in the modeling code if the single-threaded version works fine. If anyone wants to do a deep dive to investigate if this is Qwen-specific, or if there are certain conditions that trigger it, feel free.
However, if you just want this to work, I'd suggest refactoring so you don't have multiple threads sharing the same GPU like that. Instead, if you want to boost speed you should just use a single-threaded dataloader that loads multiple samples at the same time and then runs them through the network together, then optionally splits them afterwards and yields the outputs one-by-one. Models are generally well-optimized for big single batch tensors, and you won't get the same speedup with two separate models in memory that are running single samples out of sync with each other.
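For illustration, a rough sketch of that batched, single-threaded pattern (the `encode_batch` helper and the batch shapes are placeholders, not from the repro above; it reuses `text_encoder` from the repro):

```python
import torch

def encode_batch(text_encoder, token_batch, mask_batch):
    # One forward pass over the whole batch instead of one call per thread.
    with torch.no_grad():
        out = text_encoder(
            token_batch,                # shape (batch, seq_len)
            attention_mask=mask_batch,  # shape (batch, seq_len)
            output_hidden_states=True,
            return_dict=True,
            use_cache=False,
        )
    # Split the last hidden state back into one tensor per sample.
    return [h.cpu() for h in out.hidden_states[-1].unbind(dim=0)]

# Usage: collate a few samples, encode them together, hand the results out one by one.
token_batch = torch.zeros((4, 512), device='cuda', dtype=torch.int64)
mask_batch = torch.ones((4, 512), device='cuda')
per_sample_hidden = encode_batch(text_encoder, token_batch, mask_batch)
```

A single batched forward pass like this also keeps only one set of activations alive at a time, which avoids the "both threads hit peak memory at once" situation described above.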
It probably comes from `output_hidden_states=True`; I believe the hidden states don't get cleaned up correctly, probably because some other thread keeps a reference to them.
Could you try without it? Also, it would be nice to know exactly which model you're using from Transformers, as your example uses the text encoder from a Diffusers pipeline.
> which model you're using from Transformers, as your example uses the text encoder from a Diffusers pipeline
Qwen3ForCausalLM
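For anyone who wants to reproduce this without Diffusers, a minimal sketch of loading the same Transformers class directly (the checkpoint name is an assumption; any Qwen3 causal-LM checkpoint should exercise the same modeling code):

```python
import torch
from transformers import Qwen3ForCausalLM

# Assumed checkpoint, used here only so the example doesn't depend on Diffusers.
model = Qwen3ForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
).to("cuda")
```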
> It probably comes from `output_hidden_states=True`
I think you're right. I don't think my test case above is an adequate reproduction of the issue, but in the real use case I found the following:
- the leak reproduces again
- additionally, sometimes `text_encoder_output.hidden_states` was `None` for no reason when `output_hidden_states=True`
- there is no leak when `output_hidden_states=False`
- when `output_hidden_states=True` it leaks even if I never access `text_encoder_output.hidden_states`, so I'm 100% sure that I'm not accidentally saving it and thereby causing the leak (a sketch of this kind of comparison follows below)
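A minimal sketch of that `True`/`False` comparison (it reuses `text_encoder` from the repro above; the step count, shapes, and printed statistics are arbitrary choices, not the exact code from the real use case):

```python
import threading
import torch

# Same two-thread setup as the repro, with output_hidden_states as a flag.
# Gradients are left enabled to match the repro above.
def run(output_hidden_states, steps=50):
    tokens = torch.zeros((1, 512), device='cuda', dtype=torch.int64)
    attention_mask = torch.ones((1, 512), device='cuda')
    attention_mask[:, 200:] = 0.0
    for i in range(steps):
        out = text_encoder(
            tokens,
            attention_mask=attention_mask,
            output_hidden_states=output_hidden_states,
            return_dict=True,
            use_cache=False,
        )
        del out  # the output is never kept around
        print(f"hidden_states={output_hidden_states} step={i}: "
              f"allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB")

# Run the False case first, then the True case, each with two threads.
for flag in (False, True):
    threads = [threading.Thread(target=run, args=(flag,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```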
> I'd suggest refactoring so you don't have multiple threads sharing the same GPU
I agree. Calling the text encoder from multiple threads is more a side effect of running a dataloading pipeline that loads images, applies transformations, and does VAE encoding and text encoding in multiple threads. The intention isn't to accelerate just the text encoding with multiple threads; batching would be the right approach for that.