
Qwen3ForCausalLM leaks VRAM if used in multiple dataloader threads

Open dxqb opened this issue 2 weeks ago • 3 comments

System Info

torch 2.8.0, transformers==4.56.2 or transformers==4.57.3 (both tested)

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

Please see the repro code below. In this test case it runs out of VRAM within only a few steps; in the real use case it takes about 50 iterations. I hope it's still the same root cause, but I'm not entirely sure.

import threading

import torch
from transformers import Qwen3ForCausalLM  # the pipeline's text encoder is a Qwen3ForCausalLM
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
)

text_encoder = pipe.text_encoder  # Qwen3ForCausalLM
text_encoder.to('cuda')

def run():
    # dummy inputs: 512 token ids, with the attention mask zeroed out past position 200
    tokens = torch.zeros((1, 512), device='cuda', dtype=torch.int64)
    tokens_attention_mask = torch.ones((1, 512), device='cuda')
    tokens_attention_mask[:, 200:] = 0.0

    i = 0
    while True:
        i += 1
        print(i)
        text_encoder_output = text_encoder(
            tokens,
            attention_mask=tokens_attention_mask,
            output_hidden_states=True,
            return_dict=True,
            use_cache=False,
        )


thread1 = threading.Thread(target=run)
thread2 = threading.Thread(target=run) # <--- comment this to see it working with no issues

thread1.start()
thread2.start() # <--- comment this to see it working with no issues

Expected behavior

No leak: VRAM usage should stay roughly constant across iterations, even when the text encoder is called from multiple threads (see the repro above).

dxqb avatar Dec 06 '25 09:12 dxqb

Hmm, is this a memory leak? It feels more like running two instances simultaneously will eventually cause both of them to hit peak memory usage at the same time and go OOM. Can you try just running a single instance over and over and checking if max VRAM utilization in torch is increasing?
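
For reference, a minimal sketch of one way to track that with torch's peak-memory counters (it assumes the text_encoder and the dummy inputs from the repro above are already set up):

import torch

torch.cuda.reset_peak_memory_stats()
for i in range(100):
    text_encoder(
        tokens,
        attention_mask=tokens_attention_mask,
        output_hidden_states=True,
        return_dict=True,
        use_cache=False,
    )
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"step {i}: peak allocated {peak_mib:.1f} MiB")
    # a peak that keeps climbing step after step would indicate retained memory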

Rocketknight1 avatar Dec 08 '25 14:12 Rocketknight1

In my real use case it was definitely a leak: VRAM increased slowly, and by iteration ~50 I ran out of VRAM. Note that I was not keeping the output, but saving it to disk and discarding it. When I tried to make a small test case (see above) it OOMed much faster, by step 2. Maybe my reproduction is off, but the issue I wanted to report is a leak.

Can you try just running a single instance over and over and checking if max VRAM utilization in torch is increasing?

That works, both in the repro code above and in the real use case. The failure only occurs when running it in multiple threads.

dxqb avatar Dec 08 '25 15:12 dxqb

Got it - it does seem like a memory leak, in that case, but it seems unlikely that there's anything we can do in Transformers to fix that; we're unlikely to be leaking memory in the modeling code if the single-threaded version works fine. If anyone wants to do a deep dive to investigate if this is Qwen-specific, or if there are certain conditions that trigger it, feel free.

However, if you just want this to work, I'd suggest refactoring so you don't have multiple threads sharing the same GPU like that. If you want to boost speed, use a single-threaded dataloader that collects multiple samples, runs them through the network together as one batch, and then optionally splits the outputs afterwards and yields them one by one. Models are generally well optimized for large single-batch tensors, and you won't get the same speedup from two separate instances in memory running single samples out of sync with each other.
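
For illustration, a rough sketch of that batched, single-threaded pattern (the function and variable names are placeholders, not an existing API):

import torch

def encode_batch(text_encoder, token_list, mask_list):
    # token_list / mask_list: per-sample (1, seq_len) tensors of equal length,
    # collected by a single-threaded dataloader
    tokens = torch.cat(token_list, dim=0).to('cuda')        # (B, seq_len)
    attention_mask = torch.cat(mask_list, dim=0).to('cuda')
    with torch.no_grad():
        out = text_encoder(
            tokens,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True,
            use_cache=False,
        )
    # split the batched hidden states back into per-sample outputs
    last_hidden = out.hidden_states[-1]                      # (B, seq_len, hidden)
    return [h.unsqueeze(0).cpu() for h in last_hidden]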

Rocketknight1 avatar Dec 08 '25 16:12 Rocketknight1

It probably comes from output_hidden_states=True; I believe the hidden states don't get cleaned up correctly, perhaps because another thread keeps a reference to them. Could you try without it? Also, it would be nice to know exactly which Transformers model you're using, since your example pulls the encoder out of a Diffusers pipeline.
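
For example, the suspect flag can be isolated by running the same loop from the repro with it turned off (a sketch based on the code above):

text_encoder_output = text_encoder(
    tokens,
    attention_mask=tokens_attention_mask,
    output_hidden_states=False,  # only change: don't request all hidden states
    return_dict=True,
    use_cache=False,
)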

Cyrilvallez avatar Dec 12 '25 15:12 Cyrilvallez

what model you're using from Transformers, as your example is using an encoder from a Diffusers model

Qwen3ForCausalLM

It probably comes from output_hidden_states=True

I think you're right. I don't think my test case above is an adequate reproduction of the issue, but in the real use case:

  • I could reproduce the leak again
  • additionally, text_encoder_output.hidden_states was sometimes None for no apparent reason when output_hidden_states=True
  • there is no leak when output_hidden_states=False
  • with output_hidden_states=True it leaks even if I never access text_encoder_output.hidden_states, so I'm 100% sure I'm not accidentally saving it and thereby causing the leak

I'd suggest refactoring so you don't have multiple threads sharing the same GPU

I agree. Calling the text encoder from multiple threads is more of a side effect of running a dataloading pipeline that loads images, applies transformations, and does VAE encoding and text encoding in multiple threads. The intention isn't to accelerate just the text encoding by using multiple threads; batching would be the right approach for that.
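
A rough sketch of that kind of refactor, with the worker threads doing only CPU-side work and a single thread owning all GPU calls (the queue names and worker function here are placeholders, not an existing API; text_encoder is the one from the repro above):

import queue
import threading

import torch

cpu_work_queue = queue.Queue()    # filled by the CPU-side loader/transform threads
gpu_result_queue = queue.Queue()

def gpu_worker(text_encoder):
    # the only thread that touches the text encoder / GPU
    while True:
        tokens, attention_mask = cpu_work_queue.get()
        with torch.no_grad():
            out = text_encoder(
                tokens.to('cuda'),
                attention_mask=attention_mask.to('cuda'),
                output_hidden_states=True,
                return_dict=True,
                use_cache=False,
            )
        gpu_result_queue.put(out.hidden_states[-1].cpu())

threading.Thread(target=gpu_worker, args=(text_encoder,), daemon=True).start()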

dxqb avatar Dec 15 '25 19:12 dxqb