transformers
Lazy loading models on systems with more VRAM than RAM
Feature request
I would like the ability to lazy load models to the GPU using AutoModelForCausalLM.from_pretrained.
At the moment, it is possible to reduce the RAM usage using the low_cpu_mem_usage=True option, but on systems with more VRAM than RAM (like Google Colab, with 12GB RAM and 16GB VRAM), it is not possible to load certain models due to a RAM bottleneck.
Motivation
See above
Your contribution
--
Could you please share a snippet of code that fails on such an env with device_map="auto" passed to from_pretrained? This loads the model directly on the GPU (as long as there is enough space), so this should work for your use case.
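For reference, a minimal sketch of what I mean (the model name is a placeholder, and the max_memory caps are optional with purely illustrative values):
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "your-model-id",
    device_map="auto",
    # Optional: cap how much memory each device may receive.
    max_memory={0: "14GiB", "cpu": "10GiB"},
)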
Sure, here is a snippet that causes an out-of-memory error on Google Colab (the free instance, with 12.7GB RAM and 15GB VRAM):
!pip install -U accelerate transformers
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto')
I have tried every possible combination of .cuda() and low_cpu_mem_usage=True:
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto')
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto').cuda()
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto')
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto').cuda()
In all cases, the RAM usage steadily increases until it passes the 12GB mark and the Colab session crashes. On my machine, this model uses 11653.7 MiB of VRAM and 2605.79 MiB of RAM once fully loaded to the GPU, so in principle it should be possible to load it on Colab.
I think you are missing a torch_dtype=torch.float16 or torch_dtype=torch.bfloat16 to get down to 12GB of use. Otherwise the model will need 24GB of memory if it has 6B parameters (the default torch dtype in PyTorch being float32).
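As a rough back-of-the-envelope check of that estimate (assuming about 6 billion parameters):
num_params = 6e9
print(f"float32: {num_params * 4 / 1e9:.0f} GB")  # ~24 GB (4 bytes per parameter)
print(f"float16: {num_params * 2 / 1e9:.0f} GB")  # ~12 GB (2 bytes per parameter)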
You are correct, both of these allow me to load the model successfully:
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto', torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto', torch_dtype=torch.float16)
But with these, the RAM usage after the model is loaded is very high: 12.2GB out of a total of 12.7GB. This makes the session very unstable and prone to crashing if other libraries are imported.
Is this high RAM usage normal? Can it be avoided?
Can you try to see if adding a garbage collection call helps?
import gc
gc.collect()
There is no reason for the CPU RAM to be used once the model is fully loaded on the GPU.
I did try gc.collect() earlier today and that didn't release the CPU RAM. Now I tried to repeat the experiment just to make sure, and I couldn't even load the model because the
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto', torch_dtype=torch.float16)
call made the Colab session crash after running out of RAM.
After loading the model with the command above, doing this releases the VRAM but not the RAM:
import gc
model = None
gc.collect()
torch.cuda.empty_cache()
This looks exactly like https://github.com/huggingface/transformers/issues/21094. Are these two bugs related?
I've recreated it, report as follows (available_memory returns the % of memory available):
Working as expected (without big model inference, hooks, etc.):
>>> import psutil, torch
>>> from transformers import AutoModelForCausalLM
>>> available_memory = lambda: psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>>> available_memory()
97.8753999829287
>>> model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True)
>>> available_memory()
69.87882027448968
>>> model = None
>>> import gc
>>> gc.collect()
>>> available_memory()
97.28031713868933
Issue:
>>> available_memory()
97.28031713868933
>>> model = AutoModelForCausalLM.from_pretrained(
... "PygmalionAI/pygmalion-6b",
... low_cpu_mem_usage=True,
... device_map='auto',
... torch_dtype=torch.float16
... )
>>> available_memory()
95.77584944795181
>>> model = None
>>> gc.collect()
>>> torch.cuda.empty_cache()
>>> available_memory()
95.73520915357973
Note that basically no memory was released here (on multiple repeated checks, the available-memory reading stayed at 95.77%).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think that lazy loading models would be an important addition to transformers in the context of loading models on Google Colab, but I am not sure how doable it is.
A workaround for now is to reshard the models.
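A minimal resharding sketch, run on a machine that does have enough RAM (the output directory name and the 2GB shard size are arbitrary choices):
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, torch_dtype=torch.float16
)
# Smaller shards reduce the peak RAM needed when the checkpoint is later
# loaded shard by shard with low_cpu_mem_usage=True / device_map="auto".
model.save_pretrained("pygmalion-6b-resharded", max_shard_size="2GB")
The resharded folder can then be uploaded and loaded on Colab as usual.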
Mmm, diving into the reproducer @muellerzr, it looks like memory is not released by PyTorch when moving the model to a device:
import psutil, torch
from transformers import AutoModelForCausalLM
available_memory = lambda: psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
print(available_memory())
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True)
model = model.to(0)
print(available_memory())
del model
import gc
gc.collect()
print(available_memory())
shows no memory is released.
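One way to tell whether this is a real leak or just glibc's allocator holding on to freed pages (an assumption on my part) would be to ask the allocator to give the memory back to the OS:
import ctypes
# On glibc-based Linux (e.g. Colab), malloc_trim(0) asks the allocator to
# return freed memory to the OS. If available_memory() recovers after this,
# the retained RAM is allocator caching rather than live references.
ctypes.CDLL("libc.so.6").malloc_trim(0)
print(available_memory())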
From the discussion, it seems to me that lazy loading is not the only issue. One also wants to garbage collect parts of the state dict that are no longer in use.
For the use case of applying model deltas, this requires streaming out the updated model weights rather than waiting for all the deltas to be applied.
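A rough sketch of that streaming idea, assuming the deltas are saved as shards that mirror the base checkpoint layout (the directory names and file pattern here are hypothetical):
import gc, glob, os
import torch

base_dir, delta_dir, out_dir = "base-model", "model-delta", "merged-model"
os.makedirs(out_dir, exist_ok=True)
for shard_path in sorted(glob.glob(os.path.join(base_dir, "pytorch_model-*.bin"))):
    name = os.path.basename(shard_path)
    shard = torch.load(shard_path, map_location="cpu")
    delta = torch.load(os.path.join(delta_dir, name), map_location="cpu")
    for key in shard:
        shard[key] += delta[key]                    # apply the delta in place
    torch.save(shard, os.path.join(out_dir, name))  # stream the merged shard out
    del shard, delta                                # keep only one shard in RAM
    gc.collect()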