
Lazy loading models on systems with more VRAM than RAM

Open oobabooga opened this issue 1 year ago • 11 comments

Feature request

I would like the ability to lazy load models to the GPU using AutoModelForCausalLM.from_pretrained.

At the moment, it is possible to reduce the RAM usage using the low_cpu_mem_usage=True option, but on systems with more VRAM than RAM (like Google Colab with 12GB RAM and 16GB VRAM), it is not possible to load certain models due to a RAM bottleneck.
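
Concretely, the goal would be for a call along these lines to stream the weights straight to the GPU without ever holding the full checkpoint in CPU RAM (a rough sketch; the memory limits are only illustrative Colab-like numbers):

import torch
from transformers import AutoModelForCausalLM

# Desired behaviour: weights go straight to the GPU and are never fully
# materialised in CPU RAM. The max_memory values below are only examples.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    device_map="auto",
    low_cpu_mem_usage=True,
    max_memory={0: "15GiB", "cpu": "10GiB"},  # more VRAM than RAM
)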

Motivation

See above

Your contribution

--

oobabooga avatar Jan 30 '23 18:01 oobabooga

Could you please share a snippet of code that fails on such an environment with device_map="auto" passed to from_pretrained? This loads the model directly on the GPU (as long as there is enough space), so it should work for your use case.

sgugger avatar Jan 30 '23 19:01 sgugger

Sure, here is a snippet that causes an out-of-memory error on Google Colab (the free instance with 12.7GB RAM and 15GB VRAM):

!pip install -U accelerate transformers

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto')

I have tried every possible combination of .cuda() and low_cpu_mem_usage=True:

model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto')
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto').cuda()
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto')
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto').cuda()

In all cases, the RAM usage steadily increases until it passes the 12GB mark and the Colab session crashes. On my machine, this model uses 11653.7 MiB of VRAM and 2605.79 MiB of RAM once fully loaded onto the GPU, so in principle it should be possible to load it on Colab.

oobabooga avatar Jan 30 '23 20:01 oobabooga

I think you are missing a torch_dtype=torch.float16 or torch_dtype=torch.bfloat16 to get down to 12GB of memory use. Otherwise the model needs 24GB of memory for its 6B parameters (the default dtype in PyTorch being float32).
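
As a rough back-of-the-envelope check (the parameter count here is approximate):

# Approximate memory needed just for the weights, ignoring activations and overhead.
n_params = 6e9                 # ~6B parameters
print(n_params * 4 / 2**30)    # float32: ~22.4 GiB
print(n_params * 2 / 2**30)    # float16/bfloat16: ~11.2 GiB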

sgugger avatar Jan 30 '23 20:01 sgugger

You are correct, both of these allow me to load the model successfully:

import torch

model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto', torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", device_map='auto', torch_dtype=torch.float16)

But with these, the RAM usage after the model is loaded is very high: 12.2GB out of a total of 12.7GB. This makes the session very unstable and prone to crashing if other libraries are imported.

Is this high RAM usage normal? Can it be avoided?

oobabooga avatar Jan 30 '23 21:01 oobabooga

Can you try to see if adding a garbage collection call helps?

import gc

gc.collect()

There is no reason for the CPU RAM to be used once the model is fully loaded on the GPU.

sgugger avatar Jan 30 '23 21:01 sgugger

I did try gc.collect() earlier today and that didn't release the CPU RAM. Now I tried to repeat the experiment just to make sure, and I couldn't even load the model because the

model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True, device_map='auto', torch_dtype=torch.float16)

call made the Colab session crash after running out of RAM.

oobabooga avatar Jan 31 '23 01:01 oobabooga

After loading the model with the command above, doing this releases the VRAM but not the RAM:

import gc

model = None
gc.collect()
torch.cuda.empty_cache()

This looks exactly like https://github.com/huggingface/transformers/issues/21094. Are these two bugs related?

oobabooga avatar Feb 02 '23 17:02 oobabooga

I've recreated it, report as follows:

(available_memory returns the % of memory available)

Working as expected (w/o big model inference, hooks, etc)

>>> import psutil, torch
>>> from transformers import AutoModelForCausalLM
>>> available_memory = lambda: psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>>> available_memory()
97.8753999829287
>>> model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True)
>>> available_memory()
69.87882027448968
>>> model = None
>>> import gc
>>> gc.collect()
>>> available_memory()
97.28031713868933

Issue:

>>> available_memory()
97.28031713868933
>>> model = AutoModelForCausalLM.from_pretrained(
...     "PygmalionAI/pygmalion-6b", 
...     low_cpu_mem_usage=True, 
...     device_map='auto', 
...     torch_dtype=torch.float16
... )
>>> available_memory()
95.77584944795181
>>> model = None
>>> gc.collect()
>>> torch.cuda.empty_cache()
>>> available_memory()
95.73520915357973

Note that basically no memory was released here (on repeated checks, the available memory stayed around 95.7%).
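
One rough way to check whether stray Python references are keeping tensors alive after the model is deleted (a generic debugging sketch, not specific to transformers):

import gc
import torch

# List any tensors that are still reachable after `del model; gc.collect()`.
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj):
            print(type(obj), obj.device, tuple(obj.shape))
    except Exception:
        # Some tracked objects raise during inspection; skip them.
        pass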

muellerzr avatar Feb 13 '23 20:02 muellerzr

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 10 '23 15:03 github-actions[bot]

I think that lazy loading models would be an important addition to transformers in the context of loading models to Google Colab, but I am not sure how doable it is.

A workaround for now is to reshard the models.
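
For reference, resharding can be done on a machine with enough RAM along these lines (the 2GB shard size is only an example):

import torch
from transformers import AutoModelForCausalLM

# Load once on a machine with enough RAM, then write smaller shards that
# fit within Colab's memory budget when loaded one at a time.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
)
model.save_pretrained("pygmalion-6b-sharded", max_shard_size="2GB")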

oobabooga avatar Mar 10 '23 15:03 oobabooga

Mmm, diving into the reproducer from @muellerzr, it looks like memory is not released by PyTorch when moving the model to a device:

import psutil, torch
from transformers import AutoModelForCausalLM
available_memory = lambda: psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
print(available_memory())

model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", low_cpu_mem_usage=True)
model = model.to(0)
print(available_memory())

del model
import gc
gc.collect()
print(available_memory())

shows no memory is released.

sgugger avatar Mar 10 '23 19:03 sgugger

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 04 '23 15:04 github-actions[bot]

From the discussion, it seems to me that lazy loading is not the only issue. One also wants to garbage collect parts of the state dict that are no longer in use.

For the use-case of applying model deltas, this requires streaming out the updated model weights rather than waiting for all the deltas to be applied.
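
Roughly, something like the following sketch, where the shard file names and the delta checkpoint are placeholders:

import gc
import torch

# Placeholder file names; a real checkpoint would list its shards in an index file.
shard_files = ["pytorch_model-00001-of-00002.bin", "pytorch_model-00002-of-00002.bin"]
delta = torch.load("delta.bin", map_location="cpu")

for shard_file in shard_files:
    shard = torch.load(shard_file, map_location="cpu")
    for name in shard:
        if name in delta:
            shard[name] = shard[name] + delta[name]
    # Stream the updated shard back out before touching the next one,
    # so only one shard's tensors are resident in RAM at a time.
    torch.save(shard, "updated-" + shard_file)
    del shard
    gc.collect()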

jon-chuang avatar Apr 12 '23 01:04 jon-chuang

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 06 '23 15:05 github-actions[bot]