Alternative methods to perform model loading & sharding such that only the sharded model utilizes GPU VRAM
Hi Team, I'm fine-tuning a Falcon-7B model using LoRA. I can see that PR #118, "Use FSDP Everywhere", was made 3 weeks ago. The implementation covers the adapter & adapter_v2 methods but not LoRA.
FSDP implementation
auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
strategy = FSDPStrategy(
    auto_wrap_policy=auto_wrap_policy,
    activation_checkpointing=Block,
    state_dict_type="full",
    limit_all_gathers=True,
)
NotImplementedError raised in the LoRA script
if fabric_devices > 1:
    if tpu:
        # For multi-host TPU training, the device count for Fabric is limited to the count on a single host.
        fabric_devices = "auto"
        strategy = XLAStrategy(sync_module_states=False)
    else:
        raise NotImplementedError
else:
    strategy = "auto"
Can I use the same implementation here too? Also, what are the possible reasons we're not using DeepSpeedStrategy anymore?
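Concretely, something like the following is what I have in mind for the non-TPU branch (a rough sketch only; I'm assuming the Block class used for auto-wrapping can be imported the same way as in the adapter scripts, and that the import path may differ for the LoRA model):
from functools import partial

from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from lit_gpt.model import Block  # assumption: same transformer block class as in the adapter scripts

if fabric_devices > 1:
    # TPU branch unchanged; for GPUs, reuse the adapter scripts' FSDP setup
    # instead of raising NotImplementedError.
    auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
    strategy = FSDPStrategy(
        auto_wrap_policy=auto_wrap_policy,
        activation_checkpointing=Block,
        state_dict_type="full",
        limit_all_gathers=True,
    )
else:
    strategy = "auto"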
I'm working in a Jupyter-like notebook and implemented the ddp_notebook strategy, as FSDP was giving an error in notebook environments.
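For context, this is roughly how Fabric gets constructed in the notebook (simplified; the precision setting and the main entry-point name are my choices, not from the script):
import lightning as L

def main(fabric):
    ...  # the finetune logic from finetune/lora.py goes here

# ddp_notebook spawns notebook-compatible worker processes, but it replicates the
# full model on every GPU rather than sharding it.
fabric = L.Fabric(devices=2, strategy="ddp_notebook", precision="bf16-mixed")
fabric.launch(main)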
Below is a code snippet from the finetune/lora.py main method, which first loads a copy of the model on each of my 2 GPUs (occupying 15 GB each).
config = Config.from_name(name=checkpoint_dir.name, r=lora_r, alpha=lora_alpha, dropout=lora_dropout)
with fabric.init_module(empty_init=False):
    model = GPT(config)
    model.apply(model._init_weights)  # for the LoRA weights
with lazy_load(checkpoint_path) as checkpoint:
    # strict=False because missing keys due to LoRA weights not contained in state dict
    model.load_state_dict(checkpoint, strict=False)
Further, fabric.setup() is sharding the model onto these 2 GPUs, reaching a peak memory usage of 23 GB.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=learning_rate, weight_decay=weight_decay)
model, optimizer = fabric.setup(model, optimizer)
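This is roughly how one can check the peak usage per rank (hypothetical helper, not part of finetune/lora.py):
import torch

def report_peak_memory(fabric, tag):
    # Peak VRAM allocated on this rank since the last reset, in GB.
    peak_gb = torch.cuda.max_memory_allocated(fabric.device) / 1e9
    print(f"rank {fabric.global_rank} | {tag}: peak {peak_gb:.1f} GB")
    torch.cuda.reset_peak_memory_stats(fabric.device)

report_peak_memory(fabric, "after fabric.setup")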
Issue
Both the loaded and the sharded model are using VRAM. Are there any alternative methods to perform this operation such that only the sharded model uses GPU VRAM? I tried moving the loaded model to CPU after the fabric.setup() step, but got the following error while fine-tuning:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
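In case it helps, roughly what the attempt looked like (simplified; get_batch stands in for however the training batch ends up on the GPU):
# Offload the module to CPU after setup, hoping to free the un-sharded copy's VRAM.
model, optimizer = fabric.setup(model, optimizer)
model.cpu()

# The next forward pass fails: the batch lives on cuda:0 while the offloaded
# weights live on the CPU, hence the index_select device-mismatch error above.
input_ids, targets = get_batch(fabric, train_data)
logits = model(input_ids)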
If you use ddp_fork, your model will not be sharded. If you want sharding, I suggest giving up on using a notebook for training, as notebooks cannot support FSDP.
Either way, I don't think fabric.setup() should increase the memory usage here. cc @awaelchli
+1. We won't be able to provide an implementation that works in Jupyter notebooks for training/fine-tuning on multiple GPUs. The scripts are meant to be run directly.
Either way, I don't think fabric.setup() should increase the memory usage here. cc @awaelchli
That's right, unless the model was on CPU and setup() has to move it to the GPU because a GPU was requested.
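To illustrate (a sketch, not verbatim from the scripts):
# Created under init_module, the parameters already live on the strategy's target
# device, so setup() should not need an extra transfer:
with fabric.init_module(empty_init=False):
    model = GPT(config)
model = fabric.setup(model)

# Created outside init_module, the parameters start on CPU and setup() has to
# move them, so GPU memory grows at this point:
model = GPT(config)
model = fabric.setup(model)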
I'll finish #351 shortly (that PR will help consume less CPU memory when setting up the model), and I'll take note of the memory consumption and post it here to give you an idea of what to expect.