Alternative methods to perform model loading & sharding such that only the sharded model utilizes GPU VRAM
Hi Team, I'm fine-tuning a Falcon-7B model using LoRA. I can see that PR #118, "Use FSDP Everywhere", was made 3 weeks ago. The implementation covers the adapter & adapter_v2 methods but not LoRA.
FSDP implementation
auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
strategy = FSDPStrategy(
    auto_wrap_policy=auto_wrap_policy,
    activation_checkpointing=Block,
    state_dict_type="full",
    limit_all_gathers=True,
)
NotImplementedError raised in the LoRA script
if fabric_devices > 1:
    if tpu:
        # For multi-host TPU training, the device count for Fabric is limited to the count on a single host.
        fabric_devices = "auto"
        strategy = XLAStrategy(sync_module_states=False)
    else:
        raise NotImplementedError
else:
    strategy = "auto"
Can I use the same implementation here too? Also, what are the possible reasons we're not using DeepSpeedStrategy anymore?
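Concretely, something like the following is what I have in mind for the non-TPU branch (a rough sketch only; I'm assuming the Block class used for auto-wrapping can be imported the same way as in the adapter scripts, and that the import path may differ for the LoRA model):
from functools import partial

from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from lit_gpt.model import Block  # assumption: same transformer block class as in the adapter scripts

if fabric_devices > 1:
    # TPU branch unchanged; for GPUs, reuse the adapter scripts' FSDP setup
    # instead of raising NotImplementedError.
    auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
    strategy = FSDPStrategy(
        auto_wrap_policy=auto_wrap_policy,
        activation_checkpointing=Block,
        state_dict_type="full",
        limit_all_gathers=True,
    )
else:
    strategy = "auto"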
I'm working in a Jupyter-like notebook and implemented the ddp_notebook strategy, as FSDP was giving an error in notebook environments.
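For context, this is roughly how Fabric gets constructed in the notebook (simplified; the precision setting and the main entry-point name are my choices, not from the script):
import lightning as L

def main(fabric):
    ...  # the finetune logic from finetune/lora.py goes here

# ddp_notebook spawns notebook-compatible worker processes, but it replicates the
# full model on every GPU rather than sharding it.
fabric = L.Fabric(devices=2, strategy="ddp_notebook", precision="bf16-mixed")
fabric.launch(main)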
Below is a code snippet from the finetune/lora.py main method, which first loads a copy of the model on each of my 2 GPUs (occupying 15 GB each).
config = Config.from_name(name=checkpoint_dir.name, r=lora_r, alpha=lora_alpha, dropout=lora_dropout)
with fabric.init_module(empty_init=False):
    model = GPT(config)
    model.apply(model._init_weights)  # for the LoRA weights
with lazy_load(checkpoint_path) as checkpoint:
    # strict=False because missing keys due to LoRA weights not contained in state dict
    model.load_state_dict(checkpoint, strict=False)
Further, fabric.setup() is sharding the model onto these 2 GPUs, reaching a peak memory usage of 23 GB.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=learning_rate, weight_decay=weight_decay)
model, optimizer = fabric.setup(model, optimizer)
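This is roughly how one can check the peak usage per rank (hypothetical helper, not part of finetune/lora.py):
import torch

def report_peak_memory(fabric, tag):
    # Peak VRAM allocated on this rank since the last reset, in GB.
    peak_gb = torch.cuda.max_memory_allocated(fabric.device) / 1e9
    print(f"rank {fabric.global_rank} | {tag}: peak {peak_gb:.1f} GB")
    torch.cuda.reset_peak_memory_stats(fabric.device)

report_peak_memory(fabric, "after fabric.setup")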
Issue
Both the loaded and the sharded model are using VRAM. Are there any alternative methods to perform this operation such that only the sharded model uses GPU VRAM? I tried moving the loaded model to CPU after the fabric.setup() step, but got the following error while fine-tuning:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
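In case it helps, roughly what the attempt looked like (simplified; get_batch stands in for however the training batch ends up on the GPU):
# Offload the module to CPU after setup, hoping to free the un-sharded copy's VRAM.
model, optimizer = fabric.setup(model, optimizer)
model.cpu()

# The next forward pass fails: the batch lives on cuda:0 while the offloaded
# weights live on the CPU, hence the index_select device-mismatch error above.
input_ids, targets = get_batch(fabric, train_data)
logits = model(input_ids)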
If you use ddp_fork, your model will not be sharded. If you want sharding, I suggest giving up on using a notebook for training, as notebooks cannot support FSDP.
Either way, I don't think fabric.setup() should increase the memory usage here. cc @awaelchli
+1. We won't be able to provide an implementation that works in Jupyter notebooks for training/fine-tuning on multiple GPUs. The scripts are meant to be run directly.
Either way, I don't think fabric.setup() should increase the memory usage here. cc @awaelchli
That's right, unless the model was on CPU and setup() has to move it to the GPU because a GPU was requested.
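To illustrate (a sketch, not verbatim from the scripts):
# Created under init_module, the parameters already live on the strategy's target
# device, so setup() should not need an extra transfer:
with fabric.init_module(empty_init=False):
    model = GPT(config)
model = fabric.setup(model)

# Created outside init_module, the parameters start on CPU and setup() has to
# move them, so GPU memory grows at this point:
model = GPT(config)
model = fabric.setup(model)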
I'll finish #351 shortly (that PR will help consume less CPU memory when setting up the model), and I'll take note of the memory consumption and post it here to give you an idea of what to expect.