
Stuck on model forward with 100% GPU-Util

Open · nullscc opened this issue 2 years ago · 5 comments

When I run finetune/adapter.py on my dataset with almost no modification (with devices=2), the code gets stuck on logits = model(input_ids, lm_head_chunk_size=128) (https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/adapter.py#L157C13-L157C62); the model() call never finishes, but GPU-Util stays at 100%.

The process can't even be terminated with Ctrl-C.

After replacing validate(fabric, model, val_data, tokenizer, max_iters=2) (https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/adapter.py#L137) with model.train(), I got the following error log:

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
Number of tensors saved during forward: 49
Number of tensors saved during recomputation: 47
    raise CheckpointError(
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
Number of tensors saved during forward: 49
Number of tensors saved during recomputation: 47
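
For context, this error comes from PyTorch's activation checkpointing: it is raised when the recomputed forward pass saves a different set of tensors than the original forward did (e.g. because of non-deterministic control flow or RNG/dropout state). Below is a minimal standalone sketch of the torch.utils.checkpoint API that performs this check; the Block module and sizes are made up for illustration and are not the lit-gpt code:

# Minimal sketch (not the lit-gpt source): activation checkpointing around a
# toy residual block. The non-reentrant implementation (use_reentrant=False)
# is the one that performs the saved-tensor consistency check shown above.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):  # stand-in for a transformer block
    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)


block = Block()
x = torch.randn(1, 16, 64, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # re-runs block() during backward
y.sum().backward()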

The script parameters I used are:

python finetune/adapter.py \
    --data_dir data/xxx \
    --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-chat-hf \
    --out_dir out/adapter/xxx

Any tips would be appreciated.

nullscc avatar Dec 25 '23 23:12 nullscc

After upgrading torch to the 2.3 nightly, the training process runs without problems.

But another problem is that training uses too much GPU memory. With the config:

devices = 2
batch_size = 2 / devices
micro_batch_size = 1

it uses 33944 MiB on each of two NVIDIA A100-SXM4-40GB GPUs.

Is that normal?

nullscc avatar Dec 26 '23 08:12 nullscc

After one validation interval, it gets stuck again. I don't know why.

nullscc avatar Dec 26 '23 11:12 nullscc

@nullscc, what precision are you using (FP64, FP32, FP16, BFLOAT16)?

  • Typically you need 4 bytes per parameter at 32-bit and 2 bytes per parameter at 16-bit precision. In your case, 7B parameters would need about 14 GB in 16-bit, just for the model weights and not the other miscellaneous items (see the rough estimate after this list). Feel free to also visit tutorials/oom.md.
  • The sequence length (token count per sample) in your fine-tuning dataset also adds to the memory needed during fine-tuning. Feel free to check the max sequence length parameter in your prepare-dataset function.
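
As a rough, illustrative back-of-the-envelope check (the numbers below are an assumption for a 7B model, not a measurement of your run):

# Rough, illustrative estimate of weight memory only; activations, gradients,
# optimizer state, and KV caches come on top and usually dominate fine-tuning.
params = 7e9                                  # ~7B parameters for Llama-2-7b
bytes_per_param = {"fp32": 4, "bf16": 2, "fp16": 2}

for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 1024**3
    print(f"{dtype}: ~{gib:.1f} GiB just for the weights")
# -> fp32: ~26.1 GiB, bf16/fp16: ~13.0 GiB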

murdadesmaeeli avatar Dec 29 '23 19:12 murdadesmaeeli

@mehrdad-es Thanks for your reply. I just use the default configuration. My main problem is just like https://github.com/Lightning-AI/lit-gpt/issues/856, but with a big difference: it hangs, seemingly forever.

nullscc avatar Jan 06 '24 08:01 nullscc

@nullscc I've had that before. Could you try a somewhat bigger GPU? It should not hang if memory usage is 33 GB out of 40 GB.
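
If it helps narrow things down, here is a generic PyTorch sketch (not lit-gpt specific) for logging how much memory each device actually has allocated, which is separate from the GPU-Util figure nvidia-smi reports:

# Generic PyTorch sketch (not lit-gpt specific): per-device memory stats.
# GPU-Util in nvidia-smi measures compute activity, not memory; these calls
# report what this process has actually allocated/reserved on each device.
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    print(f"cuda:{i} allocated={allocated:.0f} MiB reserved={reserved:.0f} MiB")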

murdadesmaeeli avatar Jan 27 '24 17:01 murdadesmaeeli