Stuck on model forward with 100% GPU-Util
When I run finetune/adapter.py on my dataset with almost no modifications (with devices=2), the code gets stuck on logits = model(input_ids, lm_head_chunk_size=128) (https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/adapter.py#L157C13-L157C62). The model() call never finishes, but GPU-Util stays at 100%.
The process can't even be terminated with Ctrl-C.
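In case it helps with diagnosing the hang: a faulthandler dump shows where each rank is blocked without having to kill it. This is purely my own debugging addition, not part of adapter.py:

```python
import faulthandler
import signal

# Debugging aid (my addition, not in finetune/adapter.py): register this near the
# top of the script. Sending SIGUSR1 to a hung rank then prints the Python stack
# of every thread to stderr, showing whether it is blocked inside the model
# forward, a collective, data loading, etc. It works even when Ctrl-C does not,
# because faulthandler installs a raw C-level signal handler.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

With that in place, kill -USR1 <pid> from another shell dumps the stacks without terminating the process.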
After replacing validate(fabric, model, val_data, tokenizer, max_iters=2) (https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/adapter.py#L137) with model.train(), the run fails with the following error log:
raise CheckpointError(
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
Number of tensors saved during forward: 49
Number of tensors saved during recomputation: 47
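For what it's worth, this error comes from PyTorch's non-reentrant activation checkpointing, which requires the recomputed forward to save exactly the same tensors as the original forward. A minimal, self-contained toy repro of the same failure mode (my own example, not lit-gpt code) is a module that takes a different code path when it is recomputed:

```python
import torch
from torch.utils.checkpoint import checkpoint

class FlakyBlock(torch.nn.Module):
    """Toy module whose forward takes a shorter path when recomputed,
    so checkpointing saves fewer tensors the second time around."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        if self.calls == 1:
            # Original forward: the extra ReLU activation is saved for backward.
            return torch.relu(self.linear(x))
        # Recomputation: fewer ops, fewer saved tensors -> count mismatch.
        return self.linear(x)

block = FlakyBlock()
x = torch.randn(2, 8, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)
# Raises CheckpointError: "A different number of tensors was saved during the
# original forward and recomputation."
out.sum().backward()
```

Nothing in adapter.py is that blatant, of course, but anything that makes the forward differ between the original pass and the recomputation (dropout/RNG state, train vs eval mode flips, data-dependent branches) can produce the same mismatch.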
The command I used to run the script is:
python finetune/adapter.py \
--data_dir data/xxx \
--checkpoint_dir checkpoints/meta-llama/Llama-2-7b-chat-hf \
--out_dir out/adapter/xxx
Any tips will be appreciated.
After upgrading torch to the 2.3 nightly, the training process runs without problems.
But another problem is that training uses too much GPU memory. With the config:
devices = 2
batch_size = 2 / devices
micro_batch_size = 1
it uses 33944 MiB on each of the two NVIDIA A100-SXM4-40GB GPUs.
Is that normal?
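For context, here is roughly how those values combine in adapter.py; the derived quantities are my reading of the script's defaults, so treat this as a sketch:

```python
# Effective batch settings for the config above (sketch, mirroring how the
# lit-gpt finetune scripts derive gradient accumulation from batch_size).
devices = 2
batch_size = 2 // devices                                      # 1 sample per optimizer step, per device
micro_batch_size = 1                                           # 1 sample per forward/backward pass
gradient_accumulation_iters = batch_size // micro_batch_size   # = 1, i.e. no accumulation
global_batch_size = micro_batch_size * gradient_accumulation_iters * devices  # = 2

print(gradient_accumulation_iters, global_batch_size)  # 1 2
```

So there is no gradient accumulation here, and each optimizer step sees a global batch of 2.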
After one validation interval, it gets stuck again, and I don't know why.
@nullscc, what precision are you using (FP64, FP32, FP16, BFLOAT16)?
- Typically you need about 4 bytes per parameter at 32-bit and 2 bytes per parameter at 16-bit precision. In your case, 7B parameters would need roughly 14 GB in 16-bit just for the model weights, not counting the other miscellaneous items (rough arithmetic below). Feel free to also visit tutorials/oom.md.
- The token count in your fine-tuning dataset also adds to the memory needed during fine-tuning. Feel free to check the max sequence length parameter in your prepare-dataset function.
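A back-of-the-envelope version of that estimate (ballpark numbers only; activations, adapter gradients, optimizer state, and framework overhead are not modeled):

```python
# Rough memory estimate for the Llama-2-7B weights alone in 16-bit precision.
params = 7e9                     # ~7 billion parameters (approximate)
bytes_per_param = 2              # FP16 / BF16: 2 bytes per parameter
weights_gib = params * bytes_per_param / 2**30
print(f"weights alone: ~{weights_gib:.1f} GiB")  # ~13.0 GiB

# The gap up to the observed ~33 GiB per GPU would come from activations (which
# scale with micro_batch_size * max_seq_length), gradients and optimizer state
# for the trainable adapter weights, and CUDA/framework overhead.
```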
@mehrdad-es Thanks for your reply; I just use the default configuration. My main problem looks a lot like https://github.com/Lightning-AI/lit-gpt/issues/856, but with one big difference: in my case it hangs seemingly forever.
@nullscc I've had that before. Could you try a slightly bigger GPU? Though it shouldn't hang when memory usage is only 33 GB out of 40 GB.