Issues with 4-bit LoRA training
I am trying to fine-tune Pythia in 4-bit mode and recreate h2ogpt-oig-oasst1-512-6_9b with the code in this repo. However, I am experiencing some bugs with your code.
Let me first describe my setup: I am working on 4 p3.2xlarge instances (1 GPU each) with CUDA 11.8 and Python 3.9 installed. On all the nodes (see the shell sketch after this list):
- I clone your repo (git commit 4673faac6), run pip install -r requirements.txt, and set train_4bit to True.
- I git clone https://huggingface.co/EleutherAI/pythia-6.9b and check out the commit f271943e8
- I git clone https://huggingface.co/datasets/h2oai/h2ogpt-oig-oasst1-instruct-cleaned-v1 and check out the commit f3d12c5e4ea
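For reference, the per-node setup as one shell sequence (the h2ogpt repo URL is my assumption of the upstream; the Hugging Face clones need git-lfs to pull the actual weight files):

git clone https://github.com/h2oai/h2ogpt && cd h2ogpt && git checkout 4673faac6
pip install -r requirements.txt
cd ..
git clone https://huggingface.co/EleutherAI/pythia-6.9b && git -C pythia-6.9b checkout f271943e8
git clone https://huggingface.co/datasets/h2oai/h2ogpt-oig-oasst1-instruct-cleaned-v1 && git -C h2ogpt-oig-oasst1-instruct-cleaned-v1 checkout f3d12c5e4ea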
Then I run the fine-tuning procedure by executing this command on all nodes:
NCCL_P2P_LEVEL=LOC WORLD_SIZE=4 CUDA_VISIBLE_DEVICES="0" torchrun --node_rank {node_rank} --nproc_per_node=1 --master_port=1234 --nnodes=4 --master_addr={master_addr} /path/to/h2ogpt/finetune.py --base_model=/path/to/pythia-6.9b --data_path=/path/to/h2ogpt-oig-oasst1-instruct-cleaned-v1/h2ogpt-oig-oasst1-instruct-cleaned-v1.json --prompt_type=plain --run_id=7 --micro_batch_size=8 --batch_size=512 --cutoff_len=512
The issues I am running into:
- I find that the saved adapter_model is empty (only 443 bytes); see the inspection snippet after this list.
- If I comment out the lines that modify the state dict before saving the model, the saved model has a decently large size. This idea is based on this issue.
- Even after that, running export_hf_checkpoint fails: this assert statement trips, indicating the LoRA weights were never actually merged into the base model.
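For anyone triaging the same thing, a quick way to confirm the adapter file is empty (the 443 bytes are just serialization overhead) is a minimal sketch like this, assuming the default adapter_model.bin name; "lora-checkpoint" is a placeholder for the run's output_dir:

import torch

# Load the saved adapter and count what it actually contains.
sd = torch.load("lora-checkpoint/adapter_model.bin", map_location="cpu")
print(f"{len(sd)} tensors, {sum(v.numel() for v in sd.values())} parameters")
# An empty dict here means no LoRA weights were captured at save time.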
Thanks for the hint, I'll see what I can do to fix this.
This seems to work:
diff --git a/finetune.py b/finetune.py
index ba62355..6ab2477 100644
--- a/finetune.py
+++ b/finetune.py
@@ -578,7 +578,7 @@ def train(
assert not trainer.is_model_parallel
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
log("\n If there's a warning about missing keys above, please disregard :)")
torchrun --nproc_per_node=3 finetune.py --data_path=h2oai/openassistant_oasst1_h2ogpt_graded --drop_truncations=True --train_4bit=True --base_model=tiiuae/falcon-7b --micro_batch_size=1 --batch_size=3 --num_epochs=0.3
diff --git a/export_hf_checkpoint.py b/export_hf_checkpoint.py
index 2a6bcfb..0c8ddaa 100644
--- a/export_hf_checkpoint.py
+++ b/export_hf_checkpoint.py
@@ -26,6 +26,10 @@ def do_export():
LORA_WEIGHTS = 'llama-65b-hf.h2oaiopenassistant_oasst1_h2ogpt_graded.1_epochs.113510499324f0f007cbec9d9f1f8091441f2469.3'
OUTPUT_NAME = "h2ogpt-research-oasst1-llama-65b"
+ BASE_MODEL = 'tiiuae/falcon-7b'
+ LORA_WEIGHTS = 'falcon-7b.h2oaiopenassistant_oasst1_h2ogpt_graded.0.3_epochs.4673faac67ed27987d4d5a0dddcef43e2064f7e6.0'
+ OUTPUT_NAME = "falcon7"
+
llama_type = "llama" in BASE_MODEL
as_pytorch = False # False -> HF
python export_hf_checkpoint.py
Confirmed that the resulting model works:
python generate.py --base_model=falcon7 --prompt_type=human_bot
However, the intermediate checkpoint adapters are still 443 bytes; they were not affected by this change.
Merged dea3fdd9f44ed585d9a89b678dffb9411b055494
Checking this version:
(base) arno@rippa:/nfs4/llm/h2ogpt(main)$ git diff
diff --git a/finetune.py b/finetune.py
index 6ab2477..ba45854 100644
--- a/finetune.py
+++ b/finetune.py
@@ -559,13 +559,13 @@ def train(
)
model.config.use_cache = False
- old_state_dict = model.state_dict
- from peft import get_peft_model_state_dict
-
- model.state_dict = (
- lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
- ).__get__(model, type(model))
-
+ # old_state_dict = model.state_dict
+ # from peft import get_peft_model_state_dict
+ #
+ # model.state_dict = (
+ # lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
+ # ).__get__(model, type(model))
+ #
if torch.__version__ >= "2" and sys.platform != "win32":
model = torch.compile(model)
# WIP (not generally replacing layers until pytorch 2.1)
@@ -578,7 +578,7 @@ def train(
assert not trainer.is_model_parallel
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
- model.save_pretrained(output_dir, state_dict=old_state_dict())
+ model.save_pretrained(output_dir)
log("\n If there's a warning about missing keys above, please disregard :)")
Now checkpoints contain valid adapter files.
Merging also worked fine, so this is an even better fix.
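For context, with the state-dict monkey-patch gone, export goes through PEFT's standard load-and-merge path. A minimal sketch of that flow (a sketch only, not the exact export_hf_checkpoint.py code; the adapter directory name is taken from the diff above):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, attach the trained LoRA adapter, then fold the
# adapter weights into the base weights and save a plain HF checkpoint.
base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "falcon-7b.h2oaiopenassistant_oasst1_h2ogpt_graded.0.3_epochs.4673faac67ed27987d4d5a0dddcef43e2064f7e6.0")
merged = model.merge_and_unload()  # on older PEFT this forwards to model.base_model
merged.save_pretrained("falcon7")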
This was working with PEFT commit 3714aa2fff158fdfa637b2b65952580801d890b2.
Using the latest version of PEFT (189a6b8e357, as per the main branch), I can repro your error.
It works again with PEFT commit 0b62b4378b4ce9367932c73540349da9a41bdea8.
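If anyone else hits this, pinning PEFT to a known-good commit is the easy workaround, e.g.:

pip install git+https://github.com/huggingface/peft.git@0b62b4378b4ce9367932c73540349da9a41bdea8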