
Issues with 4-bit LORA training

ma1112 opened this issue

I am trying to fine-tune Pythia in 4-bit mode and recreate h2ogpt-oig-oasst1-512-6_9b with the code in this repo. However, I am running into some bugs with your code.

Let me first describe my setup. I am working on 4 p3.2xlarge instances (1 GPU each) with CUDA 11.8 and Python 3.9 installed. On all the nodes:

  • I clone your repo (git commit 4673faac6), run pip install -r requirements.txt, and set train_4_bit to True.
  • I git clone https://huggingface.co/EleutherAI/pythia-6.9b and check out the commit f271943e8
  • I git clone https://huggingface.co/datasets/h2oai/h2ogpt-oig-oasst1-instruct-cleaned-v1 and check out the commit f3d12c5e4ea

Then I run the fine-tuning procedure by executing this command on all nodes:

NCCL_P2P_LEVEL=LOC WORLD_SIZE=4 CUDA_VISIBLE_DEVICES="0" torchrun --node_rank {node_rank} --nproc_per_node=1 --master_port=1234 --nnodes=4 --master_addr={master_addr} /path/to/h2ogpt/finetune.py --base_model=/path/to/pythia-6.9b --data_path=/path/to/h2ogpt-oig-oasst1-instruct-cleaned-v1/h2ogpt-oig-oasst1-instruct-cleaned-v1.json --prompt_type=plain --run_id=7 --micro_batch_size=8 --batch_size=512 --cutoff_len=512

The issues I am running into:

  • I find that the saved adapter_model is empty (only 443 bytes); see the check in the snippet after this list.
  • If I comment out the lines that modify the state dict before saving the model, the saved model has a reasonably large size. This idea is based on this issue.
  • Even after that, export_hf_checkpoint fails to run: this assert statement fails, indicating that the LoRA weights were never actually merged into the base model.
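
For reference, this is a quick way to check whether the saved adapter actually contains any LoRA tensors. It is a small sketch: the path is a placeholder, and I'm assuming PEFT's default adapter_model.bin file name.

```python
import torch

# Placeholder path to the training output directory; adapter_model.bin is PEFT's default adapter file name.
sd = torch.load("/path/to/output_dir/adapter_model.bin", map_location="cpu")
print(len(sd))                               # 0 here, which matches the ~443-byte file
print([k for k in sd if "lora" in k][:5])    # expect lora_A / lora_B keys when saving works
```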

ma1112 avatar Jun 20 '23 08:06 ma1112

Thanks for the hint, I'll see what I can do to fix this.

arnocandel avatar Jun 20 '23 21:06 arnocandel

This seems to work:

diff --git a/finetune.py b/finetune.py
index ba62355..6ab2477 100644
--- a/finetune.py
+++ b/finetune.py
@@ -578,7 +578,7 @@ def train(
         assert not trainer.is_model_parallel
     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
 
-    model.save_pretrained(output_dir)
+    model.save_pretrained(output_dir, state_dict=old_state_dict())
 
     log("\n If there's a warning about missing keys above, please disregard :)")

torchrun --nproc_per_node=3 finetune.py --data_path=h2oai/openassistant_oasst1_h2ogpt_graded --drop_truncations=True --train_4bit=True --base_model=tiiuae/falcon-7b --micro_batch_size=1 --batch_size=3 --num_epochs=0.3
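
For context on why passing state_dict=old_state_dict() helps: finetune.py overrides model.state_dict via get_peft_model_state_dict (the block that gets commented out in the follow-up below), and with the newer PEFT versions discussed further down that override apparently yields an empty dict, hence the ~443-byte adapter. Below is a minimal, self-contained illustration of what that extraction is supposed to return, using a small stand-in model rather than the actual h2ogpt setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict

# Small stand-in model so the example runs quickly; query_key_value is the GPT-NeoX attention projection.
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
model = get_peft_model(base, LoraConfig(r=8, target_modules=["query_key_value"]))

# This is what the state_dict override in finetune.py feeds to save_pretrained(): only the LoRA tensors.
adapter_sd = get_peft_model_state_dict(model, model.state_dict())
print(len(adapter_sd))        # > 0 when extraction works; an empty dict means a ~443-byte adapter_model.bin
print(list(adapter_sd)[:2])   # keys ending in ...query_key_value.lora_A.weight / lora_B.weight
```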

diff --git a/export_hf_checkpoint.py b/export_hf_checkpoint.py
index 2a6bcfb..0c8ddaa 100644
--- a/export_hf_checkpoint.py
+++ b/export_hf_checkpoint.py
@@ -26,6 +26,10 @@ def do_export():
     LORA_WEIGHTS = 'llama-65b-hf.h2oaiopenassistant_oasst1_h2ogpt_graded.1_epochs.113510499324f0f007cbec9d9f1f8091441f2469.3'
     OUTPUT_NAME = "h2ogpt-research-oasst1-llama-65b"
 
+    BASE_MODEL = 'tiiuae/falcon-7b'
+    LORA_WEIGHTS = 'falcon-7b.h2oaiopenassistant_oasst1_h2ogpt_graded.0.3_epochs.4673faac67ed27987d4d5a0dddcef43e2064f7e6.0'
+    OUTPUT_NAME = "falcon7"
+
     llama_type = "llama" in BASE_MODEL
     as_pytorch = False  # False -> HF

python export_hf_checkpoint.py
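
For reference, the merge step here is essentially the standard PEFT pattern. The sketch below is based on PEFT's public API (PeftModel.from_pretrained followed by merge_and_unload) rather than being a literal excerpt from export_hf_checkpoint.py.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "tiiuae/falcon-7b"
LORA_WEIGHTS = "falcon-7b.h2oaiopenassistant_oasst1_h2ogpt_graded.0.3_epochs.4673faac67ed27987d4d5a0dddcef43e2064f7e6.0"
OUTPUT_NAME = "falcon7"

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, LORA_WEIGHTS)  # load the saved adapter on top of the base model
model = model.merge_and_unload()                       # folds the LoRA deltas into the base weights
model.save_pretrained(OUTPUT_NAME)                     # tokenizer files still need to be copied separately
```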

Confirmed that the resulting model works: python generate.py --base_model=falcon7 --prompt_type=human_bot

However, the checkpoint adapters are still 443 bytes; they weren't affected by this change.

arnocandel avatar Jun 20 '23 21:06 arnocandel

Merged dea3fdd9f44ed585d9a89b678dffb9411b055494

arnocandel avatar Jun 20 '23 22:06 arnocandel

Checking this version:

(base) arno@rippa:/nfs4/llm/h2ogpt(main)$ git diff
diff --git a/finetune.py b/finetune.py
index 6ab2477..ba45854 100644
--- a/finetune.py
+++ b/finetune.py
@@ -559,13 +559,13 @@ def train(
     )
     model.config.use_cache = False
 
-    old_state_dict = model.state_dict
-    from peft import get_peft_model_state_dict
-
-    model.state_dict = (
-        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
-    ).__get__(model, type(model))
-
+    # old_state_dict = model.state_dict
+    # from peft import get_peft_model_state_dict
+    #
+    # model.state_dict = (
+    #     lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
+    # ).__get__(model, type(model))
+    #
     if torch.__version__ >= "2" and sys.platform != "win32":
         model = torch.compile(model)
         # WIP (not generally replacing layers until pytorch 2.1)
@@ -578,7 +578,7 @@ def train(
         assert not trainer.is_model_parallel
     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
 
-    model.save_pretrained(output_dir, state_dict=old_state_dict())
+    model.save_pretrained(output_dir)
 
     log("\n If there's a warning about missing keys above, please disregard :)")

Now the checkpoints contain valid adapter files.

Merging also worked fine, so this is an even better fix.
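
For a quick sanity check that doesn't require a full merge, the adapter can also be loaded directly on top of the base model. This is a sketch with placeholder paths; the prompt format is only an example, not necessarily exactly what generate.py uses.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "/path/to/pythia-6.9b"       # placeholder: the base model used for fine-tuning
ADAPTER = "/path/to/output_dir"     # placeholder: the directory holding the saved adapter files

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(base, ADAPTER)  # loads adapter_config.json plus the adapter weights

prompt = "<human>: What is H2O.ai?\n<bot>:"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```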

arnocandel avatar Jun 20 '23 22:06 arnocandel

This was working with PEFT 3714aa2fff158fdfa637b2b65952580801d890b2. Using the latest version of PEFT (189a6b8e357, as per main branch), I can repro your error. It works again with 0b62b4378b4ce9367932c73540349da9a41bdea8.
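
For anyone hitting this before a fix lands in requirements, pinning PEFT to the known-good commit should work, e.g. pip install git+https://github.com/huggingface/peft.git@0b62b4378b4ce9367932c73540349da9a41bdea8 (the commit referenced above).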

arnocandel avatar Jun 21 '23 02:06 arnocandel