h2ogpt
Issues with Upgraded Pip Libs for LoRA Weights in 4-bit and 8-bit Training
Dear Support Team,
I recently upgraded my pip libraries, including transformers, peft, accelerate, and bitsandbytes, to support 4-bit training as opposed to the original 8-bit training. After doing so, I successfully completed the finetune process, but discovered that loading the LoRA weights had no effect on the vanilla model.
To investigate further, I retrained the 8-bit model with the older library versions and then attempted the same 8-bit training with the new versions. With the new versions I only obtained the vanilla 8-bit model's answers, which were not comparable to those from the older 8-bit run. I checked peft_model.py and the save_and_load functions for adapters for differences, but didn't find anything useful. This leads me to believe that the bug is located elsewhere, and I am not able to pinpoint its exact location.
Unfortunately, I cannot compare the two 4-bit versions directly since the older version does not support 4-bit training.
My suspicion is that there may be a low-level API issue or another aspect preventing the model weights from loading properly in the newer versions. Alternatively, it is possible that the adapter LoRA weights are not training at all during the process.
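One hedged way to rule out the second possibility (a hypothetical sketch, not code from h2ogpt; the helper name is my own) is to check that the PEFT-wrapped model actually exposes non-empty, trainable LoRA tensors before training starts:

# Hypothetical helper, not part of h2ogpt: sanity-check a PEFT-wrapped model.
from peft import get_peft_model_state_dict

def check_lora_is_training(peft_model):
    # The adapter state dict should contain the lora_A / lora_B tensors; 0 means nothing will be saved.
    lora_state = get_peft_model_state_dict(peft_model)
    print(f"LoRA tensors in adapter state dict: {len(lora_state)}")
    # When training an adapter, only the LoRA tensors should require gradients.
    trainable = [n for n, p in peft_model.named_parameters() if p.requires_grad]
    print(f"Trainable tensors: {len(trainable)}, sample: {trainable[:3]}")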
According to a suggestion I found online, we need to call .cuda() on the model after loading the pretrained weights in finetune.py, and load the LoRA weights only after quantization (.cuda()) in generate.py, if that helps at all.
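For reference, that ordering would look roughly like this (a sketch with placeholder paths, assuming the usual transformers/peft APIs; the actual finetune.py/generate.py code may differ):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load and quantize/place the base model first (8-bit here; 4-bit needs the newer bitsandbytes).
base = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-30b-hf",
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Only then attach the LoRA adapter on top of the quantized model.
model = PeftModel.from_pretrained(base, "/path/to/lora_weights")  # dir containing adapter_model.bin
model.eval()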
I would appreciate any assistance or insights you could provide to help resolve these issues.
The newer versions of the libraries (from requirements_optional_4bit.txt):
bitsandbytes==0.39.0
transformers @ git+https://github.com/huggingface/transformers.git@17a55534f5e5df10ac4804d4270bf6b8cc24998d
accelerate @ git+https://github.com/huggingface/accelerate.git@7d24bdefb5b3252505151d8c1ac0efbed3574857
peft @ git+https://github.com/huggingface/peft.git@3714aa2fff158fdfa637b2b65952580801d890b2
The older versions of the libraries (from requirements.txt; these actually work well, but have no 4-bit QLoRA):
bitsandbytes==0.38.1
transformers==4.28.1
accelerate==0.18.0
peft @ git+https://github.com/huggingface/peft.git@098962fa6515f2e4fe83a757f5995d3ffbb1c373
I opened another bug in PEFT (https://github.com/huggingface/peft/issues/512) in case the bug does not concern h2ogpt at all.
Thanks. @arnocandel might have some insights, as he did 4-bit training recently, while I've not done that yet.
We have had an issue in the past where the LoRA state was only saved in the checkpoints, and the final saved model was essentially some kind of empty shell. I had to copy the checkpoint model over as the adapter model and use that instead. I'm unsure if that was ever resolved/understood, @arnocandel?
Yes, that is the problem! I found out the dict is empty when I check the adapter weights. How did you fix it specifically? I'll try to replicate it. In the checkpoints there is a difference between adapter_model.bin and pytorch_model.bin, so I am not sure how to turn the checkpoint into the final adapter_model.bin. Thanks @pseudotensor
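(For anyone checking the same thing, the quickest test is to load the adapter file directly; the path below is a placeholder.)

import torch

# An "empty shell" adapter_model.bin loads as a dict with no LoRA tensors in it.
state = torch.load("output_dir/adapter_model.bin", map_location="cpu")
print(len(state))              # 0 => empty adapter
print(list(state.keys())[:5])  # expect keys containing 'lora_A' / 'lora_B'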
@orellavie1212 I just ensured there were checkpoints every so often. Then I copied the last checkpoint's torch model over to adapter_model.bin.
E.g.:
jon@pseudotensor:/data/jon/snap_llama3/llama-30b-hf.h2oaiopenassistant_oasst1_h2ogpt.8.0_epochs.31eef248d53c9f39e51c60b8b030c1e3cafc34b0.llama30b_7$ ls -alrt
total 798944
drwx------ 3 jon jon 4096 Apr 26 22:28 runs/
drwx------ 2 jon jon 4096 Apr 26 23:29 checkpoint-6000/
drwx------ 2 jon jon 4096 Apr 27 00:29 checkpoint-12000/
drwx------ 2 jon jon 4096 Apr 27 01:30 checkpoint-18000/
drwx------ 2 jon jon 4096 Apr 27 02:31 checkpoint-24000/
drwx------ 2 jon jon 4096 Apr 27 03:32 checkpoint-30000/
drwx------ 2 jon jon 4096 Apr 27 04:33 checkpoint-36000/
drwx------ 2 jon jon 4096 Apr 27 05:34 checkpoint-42000/
drwx------ 2 jon jon 4096 Apr 27 06:35 checkpoint-48000/
-rw------- 1 jon jon 380 Apr 27 06:38 adapter_config.json
drwx------ 11 jon jon 4096 Apr 27 06:38 ./
drwx------ 18 jon jon 4096 Apr 27 22:10 ../
-rw------- 1 jon jon 818063245 Apr 28 00:00 adapter_model.bin
jon@pseudotensor:/data/jon/snap_llama3/llama-30b-hf.h2oaiopenassistant_oasst1_h2ogpt.8.0_epochs.31eef248d53c9f39e51c60b8b030c1e3cafc34b0.llama30b_7$
jon@pseudotensor:/data/jon/snap_llama3/llama-30b-hf.h2oaiopenassistant_oasst1_h2ogpt.8.0_epochs.31eef248d53c9f39e51c60b8b030c1e3cafc34b0.llama30b_7$ ls -alrt checkpoint-48000/
total 2403104
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_7.pth
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_6.pth
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_5.pth
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_4.pth
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_3.pth
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_2.pth
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_1.pth
-rw------- 1 jon jon 3899 Apr 27 06:35 training_args.bin
-rw------- 1 jon jon 499723 Apr 27 06:35 tokenizer.model
-rw------- 1 jon jon 715 Apr 27 06:35 tokenizer_config.json
-rw------- 1 jon jon 423 Apr 27 06:35 special_tokens_map.json
-rw------- 1 jon jon 818063245 Apr 27 06:35 pytorch_model.bin
-rw------- 1 jon jon 627 Apr 27 06:35 scheduler.pt
-rw------- 1 jon jon 557 Apr 27 06:35 scaler.pt
-rw------- 1 jon jon 1636183613 Apr 27 06:35 optimizer.pt
-rw------- 1 jon jon 5855915 Apr 27 06:35 trainer_state.json
-rw------- 1 jon jon 14583 Apr 27 06:35 rng_state_0.pth
drwx------ 2 jon jon 4096 Apr 27 06:35 ./
drwx------ 11 jon jon 4096 Apr 27 06:38 ../
jon@pseudotensor:/data/jon/snap_llama3/llama-30b-hf.h2oaiopenassistant_oasst1_h2ogpt.8.0_epochs.31eef248d53c9f39e51c60b8b030c1e3cafc34b0.llama30b_7$
So you can see the adapter_model.bin was copied over later from checkpoint-48000/pytorch_model.bin.
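In script form, that workaround looks roughly like this (run_dir is a placeholder for your own training output directory):

import shutil
from pathlib import Path

run_dir = Path("path/to/training_output")  # placeholder: your finetune output dir
# Pick the checkpoint with the highest step number.
last_ckpt = max(run_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[1]))
# Overwrite the empty adapter_model.bin with the last checkpoint's weights.
shutil.copy(last_ckpt / "pytorch_model.bin", run_dir / "adapter_model.bin")
print(f"copied {last_ckpt.name}/pytorch_model.bin -> adapter_model.bin")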
In the main directory I have: adapter_config.json, adapter_model.bin, checkpoint-69, checkpoint-72, checkpoint-75, runs. In the specific checkpoint directory (checkpoint-75): optimizer.pt, pytorch_model.bin, rng_state.pth, scaler.pt, scheduler.pt, special_tokens_map.json, tokenizer_config.json, tokenizer.json, trainer_state.json, training_args.bin.
So I only need to rename pytorch_model.bin to adapter_model.bin in the main directory? They are actually the same (the last checkpoint, of course)?
It's not super clean, but you can get an idea of what is required for LoRA from https://huggingface.co/h2oai/h2ogpt-research-oig-oasst1-512-30b-lora/tree/main for the same 30B.
Need adapter_config.json, adapter_model.bin, tokenizer.model, and tokenizer_config.json in some path used below.
I think with those alone, you can do:
python generate.py --base_model=<base model HF name or local path> --lora_weights=<path to those lora files>
And that should work.
Similarly with finetune.py, one can pass lora_weights in and continue tuning.
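Under the hood, passing --lora_weights amounts to roughly the following (a sketch with placeholder paths, not the exact h2ogpt code); is_trainable=True is what lets fine-tuning continue on an existing adapter:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "path/to/base_model",  # placeholder for the HF name or local path
    load_in_8bit=True,
    device_map="auto",
)
# Attach the saved adapter; with is_trainable=True the LoRA params stay trainable.
model = PeftModel.from_pretrained(base, "path/to/lora_files", is_trainable=True)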
E.g. gpt2.h2o.ai and my own computer were for a while running 30B as lora only, with files like:
total 799408
-rw-rw-r-- 1 jon jon 499723 Apr 28 22:30 tokenizer.model
-rw-rw-r-- 1 jon jon 715 Apr 28 22:30 tokenizer_config.json
-rw-rw-r-- 1 jon jon 423 Apr 28 22:30 special_tokens_map.json
-rw------- 1 jon jon 818063245 Apr 28 22:32 adapter_model.bin
-rw------- 1 jon jon 380 Apr 28 22:33 adapter_config.json
drwx------ 2 jon jon 4096 May 6 01:58 ./
drwx------ 85 jon jon 4096 May 11 22:06 ../
jon@pseudotensor:/data/jon/llama-30b-hf.h2oaih2ogpt-oig-oasst1-instruct-cleaned-v2.2.0_epochs.131f6d098b43236b5f91e76fc074ad089d6df368.llama30b_17$
and
GPT_H2O_AI=1 SAVE_DIR=./save/ CONCURRENCY_COUNT=1 python generate.py --base_model=decapoda-research/llama-30b-hf --lora_weights=/data/jon/llama-30b-hf.h2oaih2ogpt-oig-oasst1-instruct-cleaned-v2.2.0_epochs.131f6d098b43236b5f91e76fc074ad089d6df368.llama30b_17 --height=800 --use_auth_token=True --infer_devices=False --share=True --prompt_type=human_bot --langchain_mode='wiki_full' --visible_langchain_modes="['wiki_full', 'UserData', 'MyData', 'github h2oGPT', 'DriverlessAI docs']"
You probably won't do wiki_full but you get the idea.
I have adapter_model.bin, but it is empty, as I said. I tried to follow what you suggested, and the only option I see is to take pytorch_model.bin from the checkpoint folder, rename it to adapter_model.bin, and hope it works. The current adapter_model.bin is empty when I load it with torch.load. You said you successfully loaded a checkpoint, but in your generate.py command the --lora_weights flag points to /data/jon/llama-30b-hf.h2oaih2ogpt-oig-oasst1-instruct-cleaned-v2.2.0_epochs.131f6d098b43236b5f91e76fc074ad089d6df368.llama30b_17, which contains the adapter_model.bin, which in my case is empty now... The only .bin file in the checkpoint folder (checkpoint-75) is pytorch_model.bin, which I could rename and move to the main folder as adapter_model.bin.
@orellavie1212 Yes, what I mentioned is that you should copy the checkpoint's pytorch_model.bin over the bad adapter_model.bin.
Yes, the weights actually load successfully now. Any idea whether the last checkpoint corresponds exactly to the end of training, or could I have missed part of an epoch, so that the proper adapter_model.bin would be slightly further along?
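One way to check how far the last checkpoint got (assuming the standard HF Trainer checkpoint layout shown above) is to read its trainer_state.json; the path is a placeholder:

import json

with open("path/to/last_checkpoint/trainer_state.json") as f:
    state = json.load(f)
# global_step/epoch show where this checkpoint stopped; max_steps is the planned total.
print(state["global_step"], state["epoch"], state.get("max_steps"))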
This was fixed in the above PR.