
Error when fine-tuning using SFT with QLoRA

Open Michelet-Gaetan opened this issue 1 year ago • 4 comments

Hello,

I tried to fine-tune a model using the SFT/QLoRA method provided in the handbook. Everything runs until the beginning of the training phase. At this moment, the following error occurs:

#Error#

Traceback (most recent call last):
  File "/home/michelet/my_projects/fine_tuning/alignment-handbook-main/scripts/run_sft.py", line 233, in <module>
    main()
  File "/home/michelet/my_projects/fine_tuning/alignment-handbook-main/scripts/run_sft.py", line 188, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 451, in train
    output = super().train(*args, **kwargs)
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/transformers/trainer.py", line 1929, in train
    return inner_training_loop(
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/transformers/trainer.py", line 2202, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/transformers/trainer_callback.py", line 460, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/transformers/trainer_callback.py", line 507, in call_event
    result = getattr(callback, event)(
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 681, in on_train_begin
    self.tb_writer.add_text("args", args.to_json_string())
  File "/home/michelet/my_env/new_fine_tuning_env/lib/python3.11/site-packages/transformers/training_args.py", line 2471, in to_json_string
    return json.dumps(self.to_dict(), indent=2)
  File "/home/michelet/anaconda3/lib/python3.11/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/michelet/anaconda3/lib/python3.11/json/encoder.py", line 202, in encode
    chunks = list(chunks)
  File "/home/michelet/anaconda3/lib/python3.11/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/home/michelet/anaconda3/lib/python3.11/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/home/michelet/anaconda3/lib/python3.11/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/home/michelet/anaconda3/lib/python3.11/json/encoder.py", line 439, in _iterencode
    o = _default(o)
  File "/home/michelet/anaconda3/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type BitsAndBytesConfig is not JSON serializable

#More details#

To provide more details, I'm using the following command:

    ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_qlora.yaml --load_in_4bit=true

(Note that I left the recipe untouched, except for the percentage of samples taken from the dataset, to speed up reproducing the issue.) I also tried recreating the venv several times and on different machines.

#What I did in the meantime#

The self.to_dict() call whose result is passed to json.dumps() returns a dictionary of training arguments (training_args.py, line 2471). One of these arguments is itself a dictionary, and it contains a BitsAndBytesConfig object as one of its values. That object is not serializable, but the BitsAndBytesConfig class provides a to_dict() method that converts it into a plain dictionary. So I modified the to_dict() method of the TrainingArguments class (training_args.py, line 2444). I know this should not be done, but it did work around the problem: by retrieving the nested BitsAndBytesConfig object, converting it with its to_dict() method, and substituting the resulting dictionary back into the arguments dictionary, the error is no longer triggered and the BitsAndBytesConfig appears in the log file. I don't know whether this only affects logging or whether the modification might have broken the fine-tuning process; from what I understand, the result of json.dumps() is added to a SummaryWriter, which is used only for logging, but I'm not sure.
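For reference, here is a rough sketch of the change, written as a monkey-patch rather than as the direct edit to the installed training_args.py that I actually made (the one-level nesting is an assumption from my debugging; the only library methods used are TrainingArguments.to_dict() and BitsAndBytesConfig.to_dict()):

    from transformers import BitsAndBytesConfig, TrainingArguments

    _original_to_dict = TrainingArguments.to_dict

    def _patched_to_dict(self):
        d = _original_to_dict(self)
        # Replace any BitsAndBytesConfig nested one level down with its
        # dict form so that json.dumps() in to_json_string() succeeds.
        for key, value in d.items():
            if isinstance(value, dict):
                d[key] = {
                    k: v.to_dict() if isinstance(v, BitsAndBytesConfig) else v
                    for k, v in value.items()
                }
        return d

    TrainingArguments.to_dict = _patched_to_dict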

I don't know if this is related to my installation or my use of the handbook/recipes. Did anyone run into the same error? Or can someone reproduce it?

Have a nice day!

Michelet-Gaetan avatar Aug 07 '24 09:08 Michelet-Gaetan

This could be solved by converting quantization_config in the SFT script to JSON with the to_json() method.

I realized the underlying issue is that transformers does not convert all nested configs to JSON recursively.
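A minimal sketch of what goes wrong (the config values and the model_init_kwargs key name are just illustrative; only json.dumps(), BitsAndBytesConfig, and its to_dict() method are assumed):

    import json

    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # illustrative values

    # A config nested inside a plain dict is passed through untouched, so
    # json.dumps(nested) raises:
    # TypeError: Object of type BitsAndBytesConfig is not JSON serializable
    nested = {"model_init_kwargs": {"quantization_config": bnb_config}}

    # Converting the nested config first makes the whole dict serializable.
    nested["model_init_kwargs"]["quantization_config"] = bnb_config.to_dict()
    print(json.dumps(nested, indent=2))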

deep-diver avatar Aug 07 '24 12:08 deep-diver

Thanks for your quick answer!

Should I do this conversion in the run_sft.py file from the alignment handbook? Something like <quantization_config=quantization_config.to_json()> on line 120?

Edit: I just tried it and got an error saying that BitsAndBytesConfig objects do not have a to_json() method.

Michelet-Gaetan avatar Aug 07 '24 12:08 Michelet-Gaetan

Ah, sorry, it's to_dict()

https://github.com/huggingface/transformers/blob/e0d82534cc95b582ab072c1bbc060852ba7f9d51/src/transformers/utils/quantization_config.py#L129

deep-diver avatar Aug 07 '24 12:08 deep-diver

Changing line 120 of alignment-handbook/scripts/run_sft.py from quantization_config=quantization_config to quantization_config=quantization_config.to_dict() worked!
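For anyone landing here later, the relevant spot now looks roughly like the sketch below (the surrounding model_kwargs shape is approximated from the handbook script; the None guard is my own addition for runs without --load_in_4bit, where quantization_config is None):

    from transformers import BitsAndBytesConfig

    # Stand-in for the config the handbook builds from the recipe/CLI flags.
    quantization_config = BitsAndBytesConfig(load_in_4bit=True)

    model_kwargs = dict(
        # ... the other keyword arguments from scripts/run_sft.py ...
        # was: quantization_config=quantization_config
        quantization_config=quantization_config.to_dict()
        if quantization_config is not None
        else None,
    )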

Thank you for your help!

Should I close the issue now?

Michelet-Gaetan avatar Aug 07 '24 12:08 Michelet-Gaetan