Geun, Lim
The first picture above shows the error that occurs when using the merge method after DPO training with QLoRA. When SFT-training the 7.8B model on 2 nodes (H100*8), we use a...
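For reference, here is a minimal sketch of what the merge step boils down to with plain PEFT (not the axolotl merge command itself, and all paths below are hypothetical placeholders). It may help narrow down whether the failure comes from loading the adapter or from the merge itself:

```python
# Minimal sketch: fold a (Q)LoRA adapter trained with DPO back into its base model.
# Paths are placeholders, not the actual ones from this thread.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = "path/to/base-model"       # hypothetical
adapter_path = "path/to/dpo-qlora-adapter"   # hypothetical
output_path = "path/to/merged-model"         # hypothetical

# Load the base model unquantized (bf16) so the LoRA weights can be merged into it.
base = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.bfloat16, device_map="cpu"
)

# Attach the trained adapter, then merge its weights into the base model.
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()

merged.save_pretrained(output_path)
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(output_path)
```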
The first issue has been resolved after reinstalling, thank you. I had been on v0.5.0 and only updated this time, so it is difficult to track down exactly what changed.
For now, I'll work around it by rolling back to the earlier version.
https://github.com/huggingface/trl/issues/2864
I'm training llamafy-converted models. The same issue occurs with Qwen2.5. I'm using the settings below, with zero3.json as the --deepspeed option. Please let me know if there is...
```
cache_dir: ~/cache
environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /data/axolotl/deepspeed_configs/zero3.json
  deepspeed_hostfile: /data/axolotl/hosts/hostfile
  deepspeed_multinode_launcher: pdsh
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: [main_ip]
main_process_port: [main_port]
main_training_function: main...
```
It works fine if I run it with version 0.4.1. It also runs with cpu_offload in the current version, but that takes a very long time.