Lu junru

11 comments from Lu junru

> For testing purposes, I keep only one training file in -bert_data_path and name it .train.pt; the command runs without any problem (but it only has about 2k examples...

Has anyone found a solution? I am trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and cannot save at the end, as saving...

@djaym7 I can train 3B, 7B and 13B in the same environment. In particular, these three models consume a normal amount of RAM, e.g. 100G~200G. However, 33B dramatically consumes CPU...

@nrailgun Have you tried it with offload? In my case, I offload the optimizer to RAM for 33B, and it does train smoothly. The issue occurs during saving.
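For reference, "offload optimizer to RAM" here usually means a DeepSpeed ZeRO-3 config along these lines. This is a minimal sketch assuming the HF Trainer + DeepSpeed integration; all values are illustrative and not taken from the actual run:

```python
# Sketch of a ZeRO-3 config with optimizer states offloaded to CPU RAM.
# "auto" fields are filled in by the HF Trainer integration.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # Offload optimizer states (e.g. Adam moments) to CPU RAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        # Gather a full 16-bit copy of the weights when saving; this gather
        # step is often where large models spike CPU memory at checkpoint time.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Pass it to the HF Trainer via TrainingArguments(deepspeed=ds_config),
# or save it as a JSON file and hand the path to the deepspeed launcher.
```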

@djaym7 Not yet. I recommend following Alpaca: https://github.com/tatsu-lab/stanford_alpaca. Most of the settings are similar.

> Has anyone found a solution? I am trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and cannot save at the end as...

@memray Exactly. I used DeepSpeed ZeRO-3 offload + FlashAttention.

@memray You could probably test the official strategies; here is one from HF (https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance): First of all, set batch size to 1 (you can always use gradient accumulation for any desired...
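The strategy from that HF page (start at micro batch size 1 and recover the effective batch size with gradient accumulation) would look roughly like this with the Trainer API. This is a sketch only; the specific numbers and the `ds_config.json` filename are assumptions, not settings from the thread:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # smallest micro batch first
    gradient_accumulation_steps=16,  # recover the desired effective batch size
    gradient_checkpointing=True,     # trade extra compute for activation memory
    bf16=True,
    deepspeed="ds_config.json",      # e.g. the ZeRO-3 offload config sketched above
)
```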

@s1ghhh Sure. Here are some configs: deepspeed 0.9.2, torch 2.0.1 (FlashAttention is included in it), CUDA V11.3.109. I used 800G of CPU RAM with batch size 8, accumulation 2, and received...
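For what it's worth, "FlashAttention is included in it" presumably refers to the FlashAttention backend of `torch.nn.functional.scaled_dot_product_attention` that ships with torch 2.0.x. A minimal sketch of forcing that kernel (shapes and dtypes are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative (batch, heads, seq_len, head_dim) tensors in bf16 on GPU.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)

# Restrict SDPA to the flash kernel so it fails loudly if flash is unavailable.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```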

@s1ghhh I'm afraid I can't right now. We hope to release it next month.