Baizhou Zhang
> Did you get the "Train the reward model" stage working? Which pretrain do you use, and where do you download it?

The pretrain doesn't have to be downloaded; any model available on Hugging Face (e.g. 'gpt2') can be used directly. I made a small change to the source of train_reward_model.py: for some reason it only uses the BLOOM model, so I switched it to gpt2, which lets the command-line pretrain argument take 'gpt2' directly. ('bloom' also seems to be on Hugging Face, so you could try passing 'bloom' as well.) A minor gripe about train_reward_model.py: for some reason it only has a single default of bloom. Ideally it would support three models like train_prompts.py does, selected via a --model argument (see the sketch below).
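For reference, a rough sketch of what such a `--model` switch could look like, mirroring `train_prompts.py`. The `coati.models` imports and the `pretrained=` keyword are assumptions based on how the other Chat training scripts are laid out; adjust them to whatever your copy of `train_reward_model.py` actually uses.

```python
# Hypothetical sketch: expose both --model and --pretrain like train_prompts.py does.
# The reward-model classes below are assumed to live in coati.models; verify against your checkout.
import argparse


def build_reward_model(model: str, pretrain: str):
    if model == "gpt2":
        from coati.models.gpt import GPTRM
        return GPTRM(pretrained=pretrain)
    elif model == "bloom":
        from coati.models.bloom import BLOOMRM
        return BLOOMRM(pretrained=pretrain)
    elif model == "llama":
        from coati.models.llama import LlamaRM
        return LlamaRM(pretrained=pretrain)
    raise ValueError(f"unsupported model type: {model}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", choices=["gpt2", "bloom", "llama"], default="bloom")
    parser.add_argument("--pretrain", type=str, default=None,
                        help="any Hugging Face model id or local path, e.g. 'gpt2'")
    args = parser.parse_args()
    reward_model = build_reward_model(args.model, args.pretrain)
```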
Hello, to fine-tune llama2-7B on a single 24GB 4090, you can try `GeminiPlugin` with `placement_policy` set to `static` and with `offload_optim_frac` and `offload_param_frac` increased (as high as needed until OOM no longer occurs). Using `LowLevelZeroPlugin` with `cpu_offload` set to True might also work. A rough configuration sketch follows below.
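A minimal sketch of the two suggested setups, not a drop-in script; the offload fractions, ZeRO stage, and the toy model/optimizer are placeholders to swap for your llama2-7B training code and tune until OOM disappears.

```python
# Run with torchrun; newer ColossalAI versions may not need the `config` argument.
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin, LowLevelZeroPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})

# Option 1: Gemini with static placement, pushing most optimizer/parameter memory to CPU.
plugin = GeminiPlugin(
    placement_policy="static",
    offload_optim_frac=1.0,   # start high, lower if you have spare GPU memory
    offload_param_frac=1.0,
)

# Option 2: low-level ZeRO with CPU offload.
# plugin = LowLevelZeroPlugin(stage=2, cpu_offload=True)

booster = Booster(plugin=plugin)

# Replace the toy model/optimizer with your llama2-7B model and its optimizer.
model = torch.nn.Linear(16, 16)
optimizer = HybridAdam(model.parameters(), lr=1e-5)
model, optimizer, *_ = booster.boost(model, optimizer)
```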
Hi, are you loading from a pretrained Huggingface checkpoint? Is the download time included in this hour?
> I am using 2 nodes, each with 8 A100 40GB GPUs.
> I find that Gemini will first use 41GB of CPU memory and 4-5GB of GPU memory for each process (GPU)...
Hi, I just tested the loading speed on an 8x A800 80GB node, and it took 75s to load a 7B model with the Gemini Plugin. So the loading time for a...
Hi, what's your batch size on each GPU? The microbatch size is the unit of data passed through each stage when using pipeline parallelism (see the sketch below). If your batch size is more than 1, I recommend lower...
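A rough sketch of how the microbatch size relates to the per-GPU batch size when configuring pipeline parallelism; the `HybridParallelPlugin` parameter names and the batch-size value are assumptions to check against the ColossalAI version you are running.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

BATCH_SIZE_PER_GPU = 4   # assumed example value

# Each per-GPU batch is split into BATCH_SIZE_PER_GPU / microbatch_size microbatches
# that flow through the pipeline stages one after another.
plugin = HybridParallelPlugin(
    tp_size=1,
    pp_size=2,           # two pipeline stages
    microbatch_size=1,   # keep this small relative to the per-GPU batch size
)
booster = Booster(plugin=plugin)
```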
If the OOM error happens before the training loop, initializing the model under `LazyInitContext` might solve the problem (for usage, refer to `examples/language/llama2/pretrain.py`; a rough sketch also follows below). If the OOM happens during...
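A minimal sketch of the lazy-initialization pattern, loosely following `examples/language/llama2/pretrain.py`; the `LlamaConfig` placeholder and the exact `LazyInitContext` arguments are assumptions that may differ across ColossalAI versions.

```python
from colossalai.lazy import LazyInitContext
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig()  # placeholder; use the config of your actual checkpoint

# Parameters created inside the context stay lazy, so the full model is never
# materialized on a single device before booster.boost() shards it.
with LazyInitContext():
    model = LlamaForCausalLM(config)

# later: model, optimizer, *_ = booster.boost(model, optimizer)
```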