
Q on comparison with SFTTrainer

Open RonanKMcGovern opened this issue 10 months ago • 0 comments

The README mentions:

The SFTTrainer version has to run with a lower batch size (4 vs 8) so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.

Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use a higher number of gradient accumulation steps, so that the effective batch sizes match?
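To make the arithmetic behind my question concrete, here is a rough sketch. The helper name `effective_batch` and the 2-GPU default are my own assumptions for illustration; only the batch sizes (4 vs 8) and accumulation steps (2 vs 4) come from the README quote.

```python
def effective_batch(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 2) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

# As the README reads: SFTTrainer uses batch 4 with 2 accumulation steps,
# QLoRA+FSDP uses batch 8 with 4 steps -- the effective batch sizes differ.
assert effective_batch(4, 2) != effective_batch(8, 4)  # 16 vs 64

# If the accumulation counts were swapped, the effective sizes would match,
# which is what I'd expect the comparison to aim for.
assert effective_batch(4, 4) == effective_batch(8, 2)  # both 32
```

The GPU count cancels out of the comparison either way, since it multiplies both configurations equally.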

Separately, I note that the SFTTrainer and FSDP runs take the same time on the graph shown. I assume SFTTrainer is using DDP, so shouldn't it be quite a bit slower, perhaps even close to 2x, since the smaller batch size means more forward passes are required?

RonanKMcGovern avatar Apr 01 '24 11:04 RonanKMcGovern