Quan Sun
p.s. bsz can reach 57k when using grad checkpointing & DeepSpeed fp16 & ZeRO stage 1 & local loss
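For context, a minimal sketch of the DeepSpeed side of that combination, assuming a plain config dict with fp16 and ZeRO stage 1 (the batch sizes are placeholders, not the 57k recipe; grad checkpointing and local loss are handled on the open_clip side):

```python
# Illustrative DeepSpeed settings for fp16 + ZeRO stage 1.
# All values are placeholders, not the settings behind the 57k run.
ds_config = {
    "train_micro_batch_size_per_gpu": 256,   # per-GPU micro batch (placeholder)
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},               # mixed-precision training
    "zero_optimization": {"stage": 1},       # shard optimizer states only
}
```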
@gabrielilharco @rwightman Thanks for your comments. I will work on these changes ASAP.
Hi @gabrielilharco. You are right: get_num_layer_for_transformer(...) is not flexible, and it should warn users when a model is not supported. Do you think we could have a whitelist here? For...
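A rough sketch of the whitelist idea (the prefixes and fallback below are assumptions, not open_clip's actual naming scheme):

```python
import warnings

# Hypothetical whitelist of supported transformer block names.
SUPPORTED_BLOCK_KEYS = ("blocks", "resblocks")

def get_num_layer_for_transformer(param_name: str, num_max_layer: int) -> int:
    parts = param_name.split(".")
    # embedding-level parameters go to the first (most-decayed) layer
    if parts[0] in ("cls_token", "pos_embed", "patch_embed"):
        return 0
    for key in SUPPORTED_BLOCK_KEYS:
        if key in parts:
            # the segment after the block key is the layer index,
            # e.g. "blocks.3.attn.qkv.weight" -> layer 3
            return int(parts[parts.index(key) + 1]) + 1
    # unsupported model: warn instead of silently mis-assigning
    warnings.warn(f"'{param_name}' not in whitelist; assigning top layer")
    return num_max_layer - 1
```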
Hi @gabrielilharco. I have checked "Allow edits from maintainers." on my side. Please let me know if I missed anything.
Cool! Is this an implementation of GradAccum in [BASIC](https://arxiv.org/pdf/2111.10050.pdf)?
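For reference, plain gradient accumulation looks like the sketch below (the helper and names are illustrative, not the PR's code). Note it is not equivalent to a full-batch contrastive loss, since each micro-batch only sees its own negatives; BASIC's GradAccum recovers the full-batch loss by caching and recomputing features.

```python
import torch

def train_step_with_accum(model, loss_fn, images, texts, optimizer, num_chunks=4):
    # Split one large batch into micro-batches, sum scaled gradients,
    # then take a single optimizer step.
    optimizer.zero_grad()
    for img_chunk, txt_chunk in zip(images.chunk(num_chunks), texts.chunk(num_chunks)):
        image_features, text_features, logit_scale = model(img_chunk, txt_chunk)
        # scale so the summed gradients match the mean over the full batch
        loss = loss_fn(image_features, text_features, logit_scale) / num_chunks
        loss.backward()
    optimizer.step()
```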
Just a follow-up. Is anyone taking a look?
> @Quan-Sun oh this is my error! thanks for the fix! never mind!
Hi @rom1504 DeepSpeed may be another option for bigger models. It's also easy and effective to use (PR for this: https://github.com/mlfoundations/open_clip/pull/264). It can be applied with older versions of PyTorch, such...
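For anyone curious, a minimal sketch of the wrapping step, assuming deepspeed.initialize with a config dict (the model and all values here are stand-ins; see the PR for the actual integration):

```python
import torch
import deepspeed

# Stand-in model; in practice this would be an open_clip model.
model = torch.nn.Linear(512, 512)

ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# DeepSpeed wraps the model and builds the optimizer from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# In the training loop, loss.backward() / optimizer.step() become:
#   model_engine.backward(loss)
#   model_engine.step()
```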
We are preparing a bilingual Chinese-English version and will update it as soon as training is complete.
@will-wiki We have prepared part of the data and pretraining is underway. At the current pace, it should take about one month if all goes well.