Zheng Cai
> I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16) ...

We are trying this PR because we want Mamba to process **packed sequences**, as is done in transformer-based models. If we directly pad the sequences with zeros, then...
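For context, a minimal sketch of the packed layout we mean, assuming a flash-attention-style `cu_seqlens` convention (the names `packed` and `cu_seqlens` here are illustrative, not this PR's actual API):

```python
import torch

# Three variable-length sequences with hidden size d = 4.
d = 4
seqs = [torch.randn(L, d) for L in (3, 5, 2)]

# Padded layout: (batch, max_len, d) with zero padding. The padding
# tokens still flow through the SSM scan and pollute its hidden state.
max_len = max(s.shape[0] for s in seqs)
padded = torch.zeros(len(seqs), max_len, d)
for i, s in enumerate(seqs):
    padded[i, : s.shape[0]] = s

# Packed layout: one (total_tokens, d) tensor plus cumulative sequence
# lengths, so a varlen kernel can reset state at each sequence boundary.
packed = torch.cat(seqs, dim=0)                       # shape (10, 4)
cu_seqlens = torch.tensor([0, 3, 8, 10], dtype=torch.int32)
```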
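On the "15 vs 16 gradients" error quoted above: that class of message comes from a `torch.autograd.Function` whose `backward` returns a different number of values than `forward` takes inputs. A toy reproduction of the general pattern (not this PR's code):

```python
import torch

class Scale(torch.autograd.Function):
    """Toy custom op: backward must return exactly one gradient per
    forward input, using None for non-tensor inputs like `scale`."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x * scale

    @staticmethod
    def backward(ctx, grad_out):
        # forward took 2 inputs (x, scale), so backward must return 2
        # values; returning only one would raise the same kind of
        # "incorrect number of gradients" error as the 15-vs-16 report.
        return grad_out * ctx.scale, None

x = torch.randn(3, requires_grad=True)
Scale.apply(x, 2.0).sum().backward()
```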
I am curious because I ran into the same problem: the disk space used by Ray object spilling keeps growing until an out-of-disk error occurs.
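In case it helps, object spilling can at least be pointed at a larger volume via Ray's `_system_config`; a sketch, assuming a recent Ray version and a hypothetical path `/mnt/big_disk/ray_spill` (Ray is documented to delete spilled files once their objects go out of scope, so unbounded growth usually means the objects are still referenced somewhere):

```python
import json
import ray

# Redirect object spilling to a volume with enough free space.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem",
             "params": {"directory_path": "/mnt/big_disk/ray_spill"}}
        )
    }
)
```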
> In general, yes. Which flavor of sequence parallelism are you referring to? The one in Megatron-LM?

Thanks for your timely response! Sure. I am referring to the one in...
Got it. Thanks!
Got it. Thank you Tri Dao!