Zheng Cai

60 comments by Zheng Cai

> > > > I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16) > > > > >...

> > > I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16) > > > > > >...

> > > > > I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16) > > > >...

We are trying this PR because we want Mamba to process **packed sequences**, as transformer-based models already do. If we directly pad the sequences with zeros, then...

> > > > > > I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16) > > >...

> > > > > > > > I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16) >...

I am curious because I ran into the same problem: the disk space used by Ray spilling seems to keep growing until an out-of-disk error occurs.

> In general, yes. Which flavor of sequence parallelism are you referring to? The one in Megatron-LM?

Thanks for your timely response! Sure. I am referring to the one in...