sdtblck

Results: 18 comments

> In addition to `to_sequential`, there may be another way we could accomplish this while keeping the normal `PipelineModule`, if that would be useful.
>
> If we short-circuit this...

> This is a great idea, thanks @sdtblck!
>
> One caveat is that we lose the activation checkpointing that the `PipelineModule`'s forward can be configured to use. But...
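For context, the trade-off being referenced can be sketched in plain Python. This is a toy illustration only (no autograd; the real mechanism is DeepSpeed's activation checkpointing inside `PipelineModule.forward`): instead of caching every intermediate activation for the backward pass, checkpointing caches only each segment's input and recomputes the activations when they are needed.

```python
# Toy sketch of activation checkpointing (illustrative only; the real
# implementation lives in DeepSpeed and works with autograd).
# Rather than storing every intermediate activation, we store only each
# segment's *input* and recompute the rest on demand, trading compute
# for memory.

def forward_with_checkpoints(segments, x):
    """Run `segments` in order, caching only each segment's input."""
    cached_inputs = []
    for seg in segments:
        cached_inputs.append(x)
        x = seg(x)
    return x, cached_inputs

def recompute_activations(segments, cached_inputs):
    """During 'backward', rebuild the activations from the cached inputs."""
    return [seg(inp) for seg, inp in zip(segments, cached_inputs)]

if __name__ == "__main__":
    layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    out, cache = forward_with_checkpoints(layers, 3)
    print(out)                                   # 5
    print(recompute_activations(layers, cache))  # [4, 8, 5]
```

A plain `to_sequential` conversion would skip this machinery entirely, which is the caveat being raised: the sequential path keeps every activation live.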

Hi @ShadenSmith, I think the two latest commits should fix both of the above requirements. There may be some repeated code between `SequentialModel` and `PipelineModule` that could be slimmed down -...

Ah ok, I think I misunderstood what the `add_prefix_space` option does. I had assumed it controlled whether the space between words was attached to the beginning or the end of the token....
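For anyone else confused by this: in GPT-2-style byte-level BPE, the space before a word is attached to the *start* of that word's token (rendered as the `Ġ` marker), and `add_prefix_space` only controls whether the very first word also gets that leading space. A rough pure-Python sketch of the convention follows -- this is not the real tokenizer (merges and byte-level encoding are ignored), just an illustration of where the space goes.

```python
import re

def space_prefixed_pieces(text, add_prefix_space=False):
    """Rough sketch of GPT-2's byte-level BPE space convention: each
    leading space is attached to the *following* word and rendered as
    the 'G-dot' marker. Not the actual tokenizer; merges are ignored."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return [("\u0120" + w[1:]) if w.startswith(" ") else w
            for w in re.findall(r" ?\S+", text)]

print(space_prefixed_pieces("hello world"))
# ['hello', 'Ġworld'] -- the space belongs to 'world', not to 'hello'
print(space_prefixed_pieces("hello world", add_prefix_space=True))
# ['Ġhello', 'Ġworld'] -- first word now treated like a mid-sentence word
```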

Ok @sweinbach, I did some pretty extensive testing/debugging of this, and had to change a fair few things to get it to work at all, but I think...

@sweinbach anything else to add, or would you say this is ready to merge?

Aside from the above ^ lgtm :rocket:

> > I'm not sure of the motivation for the changes to fused_kernels.py. Also, I really don't like the requirements being so granular like this. I don't see the need...

@hyunwoongko actually in neox we also load onto the CPU and then move to the GPU, so I'm not sure this is a problem.

Hey @zhuzilin, really interesting! Firstly, w.r.t. the speed difference between pp=0 and pp=1: we also found something similar, see https://github.com/EleutherAI/gpt-neox/pull/269. Although maybe the speed difference isn't...