Jacob Danovitch

Results: 32 comments by Jacob Danovitch

I'm having a similar problem when trying to run a flow that calls the same subflow multiple times, which itself calls tasks. The outer flow generates and loops over a...
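
For concreteness, a minimal sketch of the shape of that setup, assuming a Prefect-style `@flow`/`@task` API; all the names here (`outer_flow`, `subflow`, `process_item`) are made up for illustration:

```python
from prefect import flow, task

@task
def process_item(item: int) -> int:
    # Placeholder work done by the task.
    return item * 2

@flow
def subflow(item: int) -> int:
    # The subflow that is called repeatedly and itself calls a task.
    return process_item(item)

@flow
def outer_flow(n: int = 3) -> list:
    # The outer flow generates a collection and loops over it,
    # invoking the same subflow once per element.
    return [subflow(i) for i in range(n)]

if __name__ == "__main__":
    outer_flow()
```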

> I did not carefully examine the difference between the existing trainer and the DeepSpeed one, but it looks like they are almost the same? Yes, they are very...

See my comments in the issue thread for more detail. The slowdown seems to be related to gradient accumulation. The next steps are (1) seeing if the slowdown is reproducible...
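
For context on where such a slowdown could enter, this is the generic shape of a gradient-accumulation loop in PyTorch; it is illustrative only, not the trainer's actual code, and `train_epoch` is a made-up helper:

```python
import torch

def train_epoch(model, loader, optimizer, accumulation_steps: int = 4):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Scale so the accumulated gradient matches a large-batch average.
        (loss / accumulation_steps).backward()
        # Only step the optimizer every `accumulation_steps` batches.
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```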

@dirkgr I think this is ready to take a look at. Some notes thus far: * DeepSpeed is heavily config-based and it's hard to avoid, so rather than fighting it,...
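
To illustrate that config-driven style: instead of building the optimizer and scheduler in code, you hand `deepspeed.initialize` a config dict (or JSON file). The values below are illustrative defaults, not the PR's actual settings:

```python
import deepspeed
import torch

model = torch.nn.Linear(10, 2)
ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": False},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Run under the deepspeed launcher, e.g. `deepspeed train.py`.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```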

Thanks for looking it over! I'll start linting everything and getting the tests up and running (we can probably re-use the existing Trainer tests, yeah). As for the code duplication,...

> These are special `nn.Module`s that work particularly well with DeepSpeed? More or less, as far as I understand they're heavily optimized CUDA kernels that help for things like long...
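
As one example of those optimized ops being switched on through config rather than code, here is an illustrative `sparse_attention` fragment; the key names are recalled from the DeepSpeed docs and may differ across versions:

```python
# Illustrative only: enables DeepSpeed's block-sparse attention kernels,
# which target long sequences. Key names may vary by DeepSpeed version.
ds_config = {
    "train_batch_size": 8,
    "sparse_attention": {
        "mode": "fixed",         # fixed block-sparse layout
        "block": 16,             # block size of the sparsity pattern
        "num_local_blocks": 4,   # local attention window, in blocks
        "num_global_blocks": 1,  # blocks that attend globally
    },
}
```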

Still working on deduplicating code (and linting). I was able to get a lot reduced (almost the entire constructor) by lying to `super().__init__()` and passing `distributed=False` so that it...
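
A minimal sketch of that trick, with a hypothetical base `Trainer`; the real constructor in the PR has a much larger signature:

```python
class Trainer:
    def __init__(self, model, distributed: bool = False):
        self.model = model
        if distributed:
            # The base class would set up torch.distributed / DDP here.
            raise NotImplementedError

class DeepspeedTrainer(Trainer):
    def __init__(self, model, ds_config: dict):
        # Tell the parent we are *not* distributed so it skips its own
        # torch.distributed setup; DeepSpeed manages distribution itself.
        super().__init__(model, distributed=False)
        self.ds_config = ds_config
```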

Got all the typechecks out of the way, phew. I've also managed to cut out a lot of duplicated code, I think! The remainder is almost entirely checkpointing-related. For loading/saving,...
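
For reference, the DeepSpeed engine owns its own checkpoint entry points, which is why the save/load paths are hard to share with the base trainer. A sketch, reusing the `model_engine` from the earlier snippet (directory and tag names are made up):

```python
save_dir = "checkpoints"

# Saving: every rank participates, since optimizer/engine state is sharded.
model_engine.save_checkpoint(save_dir, tag="epoch_1")

# Loading: returns the resolved path plus any client state saved alongside.
load_path, client_state = model_engine.load_checkpoint(save_dir, tag="epoch_1")
```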

> Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of `if not in_deepspeed:`. Otherwise, let's just duplicate...

Sounds good. I think DeepSpeed might set some environment variables itself, similarly to torch, so I'll poke around to see if we can use one of those. If not, we...
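
A hypothetical version of that check: `torchrun`/`torch.distributed.launch` export `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, and the idea is to find an analogous variable that only the deepspeed launcher sets. `SOME_DEEPSPEED_VAR` below is a placeholder, not a real variable name:

```python
import os

def in_deepspeed() -> bool:
    # Placeholder: replace with whichever variable the deepspeed
    # launcher actually exports (still to be confirmed).
    return "SOME_DEEPSPEED_VAR" in os.environ
```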