
Deepspeed integration

Open jacobdanovitch opened this issue 4 years ago • 27 comments

Draft for #4634. Small change to allennlp.training.metrics.categorical_accuracy addresses my comment in #4623. Still very rough, but functional.

Example config: gist

jacobdanovitch avatar Oct 02 '20 14:10 jacobdanovitch

I did not carefully examine the difference between the existing trainer and the deep speed one, but it looks like they are almost the same?

dirkgr avatar Oct 03 '20 01:10 dirkgr

I did not carefully examine the difference between the existing trainer and the deep speed one, but it looks like they are almost the same?

Yes, they are very similar. The main differences are that deepspeed's model engine handles things like gradient accumulation, gradient-norm clipping, and learning rate schedulers, so I removed a lot of that functionality and modified the backprop step. I may also need to adjust the checkpointing.

This is an MWE, but like I was talking about in the issue thread, I think it could be further optimized by avoiding the direct use of their model engine, which I'll take a look at next.

jacobdanovitch avatar Oct 03 '20 15:10 jacobdanovitch

See my comments in the issue thread for more detail. The slowdown seems to be related to gradient accumulation. The next steps are (1) seeing if the slowdown is reproducible on other machines and (2) confirming the right place in the library for this to go. Once those are both resolved I'm going to simplify the model engine using Registrables and then this should be ~reviewable.

jacobdanovitch avatar Oct 07 '20 16:10 jacobdanovitch

@dirkgr I think this is ready to take a look at. Some notes thus far:

  • Deepspeed is heavily config-based and it's hard to avoid, so rather than fighting it, I just tried to make the config itself registrable (rough sketch after this list). So everything can be handled from within one config file.
  • That said, some things (like optimizers) can be instantiated directly, so I've made it possible to either pass an optimizer to the trainer as normal, or pass an optimizer config to deepspeed. If it's a bit confusing to have both ways, it'll be easy to pick one and stick with it.
  • There's still a lot of code overlap with the GradientDescentTrainer. I don't know if this is avoidable without breaking the trainer into various hooks like Lightning does.
  • I've only worked on the optimization/scaling-related features, I haven't looked at Sparse Attention / the transformer kernel yet (in allennlp or in general). I can try to add these things or just wait until the existing work is done.
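
For concreteness, here's a rough sketch of what "making the config itself registrable" could look like; the class name, fields, and defaults are illustrative, not the PR's actual API:

from typing import Any, Dict, Optional

from allennlp.common import FromParams


class DeepspeedConfig(FromParams):
    """Hypothetical wrapper so the DeepSpeed config can live inside the AllenNLP config file."""

    def __init__(
        self,
        zero_optimization: Optional[Dict[str, Any]] = None,
        fp16: Optional[Dict[str, Any]] = None,
        optimizer: Optional[Dict[str, Any]] = None,  # optional: let DeepSpeed build the optimizer itself
    ) -> None:
        self.zero_optimization = zero_optimization or {}
        self.fp16 = fp16 or {"enabled": False}
        self.optimizer = optimizer

    def to_dict(self) -> Dict[str, Any]:
        # This is what would eventually be handed to deepspeed.initialize as its config dict.
        config: Dict[str, Any] = {
            "zero_optimization": self.zero_optimization,
            "fp16": self.fp16,
        }
        if self.optimizer is not None:
            config["optimizer"] = self.optimizer
        return config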

jacobdanovitch avatar Nov 05 '20 18:11 jacobdanovitch

I haven't looked at Sparse Attention / the transformer kernel yet (in allennlp or in general).

These are special nn.Modules that work particularly well with DeepSpeed?

dirkgr avatar Nov 05 '20 23:11 dirkgr

Thanks for looking it over! I'll start linting everything and getting the tests up and running (we can probably re-use the existing Trainer tests, yeah).

As for the code duplication, the biggest overlaps are:

  • Checkpointing (I don't think this will be too duplicated when all is said and done)
  • Certain parts of _train_epoch
  • Almost all of _validation_loss

With Deepspeed, their training looks like:

for batch in data_loader:
    loss = model_engine(**batch) # <- same as GDTrainer, but different attribute
    model_engine.backward(loss) # <- different
    model_engine.step() # <- different

The model_engine handles gradient accumulation, learning rate scheduling, optimization, etc, so these specific parts end up different but then the rest is all the same. That includes stuff in _train_epoch like:

  • CPU/GPU memory usage
  • Regularization penalty
  • Tensorboard logging
  • Metrics / progress bar stuff

And then almost all of _validation_loss is duplicated; I'm pretty sure the only change I made was model_engine.backward(loss).

I think I'll be able to inherit train directly as well if I inherit from the GD trainer, but I wouldn't be able to use its constructor, so I'd have to do something like:

@Trainer.register("deepspeed", constructor="from_partial_objects")
class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size):
        Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)

I don't immediately see any problems with that, so if that sounds good, it should help reduce duplication too. Overall, the problematic overlap is where I just have to call one thing differently, like model_engine.backward(loss). Lightning's hooks pattern helps reduce this a bit, but that would be a bit more of a refactor than just changing a few lines.
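
To make the hooks idea concrete, a hook-style split might look roughly like this (toy stand-in classes for illustration, not actual AllenNLP or PR code):

import torch


class BackwardHookMixin:
    # Hypothetical hook: the trainer calls self._backward(loss) instead of loss.backward() directly.
    def _backward(self, loss: torch.Tensor) -> None:
        loss.backward()


class DeepspeedBackwardHookMixin(BackwardHookMixin):
    def __init__(self, model_engine) -> None:
        self.model_engine = model_engine

    def _backward(self, loss: torch.Tensor) -> None:
        # DeepSpeed's engine owns gradient accumulation, clipping, and the optimizer step.
        self.model_engine.backward(loss)
        self.model_engine.step()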

jacobdanovitch avatar Nov 09 '20 19:11 jacobdanovitch

These are special nn.Modules that work particularly well with DeepSpeed?

More or less; as far as I understand, they're heavily optimized CUDA kernels that help with things like long sequences and are more efficient in general.

jacobdanovitch avatar Nov 09 '20 19:11 jacobdanovitch

class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size):
        Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)

I don't understand this. Do you mean you would not be able to do this normal thing?

class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(...):
        super().__init__(...)

dirkgr avatar Nov 11 '20 00:11 dirkgr

If it's too difficult to not duplicate code, let's not do it. I looked at the code for _validation_loss, and while it would be a shame to have all those lines twice, it's pretty tightly integrated there, and I wouldn't want to compromise the regular trainer for this too much.

dirkgr avatar Nov 11 '20 01:11 dirkgr

Still working on deduplicating code (and linting). I was able to get a lot reduced (almost the entire constructor) by lying to super().__init__() and passing distributed=False so that it wouldn't try to set up DDP. I also gave up on the sparse attention embedder; too much stuff needed to be installed with root and I couldn't get around it.
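
Roughly what that constructor trick looks like (the deepspeed_config parameter and attribute name here are illustrative, not the PR's exact interface):

from allennlp.training import GradientDescentTrainer


class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(self, model, optimizer, data_loader, deepspeed_config=None, **kwargs):
        # Pass distributed=False so the parent constructor skips wrapping the model in
        # DistributedDataParallel; DeepSpeed's engine handles the distributed setup itself.
        super().__init__(model, optimizer, data_loader, distributed=False, **kwargs)
        self._deepspeed_config = deepspeed_config  # hypothetical attribute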

Just so I know explicitly how far I should go, should I just not touch anything at all inside of the GradientDescentTrainer? The stuff inside of for batch_group in batch_group_generator_tqdm: in _try_train could potentially be trainer-dependent while everything outside of and around it should be common to all trainers, so that could be a useful thing to put in a hook-like method to prevent a bunch of repetition. But if you'd rather I just not touch anything at all I'll leave it.

Some nice news is that deepspeed is now pip installable, so it's a lot easier to get everything configured.

jacobdanovitch avatar Nov 23 '20 19:11 jacobdanovitch

I'm fine with small modifications to the regular trainer, but what you're proposing sounds like a bigger deal, so let's hold off on that. We may want to re-visit when and if deep speed proves permanently useful.

dirkgr avatar Nov 25 '20 19:11 dirkgr

Got all the typechecks out of the way, phew. I've also managed to cut out a lot of duplicated code, I think! The remainder is almost entirely checkpointing-related. For loading/saving, there's a bit of duplication here and there, but nothing overwhelming, and the rest is delegated to deepspeed. The last thing that would be really, really nice to get around is the dist.barrier() calls in GradientDescentTrainer:

Lines 1006-1008, 1091-1093

# Wait for the master to finish saving the model checkpoint
if self._distributed:
    dist.barrier()

I could be wrong, given my limited understanding of DDP, but as far as I can tell, this causes a fatal hang for deepspeed, which also calls dist.barrier() within checkpointing. This seems like it could be a fairly common situation for anyone trying to override the trainer for distributed-related stuff (like Fairscale, or maybe the upcoming huggingface model parallelism work).

Do you think there's a clean solution to this? That's ~150 LOC duplicated for the removal of 4 lines, which isn't great. Is there a way this could be delegated to the checkpointer itself, perhaps? Once that's settled, it should just be tests and dependencies left to do.

jacobdanovitch avatar Nov 30 '20 16:11 jacobdanovitch

Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of if not in_deepspeed:. Otherwise, let's just duplicate it.

dirkgr avatar Dec 05 '20 01:12 dirkgr

Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of if not in_deepspeed:. Otherwise, let's just duplicate it.

I mean, one easy way would just be to set an environment variable DEEPSPEED=1 :stuck_out_tongue: Would that work, or would you rather have a more general solution?
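
A sketch of what that would look like (DEEPSPEED=1 is the flag proposed above, i.e. something we'd set ourselves; it is not an official DeepSpeed variable):

import os


def in_deepspeed_context() -> bool:
    # Relies on us exporting DEEPSPEED=1 before launching; purely our own convention.
    return os.environ.get("DEEPSPEED") == "1"


# Then the barrier in GradientDescentTrainer._try_train could become conditional, e.g.:
# if self._distributed and not in_deepspeed_context():
#     dist.barrier()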

jacobdanovitch avatar Dec 05 '20 02:12 jacobdanovitch

If deepspeed doesn't have some sort of global context (like dist does), then let's duplicate the code. I'm not that comfortable with inventing our own global flags, but if they are already there, defined and documented by major libraries, I'm OK using them.

dirkgr avatar Dec 05 '20 02:12 dirkgr

Sounds good. I think Deepspeed might set some environment variables itself, similarly to torch, so I'll poke around to see if we can use one of those. If not, we can just duplicate for now and I'll proceed to testing.

jacobdanovitch avatar Dec 05 '20 02:12 jacobdanovitch

Took a holiday break from this while our cluster was down for maintenance. Turns out that checkpointing/barrier issue might be more complicated than I thought, but I'm not sure if it's something to do with our cluster (the networking seems buggy; all_reduce and such often hang outside of allennlp for me). It's freezing while collecting memory usage now, which is odd, so I'm still trying to figure that out.

Outside of that, I have a basic test working (well, when the above works), but it's more complicated than the existing trainer tests because all it's really doing is testing distributed training, which requires dist.init_process_group and the whole works. I don't see that being done for the gradient descent trainer, so I'm not sure if I should be doing that.

jacobdanovitch avatar Jan 03 '21 19:01 jacobdanovitch

Collecting memory usage often comes down to shelling out to nvidia-smi, which holds a lock. If you have a lot of processes calling nvidia-smi, you can see deadlocks. We’ve had this happen on our clusters as well.

Distributed training has tests. It’s tested from the test for the training command, not on the trainer directly: https://github.com/allenai/allennlp/blob/main/tests/commands/train_test.py#L191

You could do the same thing for the deep speed trainer. Just test it end-to-end, not individually.
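
For reference, an end-to-end test along those lines might look roughly like this (the fixture path and test name are made up):

from allennlp.commands.train import train_model_from_file
from allennlp.common.testing import AllenNlpTestCase


class TestDeepspeedTrainer(AllenNlpTestCase):
    def test_train_model_deepspeed(self):
        # Mirrors test_train_model_distributed: run the train command end-to-end
        # with a config that selects the (hypothetical) "deepspeed" trainer.
        train_model_from_file(
            self.FIXTURES_ROOT / "deepspeed" / "experiment.jsonnet",
            self.TEST_DIR / "deepspeed_test",
        )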

dirkgr avatar Jan 07 '21 01:01 dirkgr

Ah I think I see the real issue here. It's not the logging itself hanging.

  1. (All ranks) My trainer tells my checkpointer to save if it's the master process
  2. (Rank 0) My checkpointer delegates to DeepspeedEngine.save_checkpoint.
  3. (Rank 0) DeepspeedEngine.save_checkpoint calls dist.barrier()
  4. (Ranks 1-n) The other workers never get a chance to make the above barrier call, which leads to a deadlock, as reported in pytorch here.

Any subsequent distributed operation, including the all_reduce in memory logging, then hangs (honestly, I'm not sure how they make it through the barriers). So, in fact, the real culprits were not the dist.barrier() calls I mentioned above, but rather the if self._master and self._checkpointer is not None: checks. Everything works perfectly when the self._master half of the condition is removed.
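
To illustrate the pattern in isolation (plain torch.distributed, nothing AllenNLP- or DeepSpeed-specific; this script intentionally hangs to demonstrate the deadlock):

import os

import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        # Stand-in for DeepspeedEngine.save_checkpoint, which calls dist.barrier() internally.
        dist.barrier()  # hangs: ranks 1..n never reach a matching barrier
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(_worker, args=(2,), nprocs=2)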

So, circling back to the code duplication issue: it's not so much being in a deepspeed context that I need to check for, it's something like this:


# trainer.py => _try_train

if self._checkpointer is not None and self._checkpointer.call_on_rank(self._rank):

# checkpointer

class Checkpointer(Registrable):
    def call_on_rank(self, rank: int) -> bool:
        return rank == 0

# deepspeed checkpointer

class DeepspeedCheckpointer(Checkpointer):
    @overrides
    def call_on_rank(self, rank: int) -> bool:
        return True

So if you're open to adding some sort of flag like that to the base checkpointer, that would solve the issue. Or, we could check if it's a deepspeed checkpointer:

if self._checkpointer is not None and (self._master or isinstance(self._checkpointer, DeepspeedCheckpointer)):

But that might cause circular-import issues, or problems for those who want to install without deepspeed. Either one would let me completely eliminate my override of _try_train, as long as I have a way to force self._checkpointer.save_checkpoint(epoch, self) to be called in every process.


You could do the same thing for the deep speed trainer. Just test it end-to-end, not individually.

Yep this worked perfectly, thanks. Exact same as test_train_model_distributed but different config. Should I decorate it with @requires_multi_gpu?

jacobdanovitch avatar Jan 08 '21 20:01 jacobdanovitch

Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right?

Should I decorate it with @requires_multi_gpu?

It makes sense to me even abstractly that something like Deepspeed can only be accurately tested on a multi-GPU box.

dirkgr avatar Jan 08 '21 23:01 dirkgr

Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right?

Their typical training loop is something like (source):

# load checkpoint
for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()

    if step % args.save_interval == 0:
        ckpt_id = loss.item()
        model_engine.save_checkpoint(args.save_dir, ckpt_id)

Note that all of these calls are made on every worker, so every worker enters the model_engine.save_checkpoint function and hits the dist.barrier call(s) contained within. I believe that if you did if step % args.save_interval == 0 and dist.get_rank() == 0 then you'd have the same problem. It's sort of a case of Deepspeed and AllenNLP both trying to help the user and running into each other as a result.

jacobdanovitch avatar Jan 10 '21 15:01 jacobdanovitch

I see. We can always determine our rank, right? So we could just move the rank check into the checkpointer. The regular checkpointer will say if rank() != 0: return, and the DeepSpeed one will proceed with all the ranks. Does that work?
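
Very roughly (the save_checkpoint signature and the model_engine attribute below are assumptions, not the PR's actual interface):

import torch.distributed as dist

from allennlp.training.checkpointer import Checkpointer


class RankZeroCheckpointer(Checkpointer):
    def save_checkpoint(self, epoch, trainer, **kwargs) -> None:
        # Regular behaviour: only rank 0 writes the checkpoint.
        if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
            return
        super().save_checkpoint(epoch, trainer, **kwargs)


class DeepspeedCheckpointer(Checkpointer):
    def save_checkpoint(self, epoch, trainer, **kwargs) -> None:
        # Every rank calls into DeepSpeed; its engine coordinates (and barriers) internally.
        trainer.model_engine.save_checkpoint(trainer._serialization_dir, tag=str(epoch))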

dirkgr avatar Jan 14 '21 22:01 dirkgr

Yeah that should work perfectly, I'll give it a try.

jacobdanovitch avatar Jan 14 '21 22:01 jacobdanovitch

Hey, is there any news regarding this PR? I'd be interested in using Deepspeed with AllenNLP.

yanaiela avatar Apr 11 '21 20:04 yanaiela

@epwalsh is testing it as we speak!

dirkgr avatar Apr 13 '21 21:04 dirkgr

@dirkgr @yanaiela @jacobdanovitch any blockers on this PR? Happy to help answer any deepspeed related questions that might be causing issues here.

jeffra avatar Jul 30 '21 17:07 jeffra

We recently integrated FairScale to get the ZeRO optimizer into AllenNLP. It would be interesting to have DeepSpeed as well, since it has more features, but it's no longer quite so pressing. If anyone wants to pick up @jacobdanovitch's work and bring it over the finish line, I'd be happy to work with you.

dirkgr avatar Aug 02 '21 20:08 dirkgr