
[RFC] Initial CPU MPI implementation

Open XapaJIaMnu opened this issue 3 years ago • 9 comments

Description

This adds initial CPU MPI support, fixing https://github.com/marian-nmt/marian-dev/issues/744. The limitation is that you can only have one cpu-thread per process. It also only supports the "global" sharding mode (for now).

Are you upstream interested in a CPU MPI implementation? Comments on my implementation are very welcome.

How to test

mpirun -n X /path/to/marian -c config.yml, as long as the config sets cpu-threads: 1
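For example, with 4 MPI ranks (the rank count and the rest of config.yml are just placeholders for whatever you would normally train with):

```yaml
# config.yml -- only the relevant line; the rest of the training config is unchanged
cpu-threads: 1
```

```bash
# one Marian process per MPI rank
mpirun -n 4 /path/to/marian -c config.yml
```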

Checklist

  • [x] I have tested the code manually
  • [ ] I have run regression tests
  • [x] I have read and followed CONTRIBUTING.md
  • [x] I have updated CHANGELOG.md

XapaJIaMnu avatar Mar 17 '21 09:03 XapaJIaMnu

Windows builds genuinely fail and need to be fixed: the new code produces some warnings, which we treat as errors.

snukky avatar Mar 19 '21 16:03 snukky

@snukky thank you. I was more asking whether you are interested in having this at all, and whether it is a problem that, for now, it only works with one thread per process and only in "global" gradient-sharding mode.

Also, we don't have any MPI tests, do we?

XapaJIaMnu avatar Mar 20 '21 22:03 XapaJIaMnu

I guess having it doesn't hurt. Did you have a use case?

emjotde avatar Mar 20 '21 22:03 emjotde

Also, we don't have any MPI tests, do we?

No, I think we don't, but it would be great to have at least a unit test.

snukky avatar Mar 21 '21 16:03 snukky

I guess having it doesn't hurt. Did you have a use case?

We're expecting some hardware that isn't made by nvidia.

XapaJIaMnu avatar Mar 21 '21 16:03 XapaJIaMnu

@XapaJIaMnu Oh interesting. BTW, local sharding for a setup like 8 GPUs and 4 nodes is about 25% faster.

emjotde avatar Mar 21 '21 16:03 emjotde

@emjotde yeah, I looked at it; that's what I expect to be more efficient too, but due to differences between MPI and NCCL, implementing it is a bit more complicated, so I've put it in the queue.
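Roughly, the extra complication on the MPI side is that local sharding needs per-node sub-communicators and a second level of collectives. A sketch of just that plumbing (illustrative only, with my own naming, not the code in this PR):

```cpp
// Sketch of the communicator setup "local" sharding would need over MPI:
// ranks on the same node reduce-scatter gradients among themselves, while
// each shard is still all-reduced across nodes. Illustrative only.
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Ranks that share memory (i.e. live on the same node) end up
  // in the same node-local communicator.
  MPI_Comm localComm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &localComm);

  int worldRank, localRank;
  MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
  MPI_Comm_rank(localComm, &localRank);

  // A second split groups the ranks holding the same shard index on
  // every node, so each shard can be all-reduced across nodes.
  MPI_Comm crossComm;
  MPI_Comm_split(MPI_COMM_WORLD, /*color=*/localRank, /*key=*/worldRank,
                 &crossComm);

  // (node-local reduce-scatter, cross-node all-reduce of each shard,
  //  and node-local all-gather would go here)

  MPI_Comm_free(&crossComm);
  MPI_Comm_free(&localComm);
  MPI_Finalize();
  return 0;
}
```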

Cheers,

Nick

XapaJIaMnu avatar Mar 22 '21 15:03 XapaJIaMnu

Unless you would like to start combining it with NCCL communicator already?

emjotde avatar Mar 23 '21 22:03 emjotde

The MPI communicator offers a subset of the functionality that NCCL does. Once we (eventually) want to allow multiple threads per process, the implementations of the two big collective-operation functions will start to differ.

What do you propose?

I thought it made more sense to keep separate communicators, but I admit there is more code duplication than I would like. I can try to inherit from the NCCL communicator, but it has a lot of CUDA-specific stuff that would need to be hidden away. Furthermore, Intel provides oneCCL, an abstraction over MPI that supports more data types than MPI (but only works with Intel MPI). I guess we're not interested in having more than one communication backend, as there would be very few interested users.
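To make the duplication concrete: in "global" mode the MPI side of the gradient exchange essentially boils down to a single in-place all-reduce per update. A minimal, self-contained sketch (the allReduceGrads name and the toy buffer are made up for illustration; this is not Marian's communicator API):

```cpp
// "Global" sharding over MPI, reduced to its core: every rank holds the
// full gradient buffer and the buffers are summed with one all-reduce.
#include <mpi.h>
#include <cstdio>
#include <vector>

static void allReduceGrads(std::vector<float>& grads, MPI_Comm comm) {
  // Sum the gradient buffer across all ranks, in place.
  MPI_Allreduce(MPI_IN_PLACE, grads.data(), (int)grads.size(),
                MPI_FLOAT, MPI_SUM, comm);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank contributes its own "gradient"; afterwards every rank
  // holds the same summed values.
  std::vector<float> grads(8, float(rank + 1));
  allReduceGrads(grads, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("grads[0] after all-reduce: %.1f (expected %d)\n",
                grads[0], size * (size + 1) / 2);

  MPI_Finalize();
  return 0;
}
```

The NCCL communicator does the equivalent exchange on device buffers, which is where most of the duplication comes from.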

I'm OK with delaying the merge until I have a more complete implementation, as the code is fairly self-contained, and I find it unlikely that any changes to master would break it. (Unless you are planning some extra MPI work?)

XapaJIaMnu avatar Mar 24 '21 15:03 XapaJIaMnu