
[RFC] Initial CPU MPI implementation

Open XapaJIaMnu opened this issue 3 years ago • 9 comments

Description

This adds initial CPU MPI support, fixing https://github.com/marian-nmt/marian-dev/issues/744. The limitation is that you can only have one cpu-thread per process. It also only supports the "global" sharding mode (for now).

Are you upstream interested in a CPU MPI implementation? Comments on my implementation are very welcome.

How to test

mpirun -n X /path/to/marian -c config.yml, as long as the config sets cpu-threads: 1
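For example, with 4 MPI ranks (the rank count and the rest of config.yml are just placeholders for whatever you would normally train with):

```yaml
# config.yml -- only the relevant line; the rest of the training config is unchanged
cpu-threads: 1
```

```bash
# one Marian process per MPI rank
mpirun -n 4 /path/to/marian -c config.yml
```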

Checklist

  • [x] I have tested the code manually
  • [ ] I have run regression tests
  • [x] I have read and followed CONTRIBUTING.md
  • [x] I have updated CHANGELOG.md

XapaJIaMnu avatar Mar 17 '21 09:03 XapaJIaMnu

Windows builds genuinely fail and need to be fixed: the new code produces some warnings, which we treat as errors.

snukky avatar Mar 19 '21 16:03 snukky

@snukky thank you. I was more asking whether you are interested in having this at all, and whether it is a problem that, for now, it only works with one thread per process and only in "global" gradient-sharding mode.

Also, we don't have any MPI tests, do we?

XapaJIaMnu avatar Mar 20 '21 22:03 XapaJIaMnu

I guess having it doesn't hurt. Did you have a use case?

emjotde avatar Mar 20 '21 22:03 emjotde

Also, we don't have any MPI tests, do we?

No, I think we don't, but it would be great to have at least a unit test.

snukky avatar Mar 21 '21 16:03 snukky

I guess having it doesn't hurt. Did you have a use case?

We're expecting some hardware that isn't made by nvidia.

XapaJIaMnu avatar Mar 21 '21 16:03 XapaJIaMnu

@XapaJIaMnu Oh interesting. BTW, local sharding for a setup like 8 GPUs and 4 nodes is about 25% faster.

emjotde avatar Mar 21 '21 16:03 emjotde

@emjotde yeah, I looked at it; that's what I expect to be more efficient too, but due to differences between MPI and NCCL, implementing it is a bit more complicated, so I've put it in the queue.
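Roughly, the extra complication on the MPI side is that local sharding needs per-node sub-communicators and a second level of collectives. A sketch of just that plumbing (illustrative only, with my own naming, not the code in this PR):

```cpp
// Sketch of the communicator setup "local" sharding would need over MPI:
// ranks on the same node reduce-scatter gradients among themselves, while
// each shard is still all-reduced across nodes. Illustrative only.
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Ranks that share memory (i.e. live on the same node) end up
  // in the same node-local communicator.
  MPI_Comm localComm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &localComm);

  int worldRank, localRank;
  MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
  MPI_Comm_rank(localComm, &localRank);

  // A second split groups the ranks holding the same shard index on
  // every node, so each shard can be all-reduced across nodes.
  MPI_Comm crossComm;
  MPI_Comm_split(MPI_COMM_WORLD, /*color=*/localRank, /*key=*/worldRank,
                 &crossComm);

  // (node-local reduce-scatter, cross-node all-reduce of each shard,
  //  and node-local all-gather would go here)

  MPI_Comm_free(&crossComm);
  MPI_Comm_free(&localComm);
  MPI_Finalize();
  return 0;
}
```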

Cheers,

Nick

XapaJIaMnu avatar Mar 22 '21 15:03 XapaJIaMnu

Unless you would like to start combining it with NCCL communicator already?

emjotde avatar Mar 23 '21 22:03 emjotde

The MPI communicator offers a subset of the functionality that NCCL does. Once we (eventually) want to allow multiple threads per process, the implementations of the two big collective-operation functions will start to differ.

What do you propose?

I thought it made more sense to keep separate communicators, but I admit there is more code duplication than I would like. I can try to inherit from the NCCL communicator, but it has a lot of CUDA-specific stuff that would need to be hidden away. Furthermore, Intel provides oneCCL, an abstraction over MPI that supports more data types than MPI (but only works with Intel MPI). I guess we're not interested in having more than one communication backend, as there would be very few interested users.
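To make the duplication concrete: in "global" mode the MPI side of the gradient exchange essentially boils down to a single in-place all-reduce per update. A minimal, self-contained sketch (the allReduceGrads name and the toy buffer are made up for illustration; this is not Marian's communicator API):

```cpp
// "Global" sharding over MPI, reduced to its core: every rank holds the
// full gradient buffer and the buffers are summed with one all-reduce.
#include <mpi.h>
#include <cstdio>
#include <vector>

static void allReduceGrads(std::vector<float>& grads, MPI_Comm comm) {
  // Sum the gradient buffer across all ranks, in place.
  MPI_Allreduce(MPI_IN_PLACE, grads.data(), (int)grads.size(),
                MPI_FLOAT, MPI_SUM, comm);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank contributes its own "gradient"; afterwards every rank
  // holds the same summed values.
  std::vector<float> grads(8, float(rank + 1));
  allReduceGrads(grads, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("grads[0] after all-reduce: %.1f (expected %d)\n",
                grads[0], size * (size + 1) / 2);

  MPI_Finalize();
  return 0;
}
```

The NCCL communicator does the equivalent exchange on device buffers, which is where most of the duplication comes from.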

I'm OK with delaying the merge until I have a more complete implementation, as the code is fairly self-contained, and I find it unlikely that any changes to master would break it. (Unless you are planning some extra MPI work?)

XapaJIaMnu avatar Mar 24 '21 15:03 XapaJIaMnu