marian-dev
[RFC] Initial CPU MPI implementation
Description
This adds initial CPU MPI support, fixing https://github.com/marian-nmt/marian-dev/issues/744. The limitation is that you can have only one cpu-thread per process. It also only supports the "global" sharding mode (for now).
Are you upstream folks interested in a CPU MPI implementation at all? Comments on my implementation are very welcome.
How to test
mpirun -n X /path/to/marian -c config.yml
as long as the configuration sets cpu-threads: 1
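For reference, a minimal config sketch of the relevant settings (the values shown are illustrative; only the cpu-threads constraint comes from this PR):

```yaml
# config.yml fragment -- illustrative, not a complete training config
cpu-threads: 1      # required by this PR: one CPU thread per MPI process
sharding: global    # the only sharding mode supported so far
```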
Checklist
- [x] I have tested the code manually
- [ ] I have run regression tests
- [x] I have read and followed CONTRIBUTING.md
- [x] I have updated CHANGELOG.md
Windows builds fail and need to be fixed: the new code produces some warnings, and we treat warnings as errors.
@snukky thank you. I was more so asking whether you are at all interested in having this, and whether it is a problem that, for now, it only works with one thread per process and only in "global" gradient-shard mode.
Also, we don't have any MPI tests, do we?
I guess having it doesn't hurt. Did you have a use case?
Also, we don't have any MPI tests, do we?
No, I don't think we do, but it would be great to have at least a unit test.
I guess having it doesn't hurt. Did you have a use case?
We're expecting some hardware that isn't made by nvidia.
@XapaJIaMnu Oh interesting. BTW, local sharding for a setup like 8 GPUs and 4 nodes is about 25% faster.
@emjotde yeah, I looked at it; that's what I expect to be more efficient, too. But due to differences between MPI and NCCL, implementing it is a bit more complicated, so I've put it on the queue.
Cheers,
Nick
Unless you would like to start combining it with NCCL communicator already?
The MPI communicator offers a subset of the functionality that the NCCL one does. When we (eventually) want to allow multiple threads per process, the implementations of the two big collective-operation functions will start to differ.
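For readers following along, the core collective a CPU communicator must provide is an all-reduce over per-rank gradient shards. A minimal sketch of that semantics, with std::thread standing in for MPI processes (a real implementation would call MPI_Allreduce; none of these names are Marian's actual API):

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a communicator's all-reduce: every rank
// contributes its local gradients, and every rank ends up holding the
// element-wise sum. Threads simulate MPI ranks here for illustration.
void allReduceSum(std::vector<std::vector<float>>& perRank) {
  std::vector<float> sum(perRank[0].size(), 0.f);
  std::mutex m;
  std::vector<std::thread> workers;
  for (size_t r = 0; r < perRank.size(); ++r)  // reduce phase
    workers.emplace_back([&, r] {
      std::lock_guard<std::mutex> lock(m);
      for (size_t i = 0; i < perRank[r].size(); ++i)
        sum[i] += perRank[r][i];
    });
  for (auto& t : workers)
    t.join();
  for (auto& local : perRank)  // "broadcast" phase: all ranks get the sum
    local = sum;
}
```

With MPI, the reduce and broadcast phases collapse into one MPI_Allreduce call per process, which is why the single-threaded-per-process case stays simple.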
What do you propose?
I thought it made more sense to keep separate communicators, but I admit there is more code duplication than I would like. I can try to inherit from the NCCL communicator, but it has a lot of CUDA-specific code that would need to be hidden away. Furthermore, Intel provides oneCCL, an abstraction over MPI that supports more datatypes than MPI (but it only works with Intel MPI). I guess we're not interested in having more than one communication backend, as there would be very few interested users.
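One way to reduce the duplication without inheriting CUDA-specific code is a backend-agnostic interface with the collectives as the only virtual surface. A sketch under that assumption (ICommunicator and CpuMpiCommunicator are illustrative names, not Marian's actual classes):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical backend-neutral interface: shared training logic talks to
// this, while NCCL/MPI subclasses supply only the collective primitives.
struct ICommunicator {
  virtual ~ICommunicator() = default;
  // Every rank contributes `data`; every rank receives the element-wise sum.
  virtual void allReduceSum(std::vector<float>& data) = 0;
};

// Single-process stand-in: with one rank, all-reduce is the identity.
// A real CPU backend would call MPI_Allreduce(MPI_IN_PLACE, ...) here.
struct CpuMpiCommunicator : ICommunicator {
  void allReduceSum(std::vector<float>& data) override {
    (void)data;  // one rank: nothing to combine
  }
};
```

The trade-off is exactly the one discussed above: the interface stays small while both communicators exist side by side, at the cost of keeping two implementations of the big collectives.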
I'm OK with delaying the merge until I have a more complete implementation, as the code is fairly self-contained, and I find it unlikely that any changes to master would break it. (Unless you are planning some extra MPI work?)