Yiltan
I have a use case for this feature when implementing MPI Partitioned Point-to-Point Communication in MPI Libraries. See Section IV-A in [1] for details. I can see that we could...
I extracted the relevant code; it can be seen here: https://gist.github.com/Yiltan/648d19e8f6874b6c56222f1e07d47132 The worker progress that crashes is on line 291. This happens inconsistently; if it doesn't crash then we hang...
The CPU is an `Intel(R) Xeon(R) Gold 6338 CPU` and PCIe is Gen4 (confirmed with lspci that we have x16 and that each line is 16GT/s). IOMMU is disabled and...
@Hunter1016, it was never resolved; it ended up being a hardware limitation on the platform (check the nvidia-smi output).
I regularly write MPI code, so this shouldn't be too complicated to implement. I've started to look through the CPU version to get started. However, I do have questions regarding...
The MPI version of this is mostly working at this point. I've tested it up to 8 nodes. It reduces training by many hours. @karpathy Do you still...
Both pieces of advice were quite useful. I ran it a little more and the losses eventually converged. However, validation loss >> training loss, which suggests overfitting. - Do...
@Akshay-Venkatesh @bureddy I was wondering if this addition to the cuda_ipc module could be up for discussion?
The functional tests should create output that measures performance for each operation