9 comments by Yiltan

I have a use case for this feature when implementing MPI Partitioned Point-to-Point Communication in MPI libraries. See Section IV-A in [1] for details. I can see that we could...

I extracted the relevant code; it can be seen here: https://gist.github.com/Yiltan/648d19e8f6874b6c56222f1e07d47132 The worker progress that crashes is on line 291. This happens inconsistently; if it doesn't crash then we hang...

The CPU is an `Intel(R) Xeon(R) Gold 6338 CPU` and PCIe is Gen4 (confirmed with lspci that we have x16 and that each lane runs at 16 GT/s). IOMMU is disabled and...

@Hunter1016, it was never resolved; it turned out to be a hardware limitation of the platform (check the nvidia-smi output).

I regularly write MPI code, so this shouldn't be too complicated to implement. I've started to look through the CPU version to get started. However, I do have questions regarding...

![llm c_train](https://github.com/karpathy/llm.c/assets/9093579/8e99f22b-8b91-4ca9-a22e-91e6bbea2b5b) The MPI version of this is mostly working at this point. I've tested it on up to 8 nodes. It reduces training time by many hours. @karpathy Do you still...

The advice from both of you was quite useful. I ran it a little longer and the losses eventually converged. However, validation loss >> training loss, which suggests overfitting. - Do...

@Akshay-Venkatesh @bureddy I was wondering if this addition to the cuda_ipc module could be up for discussion?

The functional tests should create output that measures performance for each operation.