Tim
As observed [here](https://github.com/jeffhammond/stencil-demo#results), the CUDA stencil operation appears to be much slower than DPC++ across all block sizes on the NVIDIA device. I also ran the problem (8000 grid points, 100 iterations)...
If I'm not mistaken, `MPIX_Waitall_enqueue()` would crash if the array contains a regular `MPI_REQUEST_NULL`, due to a failed [assert](https://github.com/pmodels/mpich/blob/895e74a6fd81a27ac6c07734fe465a0b464daf62/src/mpi/stream/stream_impl.c#L768). Conventionally, given a list of communication tasks and an array of requests, we...
## Details ___Do not mention proprietary info or link to internal work items in this PR.___ **Work item:** Internal **What were the changes?** Added core binding info after communicator init...
## Details Revised the RCCL Replayer and added structured logging. **Work item:** Internal **What were the changes?** - Added the replayer as a part of the RCCL workload which **captures** RCCL API calls,...