Sergey Lebedev

Results 12 issues of Sergey Lebedev

## What
Optimization of CUDA executor copy task
## How ?
Manual loop unrolling
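For illustration, manual loop unrolling of a copy loop can be sketched in plain C as below. This is not the UCC executor kernel: `copy_unrolled` and the unroll factor of 4 are hypothetical, chosen only to show the pattern of a wide main loop followed by a scalar tail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unroll factor, picked for illustration only. */
#define UNROLL 4

/* Hypothetical helper (not part of UCC): copies n floats from src to dst,
 * handling UNROLL elements per iteration of the main loop. */
static void copy_unrolled(float *dst, const float *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Main loop: issue four independent loads, then four stores,
     * so the loads do not wait on the preceding stores. */
    for (; i + UNROLL <= n; i += UNROLL) {
        float a = src[i];
        float b = src[i + 1];
        float c = src[i + 2];
        float d = src[i + 3];

        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail loop: copy the remaining n % UNROLL elements one by one. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

In a GPU kernel the same idea would typically be applied per thread, with each thread handling several elements per iteration; the actual factor and data type used in the PR are not stated here.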

Ready-for-Review

## What
Reimplement the TL/CUDA reduce scatter ring algorithm using the multireduce executor operation.
## Why ?
Better bandwidth utilization and support for an arbitrary number of rings.

Ready-for-Review

In PR #84 we are adding support for NCCL TL. If UCC was built with NCCL support, TL NCCL might be selected by CLs for CUDA collectives, i.e. when both...

## What
Use nonblocking memory copy for non-inplace allgather
## Why ?
Improves performance of allgather ring by running self memory copy operations concurrently with communication

## What
Extends the collective plugin interface (https://github.com/openucx/ucc/pull/156) with functions to create and destroy a custom plugin context. Adds a fully functional example of knomial allreduce using UCX active messages.

## What
Lazily initialize TL NCCL and TL CUDA on first CUDA collective.
## Why ?
Both NCCL and CUDA require CUDA devices to be set before team create. In...
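The lazy initialization pattern described above can be sketched in plain C as follows. All names here (`lazy_tl_t`, `lazy_tl_init`, `lazy_tl_collective`) are hypothetical and do not correspond to UCC, NCCL, or CUDA APIs; the sketch only assumes that setup is deferred from team create to the first collective call, by which time the caller has selected a device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a transport layer whose setup depends on
 * the device that is current at the time of the first collective. */
typedef struct lazy_tl {
    bool initialized; /* set once the deferred setup has run */
    int  init_calls;  /* how many times the setup actually ran */
    int  device;      /* device captured during the deferred setup */
} lazy_tl_t;

/* Deferred setup: in a real TL this is where device-dependent
 * resources would be created. */
static void lazy_tl_init(lazy_tl_t *tl, int current_device)
{
    tl->device      = current_device;
    tl->init_calls += 1;
    tl->initialized = true;
}

/* Entry point for a collective: runs the setup on first use only and
 * returns the device the TL is bound to. */
static int lazy_tl_collective(lazy_tl_t *tl, int current_device)
{
    assert(tl != NULL);
    if (!tl->initialized) {
        lazy_tl_init(tl, current_device);
    }
    return tl->device;
}
```

With this shape, creating the object costs nothing and does not require a device to be set; the first call to `lazy_tl_collective` pays the setup cost, and later calls skip it.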

## What
CL HIER should report global status on team create.
## Why ?
The selection table may differ across ranks if a rank considers local status...

## What
Add reduce scatter knomial algorithm in TL/UCP

Performance: 8 nodes, 64 ppn

msgsize | knomial us. | ring us.
-- | -- | --
4 | 29.45 |...

Ready-for-Review

## What
Adding reduce to list of supported colls in CL HIER.
## Why ?
UCC_COLL_TYPE_REDUCE was missing. CL HIER never selects 2step reduce algorithm.