build: UCC integration

Open nirandaperera opened this issue 2 years ago • 8 comments

References:

  1. UCC
  2. torch-ucc
  3. torch-ucc fb research

Note: UCX 1.11 or newer is required (the current conda package is 1.12, which works).

Roadmap:

  • [x] Build UCC and UCX locally - Tested with conda ucx installation

  • [x] Incorporate UCC to Cylon as a part of UCX build. Allow UCC libs and headers to be provided externally (-DCYLON_UCX=ON -DUCC_PREFIX=<ucc install path>)

  • [x] Add UCC to current UCX context (currently MPI is used to spawn processes. This would be an easy entry point) - Use torch-ucc as a reference impl (these were resolved by #591)

  • [ ] #595

  • [ ] #594

nirandaperera avatar Mar 07 '22 17:03 nirandaperera

@nirandaperera I did the first step too. Building along with MPI didn't work. I reported it here: https://github.com/openucx/ucc/issues/436

vibhatha avatar Mar 08 '22 09:03 vibhatha

They've now added a comprehensive example, which we can directly use. https://github.com/openucx/ucc/wiki/UCC-Allreduce-example

nirandaperera avatar Mar 11 '22 16:03 nirandaperera

fyi, torch_ucc was moved to another repo https://github.com/facebookresearch/torch_ucc

Sergei-Lebedev avatar Mar 15 '22 07:03 Sergei-Lebedev

Thanks a lot for the pointer @Sergei-Lebedev

vibhatha avatar Mar 15 '22 07:03 vibhatha

@Sergei-Lebedev, on a side note, dist.new_group() with UCC might also benefit from this PR of mine: "Pass group ranks and options to third party distributed backends" (pytorch/pytorch #73164) https://github.com/pytorch/pytorch/pull/73164

It fixes the subranks info that is currently missing between distributed_c10d.py and PyTorch 3rd party distributed backends. UCC is the only other 3rd party distributed backend I've seen so far, so if you can give some feedback, that'll be great.

Saliya

-- Saliya Ekanayake, Ph.D Cloud Accelerated Systems & Technologies (CAST) Microsoft

esaliya avatar Mar 15 '22 18:03 esaliya

Hi @esaliya, subranks info might be useful, but I think it can be reconstructed using the prefix store without adding any options to the PG constructor. In UCC we don't need this info because the UCC team allgather is used instead. The bigger challenge for us is that the PyTorch world group is not strictly defined: for instance, it is allowed to create the default PG with backend A and then create a subgroup with backend B (see the example below). Because of this, it is hard to share resources within UCC and stay fully compatible with PyTorch semantics.

import os
import torch
import torch.distributed as dist
import torch_ucc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12345'
os.environ['RANK']        = os.environ['OMPI_COMM_WORLD_RANK']
os.environ['WORLD_SIZE']  = os.environ['OMPI_COMM_WORLD_SIZE']

dist.init_process_group('gloo')
sg = dist.new_group(ranks=[0, 1], backend='ucc')
if dist.get_rank() in [0, 1]:
  sg.barrier()
dist.barrier()
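
The prefix-store idea mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: `DictStore` and `PrefixedStore` are hypothetical in-memory stand-ins for a shared key-value store (they are not PyTorch classes), and the key names are made up. Each subgroup member publishes its global rank under its group-local rank, and any member can then read the full list back.

```python
# Sketch only: DictStore and PrefixedStore are hypothetical in-memory stand-ins
# for a shared key-value store; they are not PyTorch classes.

class DictStore:
    """Minimal key-value store shared by all simulated processes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class PrefixedStore:
    """Namespaces every key so each subgroup only sees its own entries."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")


def publish_rank(store, group_rank, global_rank):
    """Each member records its global rank under its group-local rank."""
    store.set(f"rank_{group_rank}", str(global_rank))


def gather_ranks(store, group_size):
    """Any member can rebuild the ordered list of global ranks."""
    return [int(store.get(f"rank_{i}")) for i in range(group_size)]


# Simulate a subgroup made of global ranks 2 and 5 (hypothetical values).
base = DictStore()
group_store = PrefixedStore("group_1", base)
for group_rank, global_rank in enumerate([2, 5]):
    publish_rank(group_store, group_rank, global_rank)

subranks = gather_ranks(group_store, group_size=2)
print(subranks)
```

In a real distributed store, reads would have to wait until every member has written its key; the in-memory version skips that because all writes happen before the read.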

Sergei-Lebedev avatar Mar 16 '22 09:03 Sergei-Lebedev

@kaiyingshan these are the steps needed to build UCC.

  • Install conda (miniconda would be the easiest)
  • Create a conda env using conda/environments/cylon.yml (this will install ucx 1.12 to the environment)
  • Install UCC as follows
    git clone --single-branch -b v1.0.0 https://github.com/openucx/ucc.git $HOME/ucc
    cd $HOME/ucc
    ./autogen.sh
    ./configure --prefix=$HOME/ucc/install --with-ucx=$CONDA/envs/cylon_dev
    make install
  • Build cylon with UCX and UCC
    python build.py -cmake-flags="-DCYLON_UCX=1 -DCYLON_UCC=1 -DUCC_INSTALL_PREFIX=$HOME/ucc/install" -ipath="$HOME/cylon/install" --cpp --python --test

If you are running ucc_example.cpp locally, make sure to add conda libs and UCC libs to the LD_LIBRARY_PATH
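
One way to set that up from a launcher script is sketched below. It assumes the shared libraries sit in a `lib` directory under the UCC install prefix and under the conda env used above, which may not match every setup; `prepend_library_dirs` is a hypothetical helper, not part of Cylon.

```python
import os


def prepend_library_dirs(env, lib_dirs):
    """Return a copy of env with lib_dirs placed ahead of LD_LIBRARY_PATH."""
    updated = dict(env)
    current = updated.get("LD_LIBRARY_PATH", "")
    existing = [p for p in current.split(os.pathsep) if p]
    # Skip directories that are already on the path.
    new_dirs = [d for d in lib_dirs if d not in existing]
    updated["LD_LIBRARY_PATH"] = os.pathsep.join(new_dirs + existing)
    return updated


# Assumed layout: <prefix>/lib for both the UCC install and the conda env.
home = os.environ.get("HOME", "")
conda = os.environ.get("CONDA", "")
lib_dirs = [
    os.path.join(home, "ucc", "install", "lib"),
    os.path.join(conda, "envs", "cylon_dev", "lib"),
]

run_env = prepend_library_dirs(os.environ, lib_dirs)
print(run_env["LD_LIBRARY_PATH"])
```

The resulting `run_env` can be passed as the `env` argument of `subprocess.run` when launching the example binary.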

nirandaperera avatar May 29 '22 17:05 nirandaperera

It seems to fail to build nondeterministically on my computer, maybe because I'm using WSL. I'll try to figure out the cause.

kaiyingshan avatar May 31 '22 22:05 kaiyingshan