ROCm support
- add hipify at configure time
- ROCm specific code paths behind USE_ROCM guards
- support for wavefront 32 (Navi) and 64 (MI)
- use builtins to match inline PTX
- support C API on ROCm
- support Python API on ROCm
This supersedes #3126 and addresses #3231
As discussed in #3231, @ItsPitt (Michael Pittard) and @iotamudelta (myself) can maintain the ROCm backend. AMD can provide two MI200-class servers for CI, with an arrangement similar to the PyTorch CI.
Given the size of this work and the difficulty of reviewing it, we'd be more than happy to split it into smaller PRs based on feedback on what such PRs should encompass. Thanks!
Assuming this is WIP because the Python tests are not there yet. Otherwise I can write them, but that could take some time. Sorry for insisting, but many people (especially in the ML community) know Faiss only via Python.
@mdouze thanks a lot for having a look! Yes, we can figure out the Python support prior to getting it merged.
Are there any other immediate issues you see that we should be working on? Do you prefer getting this all merged as one big PR or split up (if so: how)?
Again - thanks a lot!
We are looking into compiling this in the CI @ramilbakhshyiev
@mdouze we sorted the Python support out. I am not sure if the Windows CI target failing after my latest merge of main is real or not.
@ramilbakhshyiev anything we can help with w.r.t. getting this into CI? Thanks!
@iotamudelta We will be trying this soon, our plan is to use a g4ad instance from AWS. Once we try it on that, we will most likely come back with some feedback.
@ramilbakhshyiev I'm not sure g4ad will work - according to https://aws.amazon.com/ec2/instance-types/ , it features an AMD Radeon Pro V520 with the RDNA1 architecture, which is not supported by ROCm (see: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html )
@iotamudelta Is there any plan to support it? I think AWS is a widely used platform, and g4ad seems to be its only AMD GPU instance type, right?
@ramilbakhshyiev I do not know about support plans. ROCm does support RDNA2 and RDNA3 GPUs.
Based on some of my very high-level research, it seems that some users still get ROCm on g4ad even though it's unsupported by AMD. Do you have any opinion on using that to build this in our CI? If we don't have an option to build and test on g4ad then I believe that our only option is to use CI only for builds.
@ramilbakhshyiev it's hard to answer this - a build CI is a must-have, I think. If we can get started with that, it'd be a huge step forward. We can try to test on the AWS instance, but we cannot guarantee correctness, stability, or performance on it.
As stated earlier, we could also use two of the PyTorch CI MI200-class machines, but it sounds like that would be too much effort to integrate?
@iotamudelta Hey, apologies for missing that earlier offer - I missed that context when picking this up. I think getting AMD to provide backends would work great, and I think we can support the integration. I have glanced through #3231 and PyTorch but could not figure out how AMD servers register runners with GitHub. Any chance you can point me to that? I assume you're using PyTorch's GH App to retrieve API tokens and register runners.
@iotamudelta @ItsPitt
@ramilbakhshyiev we'll check on our side how the integration works in the PyTorch CI - if you could check on yours as well to connect the dots?
In the meantime, getting even the build functionality done would allow us to make some progress with integrating the code support.
@iotamudelta Pinged folks on our end regarding the PyTorch setup.
Re: build-only setup, I tried building this PR and ran into conda dependency issues: https://github.com/facebookresearch/faiss/actions/runs/9863975175/job/27237915065?pr=3622
Is installing the HIP compiler from conda-forge best practice, or do you have your own repository that I should be using?
@iotamudelta I have learned how we can register the runners you provide. Just to confirm: would these machines be continuously running? Would we get SSH access to them to configure them or would you need to configure them yourself?
@ramilbakhshyiev concerning setup - we'd prefer the solution PyTorch employs in their docker containers where it pulls directly from repo.radeon.com and versions the docker image w/ our ROCm version.
Concerning the machines, they'd be continuously running (unless there are HW issues or servicing, ofc). Let me check about ssh access.
@iotamudelta All of the Faiss builds happen directly on the machines. Do you know if there is a way to install it using conda? If not, and we start with a container image, do you envision any issues if we install the rest with conda as usual?
@ramilbakhshyiev using our repos for the ROCm stack and rest from conda should compose, that's how PyTorch is setting things up: https://github.com/pytorch/pytorch/blob/main/.ci/docker/ubuntu-rocm/Dockerfile
@iotamudelta I think this is the code that sets up ROCm inside containers: https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_rocm.sh#L9. Since we use ephemeral build machines for now, I will look into starting without introducing containers, and we can start using containers once we build on the machines you provide, since those will be long-running. Wdyt?
@iotamudelta And another question to confirm: for build-only setup, we do not need physical GPUs on the machines, right? That's the case with our Nvidia builds but I just wanted to confirm that this is true for ROCm before we try to build it on a runner that doesn't have AMD GPUs.
@ramilbakhshyiev sounds good - I've reached out to our DevOps team about these machines.
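And to confirm your GPU question: correct, compilation does not need a physical GPU - build hosts only need the ROCm toolchain, since the device architecture is fixed at compile time via `--offload-arch`. A rough sketch (kernel.hip is a hypothetical file name; guarded so it degrades gracefully where hipcc isn't installed):

```shell
# Compile-only: hipcc needs no GPU; the target arch is chosen at compile time.
# kernel.hip is a hypothetical source file, used here only for illustration.
if command -v hipcc >/dev/null 2>&1 && [ -f kernel.hip ]; then
  hipcc --offload-arch=gfx90a -c kernel.hip -o kernel.o  # works on a GPU-less builder
else
  echo "no hipcc/source here - build hosts just need the ROCm toolchain installed"
fi
```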
Concerning the build: there's a workaround where we can populate /opt/rocm/bin/target.lst with the targeted architecture(s) and have hipcc pick that up, e.g.:

```
gfx900
gfx906
gfx908
gfx90a
gfx1030
gfx1100
```

For MI200 we'd only need to target gfx90a (to limit finalization time). If we wanted to build a release wheel, we would extend the list as needed.
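Concretely, target.lst is just a plain-text file with one gfx target per line. A minimal sketch (using a temp dir for illustration; the real path is /opt/rocm/bin/target.lst and needs root):

```shell
# target.lst is a plain list of offload architectures, one per line.
# Real location: /opt/rocm/bin/target.lst (root-owned); a temp dir is used here.
tmp=$(mktemp -d)
printf 'gfx90a\n' > "$tmp/target.lst"   # MI200-class only, to keep finalization time down
cat "$tmp/target.lst"
```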
@iotamudelta I am working on the build-only configuration for now and running into issues. I'm using the PyTorch configuration as a starting point and adapted it to our CI builds. Here are the issues I am seeing: https://github.com/facebookresearch/faiss/actions/runs/9882731440/job/27296164928?pr=3622
Here's the step I added to install ROCm: https://github.com/facebookresearch/faiss/pull/3622/files#diff-c5dee0272afe1826fd7f0936f07026a8c97865bd18ae5234490d22c4b0800800R56-R107
Can you please take a look and see if you have any pointers for what I could try next? I saw similar issues for Ubuntu 22.04 when it was just released but I believe ROCm 6.1 is fully supported on 22.04 now.
@ramilbakhshyiev I have been using ROCm 6.1 on Ubuntu 22.04 for this effort. I have yet to run into any issues. You probably won't need some of the extra steps in that config. Here is the official install for ROCm 6.1.2: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
It looks pretty close. You might just need to add the latest 6.1.2 version number and make a few changes. Here are the lines I use in my docker:
```shell
mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.1.2 jammy main' | sudo tee /etc/apt/sources.list.d/rocm.list
apt update && apt install -y rocm-dev6.1.2 rocm-libs6.1.2
```
The miopen and amdgpu stuff can probably be removed. And the build target looks good. I hope this helps.
@ItsPitt Thank you! I think there is progress but it fails further down the line.
- Can you please check if the installation is sufficient or if we are missing something? latest build files
- Can you please see what the next set of errors is and how to tackle those? build output
@iotamudelta @ItsPitt I noticed a hipify.sh file being introduced that creates the faiss/gpu-rocm subdir. Does it need to be called explicitly?
@ramilbakhshyiev Yes, it will need to be run before the cmake step. I run it from the top-level faiss/ directory:

```shell
faiss/gpu/hipify.sh
```

This will make a HIP version of the gpu directory. It will also run hipify on c_api for support.
For your flags, you should only need -DFAISS_ENABLE_GPU=ON. No need for -DFAISS_ENABLE_ROCM=ON.
@ItsPitt I introduced the ENABLE_ROCM to be explicit to follow current patterns in the library. Happy to discuss now or when we have a PR to enable ROCm in CI.
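For completeness, here is the local build order as I understand it from this thread - hipify first, then cmake. Shown as a dry run that only prints the commands; swap the echo for `"$@"` inside a real faiss checkout:

```shell
# Dry-run sketch of the build order: hipify before cmake (per this thread).
# 'run' just prints each command; replace its body with "$@" to execute for real.
run() { echo "+ $*"; }
run ./faiss/gpu/hipify.sh                                   # generates faiss/gpu-rocm (and hipifies c_api)
run cmake -B build -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_ROCM=ON .
run cmake --build build -j
```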
I did add a step to call hipify.sh, but the HIP module error persists. Can you please take a look and let me know if you have any thoughts? https://github.com/facebookresearch/faiss/actions/runs/9911542730/job/27384449480?pr=3622