ROCm support
- add hipify at configure time
- ROCm specific code paths behind USE_ROCM guards
- support for wavefront 32 (Navi) and 64 (MI)
- use builtins to match inline PTX
- support C API on ROCm
- support Python API on ROCm
This supersedes #3126 and addresses #3231
As discussed in #3231, @ItsPitt (Michael Pittard) and @iotamudelta (myself) can maintain the ROCm backend. AMD can provide two MI200-class servers for CI, with an arrangement similar to the PyTorch CI.
Given the size of this work and the difficulty of reviewing it, we'd be more than happy to split it into smaller PRs based on feedback on what such PRs should encompass. Thanks!
Assuming this is WIP because the Python tests are not there yet. Otherwise I can write them, but that could take some time. Sorry for insisting, but many people (especially in the ML community) know Faiss only via Python.
@mdouze thanks a lot for having a look! Yes, we can figure out the Python support prior to getting it merged.
Are there any other immediate issues you see that we should be working on? Do you prefer getting this all merged as one big PR or split up (if so: how)?
Again - thanks a lot!
We are looking into compiling this in the CI @ramilbakhshyiev
@mdouze we sorted the Python support out. I am not sure if the Windows CI target failing after my latest merge of main is real or not.
@ramilbakhshyiev anything we can help with w.r.t. getting this into CI? Thanks!
@iotamudelta We will be trying this soon, our plan is to use a g4ad instance from AWS. Once we try it on that, we will most likely come back with some feedback.
@ramilbakhshyiev I'm not sure g4ad will work - according to https://aws.amazon.com/ec2/instance-types/ , it features an AMD Radeon Pro V520 with the RDNA1 architecture, which is not supported by ROCm (see: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html )
@iotamudelta Is there any plan to support it? I think AWS is a widely used platform, and g4ad seems to be its only AMD GPU instance type, right?
@ramilbakhshyiev I do not know about support plans. ROCm does support RDNA2 and RDNA3 GPUs.
Based on some of my very high-level research, it seems that some users still get ROCm on g4ad even though it's unsupported by AMD. Do you have any opinion on using that to build this in our CI? If we don't have an option to build and test on g4ad then I believe that our only option is to use CI only for builds.
@ramilbakhshyiev it's hard to answer this - a build CI is a must-have, I think. If we can get started with that, it'd be a huge step forward. We can try to test on the AWS instance, but we cannot guarantee correctness, stability, or performance on it.
As stated earlier, we could also use two of the PyTorch CI MI200-class machines, but it sounds like that would be too much effort to integrate?
@iotamudelta Hey, apologies for missing that earlier offer - I missed that context when picking this up. I think getting AMD to provide backends would work great, and I think we can support the integration. I have glanced through #3231 and PyTorch but could not figure out how AMD servers register runners with GitHub. Any chance you can point me to that? I assume you're using PyTorch's GH App to retrieve API tokens and register runners.
@iotamudelta @ItsPitt
@ramilbakhshyiev we'll check on our side how the integration works in the PyTorch CI - if you could check on yours as well to connect the dots?
In the meantime, getting even the build functionality done would allow us to make some progress with integrating the code support.
@iotamudelta Pinged folks on our end regarding the PyTorch setup.
Re: build-only setup, I tried building this PR and ran into conda dependency issues: https://github.com/facebookresearch/faiss/actions/runs/9863975175/job/27237915065?pr=3622
Is installing the HIP compiler from conda-forge best practice, or do you have your own repository that I should be using?
@iotamudelta I have learned how we can register the runners you provide. Just to confirm: would these machines be continuously running? Would we get SSH access to them to configure them or would you need to configure them yourself?
@ramilbakhshyiev concerning setup - we'd prefer the solution PyTorch employs in their docker containers where it pulls directly from repo.radeon.com and versions the docker image w/ our ROCm version.
Concerning the machines, they'd be continuously running (unless there are HW issues or servicing, ofc). Let me check about ssh access.
@iotamudelta All of the Faiss builds happen directly on the machines. Do you know if there is a way to install it using conda? If not, and we start with a container image, do you envision any issues if we install the rest with conda as usual?
@ramilbakhshyiev using our repos for the ROCm stack and rest from conda should compose, that's how PyTorch is setting things up: https://github.com/pytorch/pytorch/blob/main/.ci/docker/ubuntu-rocm/Dockerfile
@iotamudelta I think this is the code that sets up ROCm inside containers: https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_rocm.sh#L9. Since we use ephemeral build machines for now, I will look into starting without introducing containers, and we can start using containers once we build on the machines you provide, since those will be long-running. Wdyt?
@iotamudelta And another question to confirm: for build-only setup, we do not need physical GPUs on the machines, right? That's the case with our Nvidia builds but I just wanted to confirm that this is true for ROCm before we try to build it on a runner that doesn't have AMD GPUs.
@ramilbakhshyiev sounds good - I've reached out to our DevOps team about these machines.
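And to confirm your GPU question: correct, compilation does not need a physical GPU - build hosts only need the ROCm toolchain, since the device architecture is fixed at compile time via `--offload-arch`. A rough sketch (kernel.hip is a hypothetical file name; guarded so it degrades gracefully where hipcc isn't installed):

```shell
# Compile-only: hipcc needs no GPU; the target arch is chosen at compile time.
# kernel.hip is a hypothetical source file, used here only for illustration.
if command -v hipcc >/dev/null 2>&1 && [ -f kernel.hip ]; then
  hipcc --offload-arch=gfx90a -c kernel.hip -o kernel.o  # works on a GPU-less builder
else
  echo "no hipcc/source here - build hosts just need the ROCm toolchain installed"
fi
```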
Concerning the build: there's a workaround where we can populate /opt/rocm/bin/target.lst with the targeted architecture(s) and have hipcc pick that up, e.g.:

```
gfx900
gfx906
gfx908
gfx90a
gfx1030
gfx1100
```

For MI200 we'd only need to target gfx90a (to limit finalization time). If we wanted to build a release wheel, we would extend the list as needed.
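Concretely, target.lst is just a plain-text file with one gfx target per line. A minimal sketch (using a temp dir for illustration; the real path is /opt/rocm/bin/target.lst and needs root):

```shell
# target.lst is a plain list of offload architectures, one per line.
# Real location: /opt/rocm/bin/target.lst (root-owned); a temp dir is used here.
tmp=$(mktemp -d)
printf 'gfx90a\n' > "$tmp/target.lst"   # MI200-class only, to keep finalization time down
cat "$tmp/target.lst"
```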
@iotamudelta I am working on the build-only configuration for now and running into issues. I'm using the PyTorch configuration as a starting point and adapted it to our CI builds. Here are the issues I am seeing: https://github.com/facebookresearch/faiss/actions/runs/9882731440/job/27296164928?pr=3622
Here's the step I added to install ROCm: https://github.com/facebookresearch/faiss/pull/3622/files#diff-c5dee0272afe1826fd7f0936f07026a8c97865bd18ae5234490d22c4b0800800R56-R107
Can you please take a look and see if you have any pointers for what I could try next? I saw similar issues for Ubuntu 22.04 when it was just released but I believe ROCm 6.1 is fully supported on 22.04 now.
@ramilbakhshyiev I have been using ROCm 6.1 on Ubuntu 22.04 for this effort. I have yet to run into any issues. You probably won't need some of the extra steps in that config. Here is the official install for ROCm 6.1.2: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
It looks pretty close. You might just need to add the latest 6.1.2 version number and make a few changes. Here are the lines I use in my docker:
```shell
mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.1.2 jammy main' | sudo tee /etc/apt/sources.list.d/rocm.list
apt update && apt install -y rocm-dev6.1.2 rocm-libs6.1.2
```
The miopen and amdgpu stuff can probably be removed. And the build target looks good. I hope this helps.
@ItsPitt Thank you! I think there is progress but it fails further down the line.
- Can you please check if the installation is sufficient or if we are missing something? latest build files
- Can you please see what the next set of errors is and how to tackle those? build output
@iotamudelta @ItsPitt I noticed a hipify.sh file being introduced that creates the faiss/gpu-rocm subdir. Does it need to be called explicitly?
@ramilbakhshyiev Yes, it will need to be run before the cmake step. I run it from the top-level faiss/ directory:

```shell
faiss/gpu/hipify.sh
```

This will make a HIP version of the gpu directory. It will also run hipify on c_api for support.
For your flags, you should only need -DFAISS_ENABLE_GPU=ON. No need for -DFAISS_ENABLE_ROCM=ON.
@ItsPitt I introduced the ENABLE_ROCM to be explicit to follow current patterns in the library. Happy to discuss now or when we have a PR to enable ROCm in CI.
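For completeness, here is the local build order as I understand it from this thread - hipify first, then cmake. Shown as a dry run that only prints the commands; swap the echo for `"$@"` inside a real faiss checkout:

```shell
# Dry-run sketch of the build order: hipify before cmake (per this thread).
# 'run' just prints each command; replace its body with "$@" to execute for real.
run() { echo "+ $*"; }
run ./faiss/gpu/hipify.sh                                   # generates faiss/gpu-rocm (and hipifies c_api)
run cmake -B build -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_ROCM=ON .
run cmake --build build -j
```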
I did add a step to call hipify.sh, but the HIP module error persists. Can you please take a look and let me know if you have any thoughts? https://github.com/facebookresearch/faiss/actions/runs/9911542730/job/27384449480?pr=3622