
Supporting ARM SVE, the newer extended vector instruction set for aarch64

Open vorj opened this issue 2 years ago • 5 comments

Summary

Dear @mdouze and all,

ARM SVE is a newer extended vector instruction set than NEON, supported on CPUs like AWS Graviton3 and Fujitsu A64fx. I've added SVE support to faiss along with some functions implemented with SVE, then compared their execution times. It seems that my implementation improves performance on some environments. This is just a first implementation to show the potential of SVE, and I plan to implement SVE versions of other functions that are not yet ported to SVE.

It might not be possible to test this on Circle CI at the moment; however, would you mind if I submit this as a PR?

Platform

OS: Ubuntu 22.04

Faiss version: a3296f42adee7a0159b7ac09d7642e862edb142f, and mine

Installed from: compiled by myself

Faiss compilation options: cmake -B build -DFAISS_ENABLE_GPU=OFF -DPython_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DFAISS_OPT_LEVEL=sve (-DFAISS_OPT_LEVEL=sve is a new opt level introduced by my changes)
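For reference, a minimal sketch of how an SVE-specific code path is commonly guarded at compile time (illustrative only; the actual wiring introduced by this opt level may differ). The compiler defines __ARM_FEATURE_SVE when the target supports SVE, e.g. with -march=armv8-a+sve:

```cpp
// Illustrative only: compile SVE intrinsics behind the compiler-provided
// __ARM_FEATURE_SVE macro so that non-SVE builds remain unaffected.
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// Example: number of 32-bit lanes in one SVE vector on the current CPU.
inline unsigned sve_f32_lanes() {
    return static_cast<unsigned>(svcntw());
}
#endif
```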

Running on:

  • [x] CPU
  • [ ] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

I am only posting the results for searching SIFT1M. If you need more detailed information, please let me know.

Benchmark result

  • Evaluated by running faiss on an AWS EC2 c7g.large instance
  • original is the current (a3296f42adee7a0159b7ac09d7642e862edb142f) implementation
  • SVE is the result of my implementation supporting ARM SVE

[image: speedup ratio chart]

The above image illustrates the speedup ratio of the SVE implementation relative to the original.

  • In the best case, SVE is approx. 2.26x faster than the original (IndexIVFPQ + IndexHNSWFlat, M: 32, nprobe: 16)
    • original : 0.618 ms
    • SVE : 0.274 ms

vorj avatar May 31 '23 01:05 vorj

Thanks for looking into this! Do I understand correctly that this is with a 512-bit SIMD width? Indeed we should have a way to integrate code for hardware that is not supported by CircleCI (AVX512 being the other example). So we welcome a PR for this functionality.

mdouze avatar May 31 '23 05:05 mdouze

@mdouze

Do I understand correctly that this is with a 512-bit SIMD width?

SVE is an abbreviation of Scalable Vector Extension. In this context, scalable means that the vector length is not fixed by the instruction set. The vector length is determined by each CPU; for example, the A64fx has 512-bit SVE registers, while Graviton3 has 256-bit SVE registers. So the programmer should write length-independent code, and the binary will then work on each CPU by detecting the real vector length at run time. The length of an SVE register is 128*n bits, in the range of [128, 2048] bits.
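A minimal sketch of such length-independent code with ACLE SVE intrinsics (illustrative only, not the actual faiss code; the function name inner_product_sve is made up for this example): the loop step and the tail predicate are taken from the hardware at run time, so the same binary works for any of those vector lengths.

```cpp
#include <arm_sve.h>
#include <cstddef>

// Vector-length-agnostic inner product: no constant vector width appears in the code.
float inner_product_sve(const float* x, const float* y, size_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    const size_t step = svcntw(); // 32-bit lanes per SVE vector, known only at run time
    for (size_t i = 0; i < n; i += step) {
        // Predicate activates lanes [i, n); it also handles the loop tail.
        svbool_t pg = svwhilelt_b32_u64(i, n);
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        acc = svmla_f32_m(pg, acc, vx, vy); // acc += vx * vy on active lanes
    }
    return svaddv_f32(svptrue_b32(), acc); // horizontal sum of all lanes
}
```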

So we welcome a PR for this functionality.

I'm glad to hear that! :smile: I will make the PRs later.

vorj avatar May 31 '23 05:05 vorj

@vorj thanks for your PR! I have a couple of questions, just to get some knowledge of SVE.

  1. Do I understand correctly that if, say, the SVE vector length is 512 bits, it will still be possible to perform evaluations on 256 bits and 128 bits, just like AVX-512 extends AVX2, which extends AVX?
  2. Do I get it right that most of the speedup comes from faster distance computation (the fvec_L2sqr_* and fvec_inner_product_* functions)?

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.

alexanderguzhva avatar May 31 '23 13:05 alexanderguzhva

@alexanderguzhva To answer your questions:

  1. Suppose the vector length is 512 bits and consider a 32-bit element type (so the vector is used as 16 x 32-bit elements). Below, I will represent the mask as {mask0, mask1, ..., mask15}. If you pass the mask {1, 1, 1, 1, 0, 0, 0, 0, ..., 0}, you can load/calculate/store 4 x 32-bit (= 128-bit) data; {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0} is for 256 bits (see the sketch after this list). Of course, this will be slower than using the full length. As another option, you can still use Advanced SIMD (NEON) as the 128/64-bit SIMD instruction set. When you use it, you need to write a peel loop or something similar for data whose length is not a multiple of 4, in the same manner as before.
  2. At least in this PR, almost yes. I plan to make another PR, which will contain SVE implementations of code_distance and exhaustive_L2sqr_blas.
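To make point 1 concrete, a hypothetical sketch (not code from the PR): on a CPU whose SVE vectors are wider than 128 bits, a predicate with only the first four 32-bit lanes active makes the load/add/store behave like a 128-bit (4 x float) operation.

```cpp
#include <arm_sve.h>

// Add four floats element-wise using a partial predicate, i.e. the mask
// {1, 1, 1, 1, 0, ..., 0} described above, regardless of the real vector length.
void add4_via_partial_predicate(const float* a, const float* b, float* out) {
    svbool_t pg128 = svwhilelt_b32_u64(0, 4); // lanes 0..3 active only
    svfloat32_t va = svld1_f32(pg128, a);     // loads only 4 floats
    svfloat32_t vb = svld1_f32(pg128, b);
    svfloat32_t vc = svadd_f32_m(pg128, va, vb);
    svst1_f32(pg128, out, vc);                // stores only 4 floats
}
```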

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.

Thank you! 😄

vorj avatar May 31 '23 15:05 vorj

is supported on CPUs like AWS Graviton3 and Fujitsu A64fx

And now Microsoft Azure's Cobalt

kunalspathak avatar Apr 25 '25 23:04 kunalspathak