
Enable Intel®-AMX/oneDNN to accelerate IndexFlatIP search

Open guangzegu opened this pull request 1 year ago • 33 comments

Description

Intel® AMX (Intel® Advanced Matrix Extensions) is an AI acceleration engine built into every core of 4th/5th Gen Intel® Xeon® Scalable processors. It is a set of programming extensions designed to enhance the performance of matrix operations. The Intel oneAPI Deep Neural Network Library (oneDNN) is an open-source performance library designed to accelerate deep learning frameworks on Intel architectures. oneDNN can leverage the efficient matrix-computation extensions provided by AMX to speed up deep learning workloads on Intel architectures, especially computation-intensive matrix operations.

Accelerated by oneDNN/AMX, IndexFlatIP search is 1.7x to 5x faster than the default inner_product in scenarios with a single query, dimensions ranging from 64 to 1024, and 1,000,000 vectors.

Accelerated by oneDNN/AMX, IndexFlatIP search is up to 4x faster than the BLAS inner_product in scenarios with 1,000 queries, dimensions ranging from 64 to 1024, and 1,000,000 vectors.

How to use

When invoking CMake, add the following option:

  • -DFAISS_ENABLE_DNNL=ON in order to enable support for oneDNN, which accelerates IndexFlatIP search (possible values are ON and OFF; the default is OFF).

To accelerate IndexFlatIP search with Intel® AMX/oneDNN, set FAISS_ENABLE_DNNL to ON and run on a 4th/5th Gen Intel® Xeon® Scalable processor; the exhaustive_inner_product_seq method will then be accelerated.
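
For example, a configure-and-build sequence might look like this (a sketch following the standard faiss CMake workflow; depending on where oneDNN is installed, you may also need to point CMake at it, e.g. via the DNNL_LIB argument mentioned in the review below):

cmake -B build . -DFAISS_ENABLE_DNNL=ON -DCMAKE_BUILD_TYPE=Release
make -C build -j faiss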

Co-authored-by: @xtangxtang [email protected]

guangzegu avatar Feb 27 '24 15:02 guangzegu

Hi @guangzegu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Feb 27 '24 15:02 facebook-github-bot

@guangzegu this patch is at an extremely early stage.

  1. There needs to be a description in the readme.txt file about how to set up oneAPI properly. For example, I needed to install dnnl, mkl and tbb, and then run source setvars.sh from the oneAPI root directory. Imagine that someone sets this up on a fresh machine or in a docker container.
  2. It needs to be mentioned how to set up DNNL_LIB in the cmake arguments.
  3. A unit test needs to be added that activates the execution path that you've added; see the sketch after this list. Basically, exhaustive search for IP has many if-then-else internal conditions and execution branches for various use cases (topk=1, topk=many, many query samples, few query samples, etc.). The effect of your patch needs to be measured in milliseconds.
  4. I tried to invoke the needed path, and whenever I invoke your code on an AWS M7i machine (Intel Xeon 4th gen), I see an exception from the test: could not create a primitive descriptor for an inner product forward propagation primitive. It is completely unclear what goes wrong. The amx_bf16 capability is enabled, which is seen in cat /proc/cpuinfo.
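
A minimal sketch of the kind of timing test point 3 asks for (sizes and names here are illustrative, not from the PR; the PR's own benchmarks used 1,000,000 vectors):

import time
import numpy as np
import faiss

d = 128
xb = np.random.rand(100_000, d).astype('float32')
index = faiss.IndexFlatIP(d)
index.add(xb)

for nq in (1, 1000):        # few vs. many query samples
    xq = np.random.rand(nq, d).astype('float32')
    for k in (1, 100):      # topk=1 vs. topk=many branches
        t0 = time.time()
        D, I = index.search(xq, k)
        print(f"nq={nq} k={k}: {(time.time() - t0) * 1000:.1f} ms")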

Thanks

@mdouze Is Intel Xeon 4th gen available for CI?

alexanderguzhva avatar Feb 29 '24 00:02 alexanderguzhva

@alexanderguzhva Thank you very much for your comments.

  1. I will add a description in the readme.txt file on configuring oneDNN to enable this feature. Indeed, the addition of unit tests needs to be carefully considered.
  2. I didn't run into the error could not create a primitive descriptor for an inner product forward propagation primitive in my environment. I didn't set the environment variables using oneAPI, but simply installed oneDNN separately from the community version. You can try referring to this link: https://oneapi-src.github.io/oneDNN/dev_guide_build.html. The version is v3.3+.
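
Following that guide, a from-source install is roughly as below (a sketch; the v3.3 tag matches the minimum version noted above, and the install prefix and build options are up to your environment):

git clone --branch v3.3 https://github.com/oneapi-src/oneDNN.git
cd oneDNN
cmake -B build -DCMAKE_BUILD_TYPE=Release
make -C build -j
sudo make -C build install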

guangzegu avatar Mar 05 '24 14:03 guangzegu

@alexanderguzhva

  1. I suppose the unit tests for this PR can be covered by faiss/tests/test_index.py.
  2. The new commits add some installation instructions and also improve the performance.
  3. You might try again with the latest changes. If there are any issues, please feel free to contact me.

guangzegu avatar Mar 26 '24 14:03 guangzegu

@guangzegu Thanks, I'll take a look

alexanderguzhva avatar Apr 06 '24 00:04 alexanderguzhva

@alexanderguzhva Great! How's it going? Have you run into any issues?

guangzegu avatar May 14 '24 05:05 guangzegu

@guangzegu Hi, it is still in my plans, together with https://github.com/zilliztech/knowhere/pull/535. Sorry that it is taking so long, I get constantly distracted :(

alexanderguzhva avatar May 24 '24 21:05 alexanderguzhva

We are looking into compiling this in the CI @ramilbakhshyiev

mdouze avatar May 28 '24 16:05 mdouze

@guangzegu Could you please rebase this? We can try a test CI build next and go from there. Thanks!

ramilbakhshyiev avatar May 30 '24 19:05 ramilbakhshyiev

@ramilbakhshyiev Sure, I will rebase it. Thanks!

guangzegu avatar Jun 19 '24 01:06 guangzegu

@alexanderguzhva No worries, I understand. Thank you for the update and for your efforts!

guangzegu avatar Jun 19 '24 01:06 guangzegu

Thanks @guangzegu! We will be trying this out soon.

ramilbakhshyiev avatar Jun 21 '24 22:06 ramilbakhshyiev

Hi @guangzegu and @ramilbakhshyiev I'm trying to build this PR on the github CI :)

@guangzegu I'm following the documentation you provided in the README to set this up: https://oneapi-src.github.io/oneDNN/dev_guide_build.html. It looks like the official doc does not point to a conda installation (this is how FAISS normally installs dependencies; folks, please correct me if I'm wrong here). I was able to find dnnl on conda and ended up setting it up as below (can you clarify if this is the right way to install the dependency and, if so, update the README?):

conda install -y -q conda-forge::onednn

If that is the case, I managed to get everything to build on CI, but we have a C++ unit test failing, complaining about a memleak (see the build log). Is this something you can reproduce locally and expect? The actual test case source code is here.

mengdilin avatar Jul 06 '24 02:07 mengdilin

@mengdilin Thank you for verifying this PR and uncovering potential issues :smile:. I'm going to try to reproduce this issue in my environment.

guangzegu avatar Jul 16 '24 13:07 guangzegu

Hi @guangzegu After combing through the PR, I'm not seeing anything obvious that would cause the memory leak (besides my nit comment), but obviously I will defer to you on the dnnl memory-management aspect. I ended up running the failing mem_leak test through valgrind (diffing the test results from the master commit vs. your PR), and it looks like your PR did not introduce any new leaks (valgrind produced consistent analyses for your PR and the master commit).
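
For reference, a valgrind run along these lines (the test binary path and gtest filter below are illustrative, not the exact ones from the CI job):

valgrind --leak-check=full --error-exitcode=1 ./build/tests/faiss_test --gtest_filter='*mem_leak*'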

We will look into the possibility of disabling this test or omitting it from the dnnl build to unblock merging your PR.

mengdilin avatar Jul 19 '24 22:07 mengdilin

@guangzegu After omitting the memory leak test from your PR, it looks like we have encountered precision issues in several unit tests involving inner product computation. Is this expected?

A source for one of the failing tests is https://github.com/facebookresearch/faiss/blob/34bbe5e540bbc4edd0de38cb98bf0a563b2bae45/tests/test_residual_quantizer.py#L694

The test failure stack trace looks like:

args = (<function assert_array_almost_equal.<locals>.compare at 0x741cf7102480>, array([[12.644228 , 12.541752 , 11.607426 , ...03604 ],
       [12.91586  , 12.849993 , 12.578976 , ..., 11.806257 , 11.71474  ,
        11.699309 ]], dtype=float32))
kwds = {'err_msg': '', 'header': 'Arrays are not almost equal to 5 decimals', 'precision': 5, 'verbose': True}
    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Arrays are not almost equal to 5 decimals
E           
E           Mismatched elements: 1226 / 1230 (99.7%)
E           Max absolute difference: 0.02308941
E           Max relative difference: 0.00268776
E            x: array([[12.64423, 12.54175, 11.60743, ..., 10.98963, 10.9623 , 10.89734],
E                  [ 6.23966,  6.20934,  6.11219, ...,  5.85792,  5.76734,  5.7244 ],
E                  [12.55453, 12.26167, 12.1587 , ..., 11.59533, 11.56127, 11.4444 ],...
E            y: array([[12.64321, 12.55013, 11.60776, ..., 10.98861, 10.96813, 10.89554],

You can reproduce the failure on your PR by cloning this PR https://github.com/facebookresearch/faiss/pull/3615 and running the following after compiling faiss with DNNL mode on:

cd build/faiss/python && path/to/bin/python setup.py install && pytest --junitxml=test-results/pytest/results.xml tests/test_*.py

mengdilin avatar Jul 29 '24 16:07 mengdilin

@asadoughi pointed out that it looks like this PR is trading off precision for speed; see https://github.com/facebookresearch/faiss/pull/3266/files#diff-9228cbbdef764c34694b0b5d637c05058ccc6c6b3279469a1b3421633e7feb3fR57

If that is the case, can you provide some tests covering the low-precision scenario? We can gate these tests behind an explicit flag.
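
A sketch of what such a tolerance-relaxed check could look like (the rtol value is derived from the ~0.0027 max relative difference in the failure above; shapes and seed are illustrative):

import numpy as np
import faiss

d, nb, nq, k = 64, 10000, 100, 10
rng = np.random.default_rng(123)
xb = rng.random((nb, d), dtype=np.float32)
xq = rng.random((nq, d), dtype=np.float32)

index = faiss.IndexFlatIP(d)
index.add(xb)
D, I = index.search(xq, k)

# Reference top-k inner products computed in float32 with numpy.
D_ref = np.sort(xq @ xb.T, axis=1)[:, ::-1][:, :k]

# bf16 accumulation costs a few decimal digits, so compare with a relaxed
# relative tolerance instead of 5 decimals.
np.testing.assert_allclose(D, D_ref, rtol=5e-3)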

mengdilin avatar Jul 29 '24 17:07 mengdilin

Hi @guangzegu and @xtangxtang, what is the status of this PR? Let me know if you are blocked on anything :)

mengdilin avatar Aug 19 '24 23:08 mengdilin

@mengdilin Sorry, I took some time off due to family matters. Now we will follow your suggestions to make some adjustments first and then ask for your help to review :)

guangzegu avatar Aug 21 '24 02:08 guangzegu