[FEA] Increase CAGRA's max top-k
Summarizing an offline discussion w/ @anaruse (Akira, please feel free to correct any mistakes):
In CAGRA's single-CTA algorithm, the internal top-k size is capped at 512 (and k must be <= internal top-k), so the upper limit of k in single-CTA is roughly 200-300. The cap exists because shared memory and register usage grow with the internal top-k size. If the internal top-k is >= 512, we fall back to multi-kernel, which doesn't limit the internal top-k, but the plan is to abandon the multi-kernel approach because we can use the multi-CTA algorithm instead. The challenge is that we know we can split queries over multiple CTAs with the multi-CTA kernel when the internal top-k is very large, but we don't yet know what recall we can achieve when k itself is very large (e.g. 1k).
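For reference, the constraint above can be sketched in terms of the public `raft::neighbors::cagra::search_params` fields (`itopk_size`, `algo`); the selection rule here is a paraphrase of this discussion, not the actual dispatch code in RAFT:

```cpp
#include <algorithm>
#include <cstddef>

#include <raft/neighbors/cagra_types.hpp>

// Sketch only: pick a CAGRA search algorithm from the requested k, following
// the rule described above (internal top-k >= k; single-CTA only up to 512).
raft::neighbors::cagra::search_params params_for_k(std::size_t k)
{
  raft::neighbors::cagra::search_params params;
  // the internal top-k must be at least k
  params.itopk_size = std::max<std::size_t>(params.itopk_size, k);
  if (params.itopk_size <= 512) {
    // fits within the single-CTA kernel's shared-memory/register budget
    params.algo = raft::neighbors::cagra::search_algo::SINGLE_CTA;
  } else {
    // today this falls back to the multi-kernel path; the proposal is to use
    // MULTI_CTA here instead once its recall at large k is understood
    params.algo = raft::neighbors::cagra::search_algo::MULTI_KERNEL;
  }
  return params;
}
```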
Also cc @tfeher and @wphicks for visibility.
Here is the performance of top-1000 search using the multi-CTA and multi-kernel modes.
For a fair comparison, I evaluated the performance with itopk = 1024, 2048, and 4096, so that the total number of distance calculations is almost the same in both modes. As expected, the recall achieved in multi-CTA mode is lower than in multi-kernel mode. The throughput is also lower, although that was not clear until I measured it.
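One measurement point of this comparison roughly corresponds to a configuration like the following. This is a sketch using the public CAGRA API with the values from the text above; exact overloads may differ between RAFT versions:

```cpp
#include <cstdint>

#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/neighbors/cagra.cuh>

// k = 1000, itopk = 1024, multi-CTA mode; swap algo to MULTI_KERNEL (and
// itopk to 2048 / 4096) for the other configurations in the comparison.
void run_top1000(raft::device_resources const& res,
                 raft::neighbors::cagra::index<float, uint32_t> const& index,
                 raft::device_matrix_view<const float, int64_t> queries)
{
  constexpr int64_t k = 1000;
  int64_t n_queries   = queries.extent(0);

  raft::neighbors::cagra::search_params params;
  params.itopk_size = 1024;  // also measured: 2048 and 4096
  params.algo       = raft::neighbors::cagra::search_algo::MULTI_CTA;

  auto neighbors = raft::make_device_matrix<uint32_t, int64_t>(res, n_queries, k);
  auto distances = raft::make_device_matrix<float, int64_t>(res, n_queries, k);

  raft::neighbors::cagra::search(
    res, params, index, queries, neighbors.view(), distances.view());
}
```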
Supplements:
- I created a ground-truth list for the query file of each dataset using exact nearest-neighbor search (see the sketch after this list).
- The current max top-k in multi-kernel mode is limited to 1024 to reduce library size and compile time. To measure the performance this time, I enabled some template instantiations here: https://github.com/rapidsai/raft/blob/70475c2e699ea4fe42ace1f47919d5d976c3cb7e/cpp/include/raft/neighbors/detail/cagra/topk_for_cagra/topk_core.cuh#L897-L900 and commented out these lines: https://github.com/rapidsai/raft/blob/70475c2e699ea4fe42ace1f47919d5d976c3cb7e/cpp/include/raft/neighbors/detail/cagra/search_plan.cuh#L263-L269
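The ground-truth computation in the first item is just exact (brute-force) k-NN. A minimal host-side sketch of the idea follows; the actual lists were presumably produced on the GPU (e.g. with RAFT's brute-force search), but the definition is the same:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Exact (brute-force) k-NN ground truth under squared-L2 distance.
// Assumes k <= dataset.size() and all vectors have the same dimension.
std::vector<std::vector<std::size_t>> exact_ground_truth(
  std::vector<std::vector<float>> const& dataset,
  std::vector<std::vector<float>> const& queries,
  std::size_t k)
{
  std::vector<std::vector<std::size_t>> gt(queries.size());
  for (std::size_t q = 0; q < queries.size(); ++q) {
    std::vector<std::pair<float, std::size_t>> dists(dataset.size());
    for (std::size_t i = 0; i < dataset.size(); ++i) {
      float d = 0.0f;
      for (std::size_t j = 0; j < dataset[i].size(); ++j) {
        float diff = dataset[i][j] - queries[q][j];
        d += diff * diff;
      }
      dists[i] = {d, i};
    }
    // keep only the k closest vectors, in ascending distance order
    std::partial_sort(dists.begin(), dists.begin() + k, dists.end());
    for (std::size_t n = 0; n < k; ++n) { gt[q].push_back(dists[n].second); }
  }
  return gt;
}
```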
This figure shows the performance of the multi-CTA kernel with different internal top-k sizes. In the current multi-CTA implementation, the internal top-k size is fixed at 32, but I changed it to 64 or 128 and measured the performance. https://github.com/rapidsai/raft/blob/28b789404bedfa8dd82675fc4221f6db927c0422/cpp/include/raft/neighbors/detail/cagra/search_multi_cta.cuh#L110-L112
By changing the itopk size, we can get performance almost comparable to the multi-kernel implementation on the GloVe, NYTimes, and SIFT datasets, while it is still slower for GIST.
I think it would be better to make the internal top-k size configurable in the multi-CTA implementation, or to fix it at 64 if the performance degradation is small enough in other top-N searches, e.g. N=10.
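To make the role of the per-CTA internal top-k concrete: in multi-CTA mode each CTA working on a query keeps only a small candidate list (32 today, 64/128 in the experiment above), and the query's final top-k is selected from the union of those lists. A conceptual host-side sketch of that merge step, not the actual kernel code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Conceptual sketch: each CTA assigned to a query returns its own small
// internal top-k list of {distance, id} candidates; the final top-k for the
// query is taken from the union of those lists. If the per-CTA lists are much
// smaller than the requested k, few distinct candidates survive per CTA,
// which is one way recall can suffer at large k.
std::vector<std::pair<float, uint32_t>> merge_per_cta_results(
  std::vector<std::vector<std::pair<float, uint32_t>>> const& per_cta_lists,
  std::size_t k)
{
  std::vector<std::pair<float, uint32_t>> merged;
  for (auto const& list : per_cta_lists) {
    merged.insert(merged.end(), list.begin(), list.end());
  }
  std::sort(merged.begin(), merged.end());  // ascending by distance
  if (merged.size() > k) { merged.resize(k); }
  return merged;
}
```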
Partially addressed by #2097.
Tagging @mfoerste4 for visibility.