[FEA] Increase CAGRA's max top-k
Summarizing an offline discussion w/ @anaruse (Akira, please feel free to correct any mistakes):
In CAGRA's single-CTA algorithm, the internal top-k size is capped at 512 (and k must be <= internal top-k), so the upper limit of k in single-CTA is roughly 200-300. The cap exists because shared memory and register usage grow with the internal top-k size. If the internal top-k is >= 512, we fall back to multi-kernel, which doesn't limit the internal top-k, but the plan is to abandon the multi-kernel approach because we can use the multi-CTA algorithm instead. The challenge is that we know we can split queries over multiple CTAs with the multi-CTA kernel when the internal top-k is very large, but we don't yet know what recall we can achieve when k itself is very large (e.g. 1k).
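For reference, the constraint above can be sketched in terms of the public `raft::neighbors::cagra::search_params` fields (`itopk_size`, `algo`); the selection rule here is a paraphrase of this discussion, not the actual dispatch code in RAFT:

```cpp
#include <algorithm>
#include <cstddef>

#include <raft/neighbors/cagra_types.hpp>

// Sketch only: pick a CAGRA search algorithm from the requested k, following
// the rule described above (internal top-k >= k; single-CTA only up to 512).
raft::neighbors::cagra::search_params params_for_k(std::size_t k)
{
  raft::neighbors::cagra::search_params params;
  // the internal top-k must be at least k
  params.itopk_size = std::max<std::size_t>(params.itopk_size, k);
  if (params.itopk_size <= 512) {
    // fits within the single-CTA kernel's shared-memory/register budget
    params.algo = raft::neighbors::cagra::search_algo::SINGLE_CTA;
  } else {
    // today this falls back to the multi-kernel path; the proposal is to use
    // MULTI_CTA here instead once its recall at large k is understood
    params.algo = raft::neighbors::cagra::search_algo::MULTI_KERNEL;
  }
  return params;
}
```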
Also cc @tfeher and @wphicks for visibility.
Here is the performance of top-1000 search using the multi-CTA and multi-kernel modes.
For a fair comparison, I evaluated the performance with itopk = 1024, 2048, and 4096, so that the total number of distance calculations is almost the same in both modes. As expected, the recall achieved in multi-CTA mode is lower than in multi-kernel mode. The throughput is also lower, although that was not clear until I measured it.
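One measurement point of this comparison roughly corresponds to a configuration like the following. This is a sketch using the public CAGRA API with the values from the text above; exact overloads may differ between RAFT versions:

```cpp
#include <cstdint>

#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/neighbors/cagra.cuh>

// k = 1000, itopk = 1024, multi-CTA mode; swap algo to MULTI_KERNEL (and
// itopk to 2048 / 4096) for the other configurations in the comparison.
void run_top1000(raft::device_resources const& res,
                 raft::neighbors::cagra::index<float, uint32_t> const& index,
                 raft::device_matrix_view<const float, int64_t> queries)
{
  constexpr int64_t k = 1000;
  int64_t n_queries   = queries.extent(0);

  raft::neighbors::cagra::search_params params;
  params.itopk_size = 1024;  // also measured: 2048 and 4096
  params.algo       = raft::neighbors::cagra::search_algo::MULTI_CTA;

  auto neighbors = raft::make_device_matrix<uint32_t, int64_t>(res, n_queries, k);
  auto distances = raft::make_device_matrix<float, int64_t>(res, n_queries, k);

  raft::neighbors::cagra::search(
    res, params, index, queries, neighbors.view(), distances.view());
}
```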
Supplements:
- I created a ground-truth list for the query file of each dataset using exact nearest-neighbor search (see the sketch after this list).
- The current max top-k in multi-kernel mode is limited to 1024 to reduce library size and compile time. To measure the performance this time, I enabled some template instantiations here: https://github.com/rapidsai/raft/blob/70475c2e699ea4fe42ace1f47919d5d976c3cb7e/cpp/include/raft/neighbors/detail/cagra/topk_for_cagra/topk_core.cuh#L897-L900 and commented out these lines: https://github.com/rapidsai/raft/blob/70475c2e699ea4fe42ace1f47919d5d976c3cb7e/cpp/include/raft/neighbors/detail/cagra/search_plan.cuh#L263-L269
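The ground-truth computation in the first item is just exact (brute-force) k-NN. A minimal host-side sketch of the idea follows; the actual lists were presumably produced on the GPU (e.g. with RAFT's brute-force search), but the definition is the same:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Exact (brute-force) k-NN ground truth under squared-L2 distance.
// Assumes k <= dataset.size() and all vectors have the same dimension.
std::vector<std::vector<std::size_t>> exact_ground_truth(
  std::vector<std::vector<float>> const& dataset,
  std::vector<std::vector<float>> const& queries,
  std::size_t k)
{
  std::vector<std::vector<std::size_t>> gt(queries.size());
  for (std::size_t q = 0; q < queries.size(); ++q) {
    std::vector<std::pair<float, std::size_t>> dists(dataset.size());
    for (std::size_t i = 0; i < dataset.size(); ++i) {
      float d = 0.0f;
      for (std::size_t j = 0; j < dataset[i].size(); ++j) {
        float diff = dataset[i][j] - queries[q][j];
        d += diff * diff;
      }
      dists[i] = {d, i};
    }
    // keep only the k closest vectors, in ascending distance order
    std::partial_sort(dists.begin(), dists.begin() + k, dists.end());
    for (std::size_t n = 0; n < k; ++n) { gt[q].push_back(dists[n].second); }
  }
  return gt;
}
```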
This figure shows the performance of the multi-CTA kernel with different internal top-k sizes. In the current multi-CTA implementation, the internal top-k size is fixed at 32, but I changed it to 64 or 128 and measured the performance. https://github.com/rapidsai/raft/blob/28b789404bedfa8dd82675fc4221f6db927c0422/cpp/include/raft/neighbors/detail/cagra/search_multi_cta.cuh#L110-L112
By changing the itopk size, we can get performance almost comparable to the multi-kernel implementation on the GloVe, NYTimes, and SIFT datasets, while it is still slower for GIST.
I think it would be better to make the internal top-k size configurable in the multi-CTA implementation, or to fix it at 64 if the performance degradation is small enough in other top-N searches, e.g. N=10.
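To make the role of the per-CTA internal top-k concrete: in multi-CTA mode each CTA working on a query keeps only a small candidate list (32 today, 64/128 in the experiment above), and the query's final top-k is selected from the union of those lists. A conceptual host-side sketch of that merge step, not the actual kernel code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Conceptual sketch: each CTA assigned to a query returns its own small
// internal top-k list of {distance, id} candidates; the final top-k for the
// query is taken from the union of those lists. If the per-CTA lists are much
// smaller than the requested k, few distinct candidates survive per CTA,
// which is one way recall can suffer at large k.
std::vector<std::pair<float, uint32_t>> merge_per_cta_results(
  std::vector<std::vector<std::pair<float, uint32_t>>> const& per_cta_lists,
  std::size_t k)
{
  std::vector<std::pair<float, uint32_t>> merged;
  for (auto const& list : per_cta_lists) {
    merged.insert(merged.end(), list.begin(), list.end());
  }
  std::sort(merged.begin(), merged.end());  // ascending by distance
  if (merged.size() > k) { merged.resize(k); }
  return merged;
}
```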
Partially addressed by #2097.
Tagging @mfoerste4 for visibility.