ucx icon indicating copy to clipboard operation
ucx copied to clipboard

UCS: adding multi-dimensional hash tables

Open alex--m opened this issue 3 years ago • 11 comments

alex--m avatar Oct 23 '21 13:10 alex--m

Can one of the admins verify this patch?

swx-jenkins4 avatar Oct 23 '21 13:10 swx-jenkins4

ok to test

shamisp avatar Oct 23 '21 17:10 shamisp

@alex--m any update on CLA ?

shamisp avatar Oct 23 '21 17:10 shamisp

@shamisp afraid not, and I don't expect it'll be resolved soon. I'm posting stuff I plan to upstream once it does - I just don't seem to have permissions to put the "CLA missing" label...

alex--m avatar Oct 23 '21 18:10 alex--m

how is it different from using regular khash with a custom key type that contains multiple values?

yosefe avatar Nov 14 '21 13:11 yosefe

how is it different from using regular khash with a custom key type that contains multiple values?

This is the result of some research, and the paper is still in progress, but the gist is the difference in iteration. A multi-dimensional hash-table would only access the keys matching the query vector in every dimension (Figure 1), whereas this implementation has a special way to iterate over neighboring vectors (Figure 2) in order to locate the nearest neighbor. I plan to use this as an advanced form of caching (in a separate commit).

khash ndim_khash

alex--m avatar Nov 14 '21 14:11 alex--m

IIUC, the expected lookup performance of the special multi-dim implementation should be better than default khash with a vector key?

yosefe avatar Nov 14 '21 14:11 yosefe

IIUC, the expected lookup performance of the special multi-dim implementation should be better than default khash with a vector key?

No, I'm afraid the lookup is not faster, just different (and in fact typically slower). For example, if both dimensions fall into bin #1 in the figures above, the default khash with vector keys will check V3 and V4, whereas this special 2D khash will check V3 - V8. The only advantage (and purpose) of this special multi-dimensional khash-based data-structure is a fast nearest-neighbor lookup.

alex--m avatar Nov 14 '21 15:11 alex--m

@alex--m Can you give a bit more details how it will be used ? AKA how we will benefit from nearest neighbor lookup speedup.

shamisp avatar Nov 14 '21 15:11 shamisp

@alex--m Can you give a bit more details how it will be used ? AKA how we will benefit from nearest neighbor lookup speedup.

Sure. The basic Idea is this: caches tend to be all-or-nothing matches, so if you're looking up, say, a past request, you only get identical past requests. In such case, the nearest-neighbor lookup allow you to find a "similar" past request and modify it, rather than creating a brand new request (which is presumably more expensive, otherwise this makes little sense). The speedup primarily comes from (a) re-using similar past objects rather than creating new ones, and secondarily from the reduced memory consumption which is the result of this recycling. Now this begs the question: what objects are so hard to create, justifying all this? one answer is QPs, but there are others.

alex--m avatar Nov 14 '21 15:11 alex--m

@alex--m Sounds like some cache approximation. Can you please give us a bit more specific what is the follow up patch and how it will be useful for UCX.

shamisp avatar Nov 14 '21 18:11 shamisp