heat icon indicating copy to clipboard operation
heat copied to clipboard

Address distributed non-ordered indexing

Open ClaudiaComito opened this issue 3 years ago • 3 comments

Related All kind of stuff depends on distributed non-ordered indexing. Here a sample of issues/PRs where the problem has come up in various forms:

#607 #760 #903 #703 #824 #902 #177 #621 #749 #271 #857

Feature functionality

We want to be able to index a distributed DNDarray with a distributed, non-ordered key, and return the correct, stable result. Examples below. An implementation of this functionality via Alltoallv is available in ht.sort(), needs to be generalized.

>>> a = ht.arange(50, split=0)
>>> a
DNDarray([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
          27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=ht.int32, device=cpu:0, split=0)
>>> b = ht.random.randint(0,50,(20,), dtype=ht.int64, split=0)
>>> b
DNDarray([46,  5, 44,  8, 14, 10, 30, 15, 34, 30, 41, 44, 28, 26, 11, 20, 16,  7,  9,  8], dtype=ht.int64, device=cpu:0, split=0)
>>> c = a[b]
>>> c
DNDarray([46,  5, 44,  8, 14, 10, 30, 15, 34, 30, 41, 44, 28, 26, 11, 20, 16,  7,  9,  8], dtype=ht.int32, device=cpu:0, split=0)

In the current implementation, c = a[b] returns a distributed DNDarray populated by whichever subset of the key is process-local.

On 2 processes:

c =  DNDarray([ 5,  8, 14, 10, 15, 11, 20, 16,  7,  9,  8, 46, 44, 30, 34, 30, 41, 44, 28, 26], dtype=ht.int32, device=cpu:0, split=0)

On 3 processes:

c =  DNDarray([ 5,  8, 14, 10, 15, 11, 16,  7,  9,  8, 30, 30, 28, 26, 20, 46, 44, 34, 41, 44], dtype=ht.int32, device=cpu:0, split=0)

ClaudiaComito avatar Feb 08 '22 10:02 ClaudiaComito

Stil relevant: #938 is a currently active PR that will resolve this issue.

Reviewed within #1109.

mrfh92 avatar Aug 17 '23 18:08 mrfh92