There is no vrscatter instruction in the spec.

Open Zissi-Lei opened this issue 2 years ago • 4 comments

Hi, I'm reading the rvv-1.0 spec and found that there is a vrgather instruction but no corresponding vrscatter instruction. Is there a reason for this? I'd like to know why. Thanks for your time!

Zissi-Lei avatar Apr 24 '22 08:04 Zissi-Lei

Suppose you want to scatter data [A B C D] to destination indices [1 3 5 7], as follows:

index:  7 6 5 4 3 2 1 0
before: x x x x x x x x
after:  D x C x B x A x

To do so, we can use a masked vrgather.vv with the following input operands:

data:  x x x x D C B A
mask:  1 0 1 0 1 0 1 0
index: 3 x 2 x 1 x 0 x

(Here, x means "do not care".) The challenge then becomes constructing these mask and (source) index vector operands from the destination indices ([1 3 5 7]).
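As a rough illustration, the operands above can be checked with a scalar Python model of a masked vrgather.vv. This is only a sketch: the function name `masked_vrgather_vv` is hypothetical, lists are written with element 0 first (the reverse of the diagrams above), the "do not care" index entries are arbitrarily set to 0, and a mask-undisturbed policy is assumed for inactive elements.

```python
def masked_vrgather_vv(vd, vs2, vs1, mask):
    """Scalar model of masked vrgather.vv (assumes mask-undisturbed policy).

    vd   : previous destination contents (kept where the mask is 0)
    vs2  : source data vector
    vs1  : source index vector
    mask : one 0/1 entry per element
    """
    vlmax = len(vs2)
    out = list(vd)
    for i, active in enumerate(mask):
        if active:
            idx = vs1[i]
            # Out-of-range indices read as zero.
            out[i] = vs2[idx] if idx < vlmax else 0
    return out


# Element 0 first, so these are the diagram rows read right to left.
data = ["A", "B", "C", "D", "x", "x", "x", "x"]
mask = [0, 1, 0, 1, 0, 1, 0, 1]
index = [0, 0, 0, 1, 0, 2, 0, 3]  # "do not care" entries set to 0
before = ["x"] * 8

after = masked_vrgather_vv(before, data, index, mask)
print(after)
```

Read right to left, `after` corresponds to the "after" row of the first diagram.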

For example, if the mask is available and the destination indices form an increasing sequence, like in this case, we can use viota.m to construct the corresponding source indices from the (source) mask vector, as discussed in the context of "vdecompress".
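A minimal sketch of that viota.m step, again as a scalar Python model with element 0 first. The helper names `viota_m` and `masked_vrgather_vv` are hypothetical, the mask is assumed to be available already, and mask-undisturbed behavior is assumed for the gather.

```python
def viota_m(mask):
    """Scalar model of viota.m: element i gets the count of set mask bits below i."""
    out, count = [], 0
    for bit in mask:
        out.append(count)
        count += 1 if bit else 0
    return out


def masked_vrgather_vv(vd, vs2, vs1, mask):
    """Scalar model of masked vrgather.vv (assumes mask-undisturbed policy)."""
    vlmax = len(vs2)
    out = list(vd)
    for i, active in enumerate(mask):
        if active:
            out[i] = vs2[vs1[i]] if vs1[i] < vlmax else 0
    return out


mask = [0, 1, 0, 1, 0, 1, 0, 1]          # destinations 1, 3, 5, 7
data = ["A", "B", "C", "D", "x", "x", "x", "x"]

src_index = viota_m(mask)                 # values at inactive positions are unused
result = masked_vrgather_vv(["x"] * 8, data, src_index, mask)
print(src_index)
print(result)
```

At the active positions (1, 3, 5, 7) the computed indices are 0, 1, 2, 3, which matches the index operand shown earlier. This only works because the destination indices are increasing.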

In other cases, I suspect it will be best to use an indexed store, perhaps preceded by a unit-stride store (for the "undisturbed" elements) followed by a unit-stride load. If using the memory system is not an option, then in the worst case you might end up constructing mask and index one element at a time, with slides and bit-twiddling; I haven't thought through the details.
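The memory-based route can be modeled in the same style. This is a sketch under assumptions: `scatter_via_memory` is a hypothetical name, the buffer is a plain Python list standing in for memory, and element indices are used where a real indexed store (vsoxei/vsuxei) takes byte offsets.

```python
def scatter_via_memory(vd, data, dest):
    """Model of scatter done through a memory buffer.

    vd   : current register contents (the "undisturbed" elements)
    data : values to scatter
    dest : destination element index for each value
    """
    buf = list(vd)                     # unit-stride store of the old contents
    for value, d in zip(data, dest):   # indexed store; later elements win on duplicates
        buf[d] = value
    return list(buf)                   # unit-stride load of the merged result


before = ["x"] * 8
in_order = scatter_via_memory(before, ["A", "B", "C", "D"], [1, 3, 5, 7])
shuffled = scatter_via_memory(before, ["A", "B", "C", "D"], [5, 1, 7, 3])
print(in_order)
print(shuffled)
```

Unlike the viota.m construction, this model does not need the destination indices to be increasing, as the second call shows.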

nick-knight avatar Apr 26 '22 00:04 nick-knight

Hi @nick-knight, I am curious about the performance. Doesn't a unit-stride store followed by a unit-stride load have higher overhead than using vrgather? The load/store approach has to go through memory twice, which I would expect to be a huge cost compared to doing the work in registers.

howjmay avatar Nov 16 '23 17:11 howjmay

“Performance” is a consequence of the implementation, not the interface (ISA). This repo only concerns the interface. Nowhere in this repo does it discuss “overhead”, “runtime”, “cycles”, etc. Your question should be directed at the hardware engineers who are implementing the vector processor you are targeting.

If you were on my engineering team, I’d tell you to implement both variants, benchmark them, and report back to me which one was better.

I hope this makes sense!

nick-knight avatar Nov 16 '23 18:11 nick-knight

Thank you. I was wondering whether there is a general guideline for efficient implementation, but as you said, it shouldn't be a topic for this repo.

howjmay avatar Nov 19 '23 00:11 howjmay