riscv-v-spec
Left shift with saturation
The current spec requires four vector instructions to implement a left shift with saturation:
# v0 is data
# v1 is shift
# a0 is vl
vsetvli x0, a0, e32, m1
vmv.v.i v2, 1
vsll.vv v3, v2, v1
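vwmulsu.vv v6, v0, v3 # v6:v7 = data * (1 << shift), 2*SEW wide
vnclip.wi v5, v6, 0 # saturate back to SEW; destination must not overlap the upper half of the source group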
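For readers who want to check the arithmetic, here is a minimal scalar sketch of what the sequence above computes for one element. It is only a model in Python, not spec text: the function names are made up for illustration, and it assumes vsll uses only the low log2(SEW) bits of the shift amount and that the clip step uses a shift of 0.

```python
SEW = 32  # element width used in the sequence above


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sat_shl_via_widening(data, shift, sew=SEW):
    """Model one element of the splat / vsll / vwmulsu / vnclip sequence.

    Returns (result, saturated), where `saturated` models whether the
    clip step would set vxsat for this element.
    """
    # Splat 1, then vsll: scale = 1 << shift, kept as an unsigned SEW-bit value.
    scale = (1 << (shift & (sew - 1))) & ((1 << sew) - 1)
    # vwmulsu: signed data times unsigned scale, exact in 2*SEW bits.
    wide = to_signed(data, sew) * scale
    # Narrowing clip with shift 0: saturate to the signed SEW-bit range.
    lo, hi = -(1 << (sew - 1)), (1 << (sew - 1)) - 1
    saturated = wide < lo or wide > hi
    return max(lo, min(hi, wide)), saturated
```

For example, sat_shl_via_widening(0x40000000, 1) saturates to INT32_MAX and reports saturation, while sat_shl_via_widening(-1, 31) returns INT32_MIN exactly with no saturation.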
However, because it relies on widening instructions, this method cannot work with SEW=64 on most hardware, which does not support SEW=128. If we had a dedicated vector instruction like the following
vssll.vv vd, vs2, vs1, vm # vd[i] = clip(vs2[i] << vs1[i])
vssll.vx vd, vs2, rs1, vm # vd[i] = clip(vs2[i] << x[rs1])
vssll.vi vd, vs2, uimm, vm # vd[i] = clip(vs2[i] << uimm)
vssllu.vv vd, vs2, vs1, vm # vd[i] = clip(vs2[i] << vs1[i]), unsigned
vssllu.vx vd, vs2, rs1, vm # vd[i] = clip(vs2[i] << x[rs1]), unsigned
vssllu.vi vd, vs2, uimm, vm # vd[i] = clip(vs2[i] << uimm), unsigned
then only one instruction would be needed:
vsetvli x0, a0, e32, m1
vssll.vv v5, v0, v1
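To make the intended behaviour concrete, here is one possible reading of the proposed per-element semantics as a Python sketch. vssll and vssllu are only a proposal in this thread, so everything here is an assumption: the helper names are hypothetical, and the masking of the shift amount to log2(SEW) bits is borrowed from how vsll behaves.

```python
def vssll_element(value, shift, sew=32):
    """Signed saturating left shift of one SEW-bit element (proposed semantics)."""
    shift &= sew - 1  # assumption: only the low log2(SEW) bits are used, as for vsll
    lo, hi = -(1 << (sew - 1)), (1 << (sew - 1)) - 1
    exact = value << shift  # Python integers do not overflow
    return max(lo, min(hi, exact))


def vssllu_element(value, shift, sew=32):
    """Unsigned saturating left shift of one SEW-bit element (proposed semantics)."""
    shift &= sew - 1  # same assumption as above
    hi = (1 << sew) - 1
    return min(hi, value << shift)
```

The sew parameter makes the point of the proposal visible: the same definition works for sew=64 without needing any 128-bit intermediate in hardware-visible state.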
@aswaterman What's the reason that the spec only has "vssrl" but no left shift with saturation?
We could use a widening instruction and vnclip to do the work, and "vxsat" would be set correctly by "vnclip". But this approach does not work for SEW=64 input if the platform doesn't support SEW=128. In that case it is hard to detect saturation and set vxsat correctly.
My recommendation is to use a widening multiply and clip for the cases where SEW < ELEN. The overhead of multiply vs. shift is usually not a concern, since vector units will nearly always provide fully pipelined multipliers.
For the SEW=ELEN case, I think it's totally reasonable to use a multi-instruction sequence (compare against 2^N, perform left shift, and, using the comparison result as a mask, overwrite some elements with -1).
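A rough sketch of that multi-instruction idea for the unsigned case, written as a Python model over lists rather than as vector code. It assumes that "compare against 2^N" means comparing each element against 2^(SEW - shift) and that shift amounts are in the range 0..63; the function name is made up for illustration.

```python
SEW = 64
ALL_ONES = (1 << SEW) - 1  # the "-1" pattern, i.e. UINT64_MAX


def sat_shl_unsigned(data, shifts):
    """Unsigned saturating left shift, element by element."""
    # Step 1: compare. An element overflows if it is at least 2^(SEW - shift).
    overflow = [d >= (1 << (SEW - s)) for d, s in zip(data, shifts)]
    # Step 2: plain wrapping left shift, as vsll would produce.
    shifted = [(d << s) & ALL_ONES for d, s in zip(data, shifts)]
    # Step 3: using the comparison result as a mask, overwrite with all ones.
    return [ALL_ONES if o else v for o, v in zip(overflow, shifted)]
```

Under these assumptions the three steps map onto roughly one compare, one shift, and one masked merge, plus whatever is needed to build the per-element threshold.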
I think performance would be an issue when SEW equals ELEN. For example, when SEW and ELEN are both 64, the code would be:
# v1 is data
# v2 is shift
# a0 is vl
vsetvli x0, a0, e64, m1
vsll.vv v3, v1, v2 # shift the input
li a1, 9223372036854775807 # INT64_MAX
vmv.v.x v4, a1
vsra.vv v4, v4, v2 # INT64_MAX / (2 ^ shift)
vmsgt.vv v0, v1, v4 # overflow if data > (INT64_MAX / (2 ^ shift))
vmerge.vxm v3, v3, a1, v0 # saturate to INT64_MAX
li a1, -9223372036854775808 # INT64_MIN
vmv.v.x v4, a1
vsra.vv v4, v4, v2 # INT64_MIN / (2 ^ shift)
vmslt.vv v0, v1, v4 # overflow if data < (INT64_MIN / (2 ^ shift))
vmerge.vxm v3, v3, a1, v0 # saturate to INT64_MIN
# v3 is result
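As a sanity check on the comparisons, the same steps can be modelled per element in Python. This is a sketch only: the helper names are illustrative, shift amounts are assumed to be in 0..63, and Python's >> on negative integers is relied on to behave like an arithmetic right shift.

```python
INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)


def wrap64(value):
    """Reduce an integer to a signed 64-bit two's-complement value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def sat_shl_signed_compare(data, shifts):
    """Signed saturating left shift using the two bound comparisons."""
    result = []
    for d, s in zip(data, shifts):
        shifted = wrap64(d << s)      # wrapping left shift
        if d > (INT64_MAX >> s):      # positive overflow check
            shifted = INT64_MAX       # merge INT64_MAX under the mask
        if d < (INT64_MIN >> s):      # negative overflow check
            shifted = INT64_MIN       # merge INT64_MIN under the mask
        result.append(shifted)
    return result
```

Note that -1 shifted left by 63 is exactly INT64_MIN and is not flagged by either comparison, which matches the intended behaviour.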
Yeah, that's a fairly substantial implementation. Can you give more details on how this shows up in applications?
A slight variation is to construct INT64_MIN from INT64_MAX using vnot. This avoids a scalar instruction (or two?) and scalar-to-vector register movement, but consumes another vector register.
You can also do a similar pattern with vmul and vmulh, although that seems at least as costly. (This is just the multi-word arithmetic version of the vwmul+vnclip approach from the case SEW < ELEN.)
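A hedged sketch of what that multi-word variant might look like for one signed 64-bit element, in Python. It assumes the scale 2^shift is treated as unsigned (so the high half would come from a signed-by-unsigned high multiply rather than a plain signed one), and that overflow is detected by checking that the high half is the sign extension of the low half. The helper names are invented for the example.

```python
INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)


def wrap64(value):
    """Reduce an integer to a signed 64-bit two's-complement value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def mul_low_high(d, scale):
    """Return the (low, high) signed 64-bit halves of the exact product."""
    product = d * scale       # exact, at most 128 bits
    low = wrap64(product)     # what a low-half multiply would return
    high = product >> 64      # floor shift gives the signed high half
    return low, high


def sat_shl_signed_mulh(d, s):
    """Signed saturating left shift via a two-word multiply."""
    low, high = mul_low_high(d, 1 << s)
    # The product fits in 64 bits only if the high half is the
    # sign extension of the low half (all zeros or all ones).
    if high != (low >> 63):
        return INT64_MAX if d >= 0 else INT64_MIN
    return low
```

In vector form this would still need a compare and one or two masked merges after the two multiplies, which is consistent with it being at least as costly.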
I was hoping for a trick involving vsmul but I didn't see it.
I think @aswaterman's earlier comment referenced the unsigned case, which involves roughly half the code.
If I didn't screw it up, I was able to improve on the algorithm a bit (9 -> 6 vector instructions):
vsetvli x0, a0, e64, m1
li t0, (1<<63) # -inf
vsll.vv v3, v1, v2
vsra.vv v4, v3, v2
vmsne.vv v0, v1, v4 # true if +/- overflow
vmerge.vxm v3, v3, t0, v0 # set to -inf if +/- overflow
vmsge.vi v0, v1, 0, v0.t # true if +overflow
vnot.v v3, v3, v0.t # set to +inf if +overflow
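The shift-and-shift-back trick can be checked with a small Python model. This is only a sketch under the assumptions that shift amounts are in 0..63 and that "-inf" and "+inf" in the comments stand for INT64_MIN and INT64_MAX; the function names are illustrative.

```python
INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)


def wrap64(value):
    """Reduce an integer to a signed 64-bit two's-complement value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def sat_shl_signed_roundtrip(data, shifts):
    """Signed saturating left shift using the shift-back overflow check."""
    result = []
    for d, s in zip(data, shifts):
        shifted = wrap64(d << s)      # wrapping left shift
        restored = shifted >> s       # arithmetic shift back
        if restored != d:             # mismatch means overflow in either direction
            shifted = INT64_MIN       # merge INT64_MIN under the overflow mask
            if d >= 0:                # non-negative input: positive overflow
                shifted = ~shifted    # bitwise NOT of INT64_MIN is INT64_MAX
        result.append(shifted)
    return result
```

The check works because, for shifts below 64, the wrapped result shifts back to the original value only when no significant bits were lost.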
It also needs additional instructions to check for saturation and set the vxsat flag. https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#38-vector-fixed-point-saturation-flag-vxsat
The right-shift instructions can do a similar thing in one instruction without the SEW == ELEN problem. https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#134-vector-single-width-scaling-shift-instructions
There should be a later vector extension with greater support for fixed-point operations, and we don't want to add more to the vector spec before 1.0.