riscv-perf-model Add Support for Vector Permutation Instructions

Add vector sequencer and execution support for the vector permutation instructions. At least 1 test should be added with an instruction from each category to verify proper sequencing. Some of these instruction types require reading up to 8 source registers to write to a single destination, which is not reasonable for sequencing. Need to set a reasonable limit on the number of source registers allowed per uop and come up with a creative way to sequence these more complex instructions. The solution may require that the uops be executed in order, which is ok.

[x] Vector Permutation Instructions: Integer Scalar Move Instructions
[x] Vector Permutation Instructions: Floating-Point Scalar Move Instructions
[ ] Vector Permutation Instructions: Vector Slide Instructions
[ ] Vector Permutation Instructions: Vector Register Gather Instructions
[ ] Vector Permutation Instructions: Vector Compress Instruction
[x] Vector Permutation Instructions: Whole Vector Register Move

Sep 04 '24 17:09 kathlenemagnus-mips

I've hooked up all of these instructions to the vector uop generator, but I have not implemented sequencing for the "complex" permute instructions. The uop generation for these types of permutes is very dependent on the design of the permute execution unit. It would make a good intern project!

Mar 10 '25 22:03 kathlenemagnus

Hi @kathlenemagnus, I would like to know if I can come up with a design proposal and a test plan for the remaining Vector permutation instructions.

Mar 21 '25 06:03 govardhnn

Hi @kathlenemagnus, I would like to know if I can come up with a design proposal and a test plan for the remaining Vector permutation instructions.

Yes, that would be ok with me. We can discuss your proposal in an upcoming SIG meeting.

Mar 21 '25 15:03 kathlenemagnus

Hi @kathlenemagnus I discussed with Arup and he asked me to assign this issue to myself, but I am unable to assign it to myself (maybe because I am not following this project or something). Can you please assign this to me?

I have also come up with a following implementation for the permute instruction. In this example I have demonstrated vrgather but gather, compress, slideup/down should be implemented in similar manner for Olympia:

(I am unable to copy the image from below source but it explains basics of how all these 4 instructions work, I am not interested in their implementation for Olympia)

Source: https://arxiv.org/html/2505.07112v2 “Vgather: which input element is selected for each element of the output vector” “Vcompress:The mask identifies which elements in the input vector should be included on the right side of the output vector, without losing their original relative position.” Spec: https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-permute

Now to specific implementation for Olympia:

The below implementation assumes the following:

“The vrgather instruction operates on active elements according to its indexing logic. The vta and vma flags ensure that software can be written to be agnostic of the physical contents of the inactive and tail portions of the destination vector registers, allowing implementations more flexibility in how they handle these elements (e.g., potentially using them for other purposes or leaving them in an undefined state).”

Below example is for lmul = 4 In the below example, since v4 is the destination based index register, while we need to use every source register to make up the destination value, index register remains constant for a particular destination register (v4 for v20, v5 for v21, v6 for v22 and v7 for v23)

Vrgather: (assuming the indexes are saved in v4,v5,v6,v7)
UOP 1:  vrgather.vv v20, v8,  v4 
UOP 2:  vrgather.vv v20, v9,  v4
UOP 3:  vrgather.vv v20, v10, v4
UOP 4:  vrgather.vv v20, v11, v4
UOP 5:  vrgather.vv v21, v8,  v5 
UOP 6:  vrgather.vv v21, v9,  v5
UOP 7:  vrgather.vv v21, v10, v5
UOP 8:  vrgather.vv v21, v11, v5      
UOP 9:  vrgather.vv v22, v8,  v6 
UOP 10: vrgather.vv v22, v9,  v6
UOP 11: vrgather.vv v22, v10, v6
UOP 12: vrgather.vv v22, v11, v6
UOP 13: vrgather.vv v23, v8,  v7 
UOP 14: vrgather.vv v23, v9,  v7
UOP 15: vrgather.vv v23, v10, v7
UOP 16: vrgather.vv v23, v11, v7

We dont need to use v20 as source register for above Uops 2 to 4 because based on vma and vta flags, each of those should update only the corresponding bytes based on index specified by v4 (like said above).

I would like your feedback on the above implementation.

Nov 23 '25 01:11 htanna-ventanamicro

Hi @htanna-ventanamicro, I apologize for my slow reply - I was out sick for about 10 days. Your proposal is great. It's functionally correct, which is the most important thing, but there is one problem I see.

First, we need to ask ourselves if Olympia should support partial register writes or only full register writes. Most modern OOO machines only support full register writes, because of the complexity of supporting partial writes for large register files. If we wanted to support partial register writes for the vector registers, the vector register file would need to support writes to each element of a vector register (8 to 64 bits depending on the SEW) which adds a lot of wires and could add extra cycles of latency to writeback.

Your solution requires partial register writes to be supported. In your example, the first uop may need to write an element from v8 into v20 depending on the index values in v4. To support full register writes, you will just need to add v20 as a source. So the value written to the v20 destination will come from any elements read from v8 combined with the original v20 source.

UOP 1:  vrgather.vv v20, v8,  v4, v20

What I like about what you've proposed is that every uop is completely independent. This means that the permutation unit itself will be very simple and we can afford to have it in multiple execution pipelines. I was thinking we should try to reduce the number of uops required, but I don't think that is feasible with this design. With the change to support full register writes, each uop here has 3 sources and 1 destination. If masking is enabled, that's another source register that has to be added. The number of sources and destinations can determine how many read and writes ports to the register file an execution needs to support. I think 4 read ports is already a lot, and if we were to add any more destinations, we would also need to add another source register for the uops to remain independent. If you want to try it, you could add a parameter to increase the number of sources and destinations allowed to reduce the uop count. With 5 sources and 2 destination you can reduce the number of uops required by half.

It would be nice if you could put together some documentation/diagrams that show the thoughput that can be achieved with your design. At first glance, the large number of uops may seem like it would be bad for performance but if the execution latency of each uop is low (1-2 cycles?) and we can support the permutation instructions in multiple execution pipelines, I think the end-to-end latency for the entire instruction will be very reasonable.

Dec 02 '25 19:12 kathlenemagnus

@kathlenemagnus thanks for your response, it is very helpful.

I am going to reiterate somethings that you said in your response and ask questions for other things:

Partial vs full write case - response (not a question) Ya we discussed this briefly about adding v20 as a source but was unsure if it was ok to implement, and I think I saw some uops with 3 sources (besides mask) after I posted for some other vector instructions already implemented, so I am onboard with the idea and thanks for explaining, so the above structure will change as follows:

Vrgather: (assuming the indexes are saved in v4,v5,v6,v7)
UOP 1:  vrgather.vv v20, v8,  v4, v20 
UOP 2:  vrgather.vv v20, v9,  v4, v20 
UOP 3:  vrgather.vv v20, v10, v4, v20 
UOP 4:  vrgather.vv v20, v11, v4, v20 
UOP 5:  vrgather.vv v21, v8,  v5, v21 
UOP 6:  vrgather.vv v21, v9,  v5, v21
UOP 7:  vrgather.vv v21, v10, v5, v21
UOP 8:  vrgather.vv v21, v11, v5, v21     
UOP 9:  vrgather.vv v22, v8,  v6, v22
UOP 10: vrgather.vv v22, v9,  v6, v22
UOP 11: vrgather.vv v22, v10, v6, v22
UOP 12: vrgather.vv v22, v11, v6, v22
UOP 13: vrgather.vv v23, v8,  v7, v23
UOP 14: vrgather.vv v23, v9,  v7, v23
UOP 15: vrgather.vv v23, v10, v7, v23
UOP 16: vrgather.vv v23, v11, v7, v23

Every Uop is completely independent (Question): So if we have 16 Uops like mentioned above in point#1, and if we divide them in 4 different groups based on destination, then these 4 groups are independent of each other but with each group, there are uops which have to follow real data dependency.

To elaborate, Uop1-Uop4 have data dependency because of v20 as one of the sources as well, this constitute as 1 group and will have to wait for v20 to be ready after previous write(even in case when Reg-Renaming is enabled for vector registers). This group will be independent of the group Uop5-Uop8 and they both can be executed in parallel if we have 2 vector execution pipelines.

Question here is that each Uop isn't completely independent but group of Uops are, if I understood it correctly?

Latency diagrams (comment and question) I would briefly describe the latency for uops based on number of execution units, for example if we have 2 execution units for vector, if we consider 4 sources including mask and so each uop has an execution latency of 2 cycles, each group will take 8 cycles, so in total it will take 32/2 = 16 cycles to execute the instruction in case of lmul4. If we increase the execution units to 4, the latency of each uop is still 2 cycle and we have to execute each group with 8 cycles but now we can drop the total execution to 32/4 = 8 cycles. (Groups are as described in above point#2). 3a. can you please let me know how detailed latency diagram do you want? Do you have any examples? 3b. Is it ok if I start implementing what we discussed in point#1 and will come up with latency diagrams along the way because we kind of need these permute instructions 3c. I think adding a 5th source might get a little complicated, I would chalk it upto as an enhancement which can be done at later time, I don't mind working on it once the basic implementation is done, is that ok?

Dec 02 '25 23:12 htanna-ventanamicro

I found one of the vector latency diagrams (I guess?) from original implementation: https://github.com/riscv-software-src/riscv-perf-model/discussions/89#discussioncomment-9324963

is this what you are looking for?

Dec 03 '25 00:12 htanna-ventanamicro