OSACA icon indicating copy to clipboard operation
OSACA copied to clipboard

Properly track the dependencies of the LOAD phase of instructions

Open pleroy opened this issue 11 months ago • 2 comments

Consider the following code snippet on Zen3, in the Intel syntax:

	shl	rax, 5
	subsd	xmm10, QWORD PTR [rax+r8]
	movsd	xmm7, QWORD PTR [rax+r8+16]

Prior to this PR, OSACA would generate a LOAD node to represent the memory reference of the subsd instruction, but would (incorrectly) assume that the LOAD was independent from every other instruction. This resulted in a graph like this:

bad_load

Note how the LOAD looks like it can be executed on the first day of creation, when in reality it depends on the computation of rax. Another way to look at it is that the subtraction above should not differ much from a sequence of movsd to load the memory location followed by subsd between registers.

With this PR the LOAD is correctly depending on the result of the shl, with significant changes to the critical path:

good_load

Concretely, when checking if an instruction reads a register we distinguish "normal" reads from reads that occur as part of evaluating a memory address. The latter is marked with the attribute "for_load" to indicate that its LOAD microinstruction is the part doing the read. This results in a dependency being added from the producer of the register to the LOAD node.

I am also encapsulating a bit the computations of the "virtual" line number of the LOAD to make the code easier to follow. This in turn entails minor changes to the structure of the code that displays the LOAD nodes.

NOTE:

This change reveals another issue, this one with the architecture definition files. For Cascade Lake, there is a definition for subsd reading from memory thus:

- name: SUBSD
  operands:
  - class: memory
    base: '*'
    offset: '*'
    index: '*'
    scale: '*'
  - class: register
    name: xmm
  latency: 4
  port_pressure: [[1, '015'], [1, '23'], [1, [2D, 3D]]]
  throughput: 0.5
  uops: 2                

While the pressure on ports 2 and 3 is correct (cf. Agner Fog), the latency is possibly incorrect in the case where the load cannot be executed early enough. In the example above, the latency should really be 8 cycles. For this reason, it seems preferable to never put entries for reading from memory in the tables, and always make the LOAD microoperation explicit.

Since revisiting the architecture definition files is out of scope for this PR, I am changing some tests to use Skylake, which doesn't have the problem.

pleroy avatar Mar 31 '25 21:03 pleroy

Hi @pleroy,

I am able to reproduce the case and I think indeed this is a bug and should be fixed. I also agree that we should only have "pure" load/stores in the DB, this is a flaw based on how this tool grew and in the beginning we did not handle these kinds of dependencies, but this is a different story.

I adjusted your test to create a LCD:

shl         rax, 5
subsd       xmm10, QWORD PTR [rax+r8]
cvtsd2si    rax, xmm10
movsd       xmm7, QWORD PTR [rax+r8+16]

Both with your PR and in the current version of main, the analysis looks like this (with --arch skx):

Combined Analysis Report
------------------------
                                     Port pressure in cycles
     |  0   - 0DV  |  1   |  2   -  2D  |  3   -  3D  |  4   |  5   |  6   |  7   ||  CP  | LCD  |
--------------------------------------------------------------------------------------------------
   1 | 0.00        |      |             |             |      |      | 1.00 |      ||      |  1.0 |   shl         rax, 5
   2 | 0.00        | 0.50 | 0.50   0.50 | 0.50   0.50 |      | 0.50 |      |      ||  4.0 |  4.0 |   subsd     xmm10, QWORD PTR [rax+r8]
   3 | 1.00        | 0.49 |             |             |      | 0.50 |      |      ||  7.0 |  7.0 |   cvtsd2si    rax, xmm10
   4 |             |      | 0.50   0.50 | 0.50   0.50 |      |      |      |      ||  4.0 |      |   movsd     xmm7, QWORD PTR [rax+r8+16]

       1.00          1.00   1.00   1.00   1.00   1.00          1.00   1.00             15   12.0




Loop-Carried Dependencies Analysis Report
-----------------------------------------
   1 | 12.0 | shl               rax, 5                  | [1, 2, 3]
   2 |  4.0 | subsd         xmm10, QWORD PTR [rax+r8] | [2]

You can see, the load latency is not identified as part of the LCD (and on top, the shl instruction should be part of the CP which it is for CSX, strange...). I think we need to find a more generic way to handle this in the sense of treating the intrinsic load as individual instruction (which I thought is already the case to be honest). I will be out of office for the next two weeks but will sit down and think about that as soon as I'm back, I'm happy to hear your ideas on where we need to change something in case you find time for this.

JanLJL avatar Apr 04 '25 09:04 JanLJL

Hi @JanLJL,

I am a bit confused, because when I try your example in my pleroy:Load2 branch (commit 5635d2d8dfbf53d25e1d60115e49f41143a26a3d) with the command:

python -m osaca --arch skx a.asm --export-graph a.dot --ignore-unknown > a.osaca

I get:

Combined Analysis Report
------------------------
                                     Port pressure in cycles                                     
     |  0   - 0DV  |  1   |  2   -  2D  |  3   -  3D  |  4   |  5   |  6   |  7   ||  CP  | LCD  |
--------------------------------------------------------------------------------------------------
   1 | 0.00        |      |             |             |      |      | 1.00 |      ||  1.0 |  1.0 |   shl         rax, 5
   2 | 0.00        | 0.50 | 0.50   0.50 | 0.50   0.50 |      | 0.50 |      |      ||  4.0 |  4.0 |   subsd       xmm10, QWORD PTR [rax+r8]
   3 | 1.00        | 0.49 |             |             |      | 0.50 |      |      ||  7.0 |  7.0 |   cvtsd2si    rax, xmm10
   4 |             |      | 0.50   0.50 | 0.50   0.50 |      |      |      |      ||  4.0 |      |   movsd       xmm7, QWORD PTR [rax+r8+16]

       1.00          1.00   1.00   1.00   1.00   1.00          1.00   1.00           20.0   16.0  




Loop-Carried Dependencies Analysis Report
-----------------------------------------
   1 | 16.0 | shl         rax, 5                  | [1, 1.875, 2, 3]
   2 |  4.0 | subsd       xmm10, QWORD PTR [rax+r8]| [2]

and the graph looks like this: a

Overall this looks fine to me.

This being said, now that I am looking at this a bit more, I am not really happy with the way that loads are handled (I didn't pay much attention to stores). It feels rather clunky, both in my PR and in preexisting code. In particular, there is at most one load node for a given instruction, which cannot be correct. In the following example:

shl         rax, 5
shl         rbx, 5
vfmadd132pd xmm10, QWORD PTR [rax+r8], QWORD PTR [rbx+r8]

you really need two load nodes.

I'm not sure that promoting loads to their own node is actually wise. Conceptually, the load comes from the MemoryOperand and maybe the latency and dependencies should be recorded there (although that would be a major change). I need to give it more thought, though.

pleroy avatar Apr 05 '25 10:04 pleroy

Hi @JanLJL, and sorry that I dropped the ball on this PR.

What needs to happen for this PR to be merged? As mentioned in my comment from Apr 5, I couldn't reproduce the problem you had with a LCD on skx. Can you help me reproduce it?

As to the more general question of whether this is the right approach to handling LOAD dependencies, that's an interesting question, but potentially a very complicated one, so I think there is still value to merging this bug fix.

pleroy avatar Jul 29 '25 18:07 pleroy

Hi @pleroy ,

first of all, I am terribly sorry that it took me so long to get back to you! I remembered this PR a few times over the last months but never found the time to solve the load issue on a global base. I think you are absolutely right and having this PR in there is definitely an improvement for now until I can clean up the DBs. I cannot reproduce the results I had the last time neither, I don't know what was going on there...

With regards to your second example, i.e.

shl         rax, 5
shl         rbx, 5
vfmadd132pd xmm10, QWORD PTR [rax+r8], QWORD PTR [rbx+r8]

the FMA instruction only exists with one memory access, so it would be rather a

shl         rax, 5
shl         rbx, 5
vmovupd     xmm1, QWORD PTR [rax+r8]
vfmadd132pd xmm10, xmm1, QWORD PTR [rbx+r8]

but you can see that the CP is still not correct due to the way we calculate the latency with implicit loads but this is independent from the PR. We are currently finishing work on a new benchmarking tool that allows us to measure latencies operand-wise instead of instruction-wise and we plan to implement this into OSACA for more accurate dependency modeling.

Anyways, thank you so much for your patience!

JanLJL avatar Aug 12 '25 14:08 JanLJL