blis icon indicating copy to clipboard operation
blis copied to clipboard

Add skew-symmetric BLAS operations

Open devinamatthews opened this issue 1 year ago • 21 comments

This PR adds a number of level-2 and level-3 skew-symmetric (and skew-hermitian) BLAS operations, defining the essential operations of a "Skew-BLAS" interface. These operations have been added as full 1st-class citizens of the BLIS API complete with testsuite and mixed-precision/mixed-domain support (level-3 only).

devinamatthews avatar Apr 26 '24 22:04 devinamatthews

@myeh01 @nick-knight @Aaron-Hutchinson can the SiFive team please review commit b986782? I had to delve into the RISC-V assembly there and I'm only ~80% sure I did it right.

devinamatthews avatar Apr 26 '24 23:04 devinamatthews

@fgvanzee again the Travis CI build failed to trigger...

devinamatthews avatar Apr 26 '24 23:04 devinamatthews

@fgvanzee again the Travis CI build failed to trigger...

I don't remember if there was anything we could do to fix it on our end. Might be that we just have to wait and then make a dummy commit to try to trigger again?

fgvanzee avatar Apr 26 '24 23:04 fgvanzee

I don't remember either but it is annoying

devinamatthews avatar Apr 26 '24 23:04 devinamatthews

There we go

devinamatthews avatar Apr 26 '24 23:04 devinamatthews

@devinamatthews Confirming I got your message. It looks like the register allocation in dotxaxpyf changed: thanks for taking a stab at it. I think @myeh01 wrote this one, I'll ask him to review it. (Michael, it looks like what's changed is the microkernel now has separate alpha values for the "dot" and "axpy" parts.)

nick-knight avatar Apr 26 '24 23:04 nick-knight

Travis CI failed for x280 so I guess I did do something wrong.

devinamatthews avatar Apr 27 '24 04:04 devinamatthews

Running the testsuite, it looks like s and d are passing while c and z are failing. For c and z, when I change the register allocation from fa4, fa5 to fa0, fa1 for alphaw, it passes the dotxaxpyf tests locally, so I think Devin modified the code correctly. My guess is the inline assembly I wrote is not generating the code I want it to (maybe the compiler is overwriting some floating-point registers in between asm blocks). I'll need some time to study the objdump to see what's going on.

myeh01 avatar Apr 27 '24 08:04 myeh01

After looking at the objdump, it looks like the compiler is using fa4 and fa5 in some branches involving floating point comparisons. To prevent this issue from reappearing, I would like to go back through all the inline assembly files I wrote and replace any manually allocated floating point (and integer) registers with C variables. I should be able to get it done this week. What would be the best way to get these changes merged into the repo? Should I submit a PR to this branch or master?

myeh01 avatar Apr 29 '24 10:04 myeh01

@myeh01 A quicker fix might be to add clobbers. We should be using these anyway whenever we use inline asm with explicit register allocation, whether X-, F-, or V-registers.

Going the other direction, I think we might be better off using generic C for all the scalar stuff, and only using inline asm for the vector stuff (when intrinsics don't suffice). I don't think we'll lose much in performance, and it would make the code much more maintainable and retargetable.

nick-knight avatar Apr 29 '24 17:04 nick-knight

@nick-knight I originally tried just adding the output register to the clobber list of any floating-point load, e.g.

__asm__(FLT_LOAD "fa5, %1(%0)" : : "r"(alphaw), "I"(FLT_SIZE) : "fa5");

But the compiler still uses fa5 for floating-point comparisons. Does the clobber only apply to that block of inline asm? Or maybe I did this incorrectly and should add clobbers in other places as well. There's also this, but then we would have to use C variables anyways.

Edit: Yeah, I think replacing the scalar stuff with generic C would be more robust. I started doing this for some parts of cdotxaxpyf and it wasn't too hard.

Edit 2: After looking through some of the code again, I think I would also like to replace some inline asm code with intrinsics.

myeh01 avatar Apr 29 '24 20:04 myeh01

Does the clobber only apply to that block of inline asm?

Correct. If you don't want the compiler to overwrite fa5 between this asm statement and a subsequent one, you'll have to merge them into the same statement. This tends to snowball, necessitating rewriting conditionals and loops in assembly, etc.

There's also this, but then we would have to use C variables anyways.

Correct.

I think replacing the scalar stuff with generic C would be more robust.

I agree.

Our code still has lurking risks related to our explicit allocation of V-registers: we are trusting the compiler not to generate any vector code between each pair of asm statements with a RAW dependence through a V-register. Hardening this will probably snowball in the manner described above, so I don't propose we attempt to tackle it here.

nick-knight avatar Apr 29 '24 20:04 nick-knight

@devinamatthews How would you like to proceed? There are a few short-term solutions we discussed above. Longer-term, I'd like to rewrite the inline assembly files to be more robust (probably using intrinsics where it won't significantly impact performance). Please let me know what I can do to help.

myeh01 avatar Apr 29 '24 21:04 myeh01

One perspective is that this ukernel interface change exposed a bug in SiFive's implementation of the legacy ukernel interface. To proceed, the defective ukernel implementation could be deleted in this PR --- reverting to a generic implementation --- and then reintroduced, upgraded and corrected, in a subsequent PR. This might be the cleanest way forward.

nick-knight avatar Apr 29 '24 22:04 nick-knight

I'm not in a huge rush to merge. If it takes say a month or less to fix it properly then I can wait. Otherwise yes we could revert to generic and fix later. This wouldn't require deleting anything, just commenting out the kernel registration.

devinamatthews avatar Apr 29 '24 22:04 devinamatthews

Mind if we sync up in a week or two? I'll start working on it this week and hopefully by then I'll have a sense of how much more time it would take.

myeh01 avatar Apr 30 '24 00:04 myeh01

Sure.

devinamatthews avatar Apr 30 '24 04:04 devinamatthews

@devinamatthews I'm steadily working through cleaning up all the kernels, but I don't think I'll be able to finish it in the next two weeks. I'm also trying to balance this work with other projects I need to work on, so it may take a few more weeks. It may be best to follow Nick's suggestion and temporarily disable the ~~sifive_x280 config~~ failing microkernel (sorry, misread Nick's comment form earlier) until the clean up is complete.

myeh01 avatar May 10 '24 23:05 myeh01

It may be best to follow Nick's suggestion and temporarily disable the sifive_x280 config until the clean up is complete.

I don't propose disabling the whole configuration, just removing the one ukernel that's causing issues. IIUC, this will cause BLIS to default to a generic/reference implementation.

nick-knight avatar May 10 '24 23:05 nick-knight

@myeh01 We've decided not to include this PR in the next release so there's not much time pressure.

devinamatthews avatar May 14 '24 19:05 devinamatthews