easyaspi314

132 comments by easyaspi314

> Is it possible to break this PR into smaller stages?

Sure, I can do that. Sorry I adhd'd again 😵‍💫 I think that the isolated changes that are...

The first thing I'll do is find the minimum number of `#pragma GCC optimize("-O3")` or `__attribute__((optimize("-O3")))` annotations needed to shut up GCC.
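For reference, here is a minimal sketch of the two GCC mechanisms mentioned above: the per-function attribute, and the pragma scoped with `push_options`/`pop_options`. The macro `FORCE_O3` and the functions `sum_squares_attr` and `sum_squares_pragma` are hypothetical names made up for illustration; they are not taken from the code under discussion. The guard assumes only GCC honors these annotations, so other compilers get a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: only real GCC honors optimize(); other compilers get a no-op. */
#if defined(__GNUC__) && !defined(__clang__)
#  define FORCE_O3 __attribute__((optimize("-O3")))
#  define HAVE_GCC_OPTIMIZE 1
#else
#  define FORCE_O3
#  define HAVE_GCC_OPTIMIZE 0
#endif

/* Variant 1: the attribute affects exactly one function. */
FORCE_O3
static uint64_t sum_squares_attr(const uint32_t *data, size_t len)
{
    uint64_t acc = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        acc += (uint64_t)data[i] * data[i];
    }
    return acc;
}

/* Variant 2: the pragma affects every function defined until pop_options. */
#if HAVE_GCC_OPTIMIZE
#  pragma GCC push_options
#  pragma GCC optimize("-O3")
#endif
static uint64_t sum_squares_pragma(const uint32_t *data, size_t len)
{
    uint64_t acc = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        acc += (uint64_t)data[i] * data[i];
    }
    return acc;
}
#if HAVE_GCC_OPTIMIZE
#  pragma GCC pop_options
#endif
```

The attribute form keeps the annotation next to the function it applies to, which makes it easier to count how many are really needed; the pragma form is handier when a whole group of adjacent functions needs the same treatment.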

I have a rough draft of the hashLong refactor that is designed to be dispatched, but `update()` performance is pretty bad due to the function call latency (since consumeStripes() would...
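To illustrate the kind of cost being described, here is a rough sketch of function-pointer dispatch in a streaming `update()` path. It is not the actual draft: `STRIPE_LEN`, `consume_fn`, `consume_scalar`, `pick_consume`, `hash_state`, `state_init`, and `state_update` are all hypothetical names, and the "hash" is just a byte sum so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIPE_LEN 64 /* assumed stripe size, for illustration only */

/* Signature shared by every backend of the (hypothetical) stripe consumer. */
typedef void (*consume_fn)(uint64_t *acc, const uint8_t *input, size_t nb_stripes);

/* Portable backend: a stand-in that just sums the bytes of each stripe. */
static void consume_scalar(uint64_t *acc, const uint8_t *input, size_t nb_stripes)
{
    for (size_t s = 0; s < nb_stripes; s++) {
        for (size_t i = 0; i < STRIPE_LEN; i++) {
            *acc += input[s * STRIPE_LEN + i];
        }
    }
}

/* A real dispatcher would probe CPU features here; this one has one choice. */
static consume_fn pick_consume(void)
{
    return consume_scalar;
}

typedef struct {
    uint64_t   acc;
    consume_fn consume; /* chosen once, at init time */
} hash_state;

static void state_init(hash_state *st)
{
    st->acc = 0;
    st->consume = pick_consume();
}

/* Each update() pays one indirect call that the compiler cannot inline,
 * so small updates are dominated by call overhead rather than hashing. */
static void state_update(hash_state *st, const uint8_t *input, size_t len)
{
    assert(len % STRIPE_LEN == 0); /* sketch only handles whole stripes */
    st->consume(&st->acc, input, len / STRIPE_LEN);
}
```

In a layout like this, passing as many stripes as possible per indirect call is the usual way to amortize the overhead; whether that is feasible depends on how the real streaming state buffers its input.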

> As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality. > I should go back and finish that. I kinda got distracted by another...

> > I should go back and finish that. I kinda got distracted by another project 😅 > > Ideally, I would like to publish release `v0.8.2` in the very...

From a small amount of digging, c7g (AWS Graviton 3) seems to be based on the Neoverse V1, which is SVE-256. Looking at the [optimization guide](https://developer.arm.com/documentation/pjdoc466751330-9685/latest/):

- Interleaving the two...

> So your guess is that `SVE` performance on Graviton3 could have been better,
> but it's a matter of correctly optimizing for this architecture (manually or via compiler).

Yes....

As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer, I haven't tested this and my ordering might not be ideal. ```asm 10:...

As a side note, I wonder whether it would be beneficial to just use NEON on SVE-128.

> > As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer, I haven't tested this. > > Unfortunately, my access to this...