blis Create a valid Neoverse N1 target.

This PR adds a valid Arm Neoverse N1 compilation target using Armv8 kernels. It creates the appropriate registry information and can autodetect an N1 CPU.

Mar 14 '22 19:03 everton1984

Having a clear interface and arch detection makes sense indeed, however without proper tuning, mergers/reviewers might not see this as a priority. Just guessing.

"The establishment" here. @everton1984 thanks for your work but @egaudry is pretty much right; it is best to have specifically-tuned block sizes and/or kernels with performance numbers before creating a new sub-configuration. Otherwise it is just easier to use the thunderx2 subconfig directly. I'll ask Jeff Diamond on the status of the tuned N1 parameters since that code may still be in the clutches of Oracle's lawyers.

Apr 07 '22 15:04 devinamatthews

@devinamatthews Thanks for answering. No problem, that makes sense. I can generate the parameters; I just wanted to know, before trying something ad hoc, whether there is a defined procedure for obtaining them.

Apr 07 '22 15:04 everton1984

The block sizes can, to some extent, be determined analytically; see https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf. A basic non-analytical strategy is:

1. Run a series of problems with m=MR, n=NR, and increasing k. Note that you will need to use a row- or column-major C matrix as preferred by the microkernel. Plot performance vs. k; the optimal KC should be:
   a. The peak of the plot if the curve is sharply peaked.
   b. The smallest value such that good performance is achieved if the plot has a large plateau.

2. Run a series of problems with n=NR, k=KC, and increasing m (you might want to try different transpose options for A as well). As before, the optimal MC is either the peak or the smallest value that gives good performance.

3. The value of NC doesn't usually affect performance much, but you can also try a similar procedure as for KC and MC. Note that NC should in general be fairly large compared to MC.

4. Confirm performance for large square matrices and tweak as necessary. Finding the best threading parameters is another challenge which perhaps I can describe separately if you're interested.
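
The "peak, or smallest value that gives good performance" rule in the steps above could be automated roughly as follows. This is only a sketch and not part of BLIS: the function name `pick_block_size` and the 5% tolerance `plateau_tol` are assumptions made for illustration, and the measurements are expected to come from your own benchmark sweep.

```python
def pick_block_size(sizes, perf, plateau_tol=0.05):
    """Pick a block size from a performance sweep.

    sizes       -- candidate values of k (or m) that were benchmarked
    perf        -- measured performance for each candidate (e.g. GFLOPS)
    plateau_tol -- hypothetical tolerance: a point counts as "good" if it
                   is within this fraction of the best measurement

    Returns the smallest candidate whose performance is within the
    tolerance of the peak. For a sharply peaked curve only the peak
    qualifies, so the peak itself is returned; for a plateau the
    smallest value on the plateau is returned.
    """
    if not sizes or len(sizes) != len(perf):
        raise ValueError("sizes and perf must be non-empty and equal in length")

    threshold = (1.0 - plateau_tol) * max(perf)
    # Walk the candidates in increasing size and stop at the first good one.
    for size, value in sorted(zip(sizes, perf)):
        if value >= threshold:
            return size
```

The tolerance is a judgment call: a looser value favors smaller blocks, a tighter one stays closer to the measured peak.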

Final note: The block sizes must satisfy MC%MR == 0 and NC%NR == 0. If possible, it doesn't hurt to make all three cache block sizes multiples of both MR and NR, unless this choice is too restrictive. It may also help to avoid large powers of 2.
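
As a quick sanity check on candidate values, the divisibility constraints above can be tested with a few lines of Python. This is a hedged sketch rather than anything BLIS provides: the helper names and the example MR/NR values in the comments are hypothetical, and rounding down to the least common multiple of MR and NR is just one way to satisfy the "multiples of both" suggestion.

```python
import math


def satisfies_constraints(mc, nc, mr, nr):
    """Required conditions: MC % MR == 0 and NC % NR == 0."""
    return mc % mr == 0 and nc % nr == 0


def round_to_register_blocks(mc, kc, nc, mr, nr):
    """Round MC, KC and NC down to multiples of both MR and NR.

    Uses lcm(MR, NR) as the step, so every result is divisible by both
    register block sizes. Values smaller than the step are raised to it.
    """
    step = mr * nr // math.gcd(mr, nr)  # least common multiple
    return tuple(max(step, (value // step) * step) for value in (mc, kc, nc))


def is_power_of_two(value):
    """True for 1, 2, 4, 8, ...; large powers of 2 may be worth avoiding."""
    return value > 0 and value & (value - 1) == 0


# Hypothetical register block sizes and measured optima, for illustration only.
mc, kc, nc = round_to_register_blocks(130, 500, 4100, mr=8, nr=12)
print(mc, kc, nc, satisfies_constraints(mc, nc, 8, 12))
```

If the rounded values move far from the measured optima, that is the "too restrictive" case, and enforcing only the two required conditions is the fallback.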

Apr 07 '22 16:04 devinamatthews

@devinamatthews Thanks a lot! Let me find the correct parameters then.

Apr 07 '22 16:04 everton1984