Field G. Van Zee
@stepannassyr Please give 9bb23e6 (which is currently the head of the `dev` branch) a try and let me know if any tweaks are required before merging into `master`.
Thanks, @stepannassyr. Please keep us updated.
We have no plans at this time. But I say that literally. You may or may not have already noticed this, but some of these compound ("fused") level-2 operations, such...
> The increase in cache coherency traffic can be offset by the savings of sharing B.

Can you elaborate on the savings you're referring to here? The regime I envision...
Let's assume `ic_nt` = 4 and `jc_nt` = 2. This results in two packed panels of B being created. Each panel of B would be shared across 4 threads.
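To make the arithmetic in that example concrete, here is a minimal sketch of the thread grouping. It assumes a simplified model in which one packed panel of B exists per `jc` thread group and every thread nested inside that group shares it. The helper names `b_panel_sharing` and `panel_groups` are hypothetical, as is the thread-id numbering (jc group outermost); none of this is BLIS code.

```python
def b_panel_sharing(jc_nt, ic_nt, jr_nt=1, ir_nt=1):
    """Hypothetical helper: count packed B panels and the threads sharing each.

    Assumption (simplified model): one packed panel of B per jc thread group,
    shared by all threads nested inside that group.
    """
    total_threads = jc_nt * ic_nt * jr_nt * ir_nt
    num_b_panels = jc_nt
    threads_per_panel = total_threads // jc_nt
    return total_threads, num_b_panels, threads_per_panel


def panel_groups(jc_nt, ic_nt):
    """Map each B panel index to the thread ids that share it.

    Assumption: thread ids are numbered with the jc group outermost,
    so threads 0..ic_nt-1 belong to panel 0, and so on.
    """
    groups = {}
    for tid in range(jc_nt * ic_nt):
        groups.setdefault(tid // ic_nt, []).append(tid)
    return groups


total, panels, sharers = b_panel_sharing(jc_nt=2, ic_nt=4)
print(total, panels, sharers)  # 8 2 4
print(panel_groups(2, 4))      # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

With `jc_nt` = 2 and `ic_nt` = 4, the sketch reports 8 threads, two packed panels of B, and 4 threads per panel, matching the numbers above.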
I guess what I don't quite follow is this perceived benefit of threads "sharing" B.
FWIW, @dnparikh already has performance data for `trsm` on ThunderX2 (which has a private L2 cache) that shows a *big* difference between pushing all parallelism to the jr loop vs....
> Yes I imagine relying on cache coherency performance on ARM is a problem.

Can you elaborate?
I couldn't gather any meaningful inferences from the data I collected on my Haswell workstation. (Not enough cores to play with.) Maybe I'm willing to punt on this issue for...
I understand. This issue was always about changing the default values to values that would work as the best starting point.