[Performance]: Fall back to looping for small bulk array transfers (potentially via AVE machinery)
The Array View Elision (AVE) optimization added in https://github.com/chapel-lang/chapel/pull/24390 includes a Short Array Transfer (SAT) optimization that fires for AVE'd assignments below a certain size. When this happens, a for-loop is used to execute the assignment rather than a bulk transfer operation.
There are cases where this optimization is not firing but potentially should be. Specifically, it does not fire for small array slice assignments like the following:
use BlockDist;
config const N = 2;  // assumed value; the issue leaves N unspecified (needs N >= 2)
var a: [blockDist.createDomain({1..1024, 1..N})] int;
a[.., 2] = a[.., 1];
The AVE optimization in general is not firing for non-default-rectangular arrays, which prevents SAT from firing. Additionally, the slice assignment will eventually perform several default-rectangular-to-default-rectangular (DR->DR) transfers, one for each locale, and these are also not subject to the SAT optimization. This issue suggests that the conditions for applying SAT should be relaxed to catch cases like the above distributed slice assignment.
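For illustration, the element-wise form that such a small distributed assignment could fall back to might look roughly like the sketch below. This is hand-written user-level code, not what the compiler or runtime actually generates; the value N = 2, the name dstCol, and the initialization loop are assumptions added only to make the example self-contained.

```chapel
use BlockDist;

config const N = 2;  // assumed value; must be at least 2 for this example

var a: [blockDist.createDomain({1..1024, 1..N})] int;

// Fill the array with recognizable values (illustrative only).
forall (i, j) in a.domain do
  a[i, j] = i * 10 + j;

// Loop-based equivalent of `a[.., 2] = a[.., 1]`:
// iterate over the destination column so each write runs on the
// locale that owns the destination element.
const dstCol = a.domain[a.domain.dim(0), 2..2];
forall (i, j) in dstCol do
  a[i, j] = a[i, 1];

writeln(a[5, 2] == a[5, 1]);  // prints "true"
```

Whether a loop like this beats the per-locale bulk transfers depends on the transfer size and on where the source and destination columns live, which is what the threshold check is meant to decide.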
Moreover, when a slice assignment involves strided accesses, there is more overhead involved in executing a bulk transfer, and thus the transfer-size threshold for SAT should be higher (i.e., a strided bulk transfer only makes sense when the transfer is very large). Thus, it could be beneficial to have separate transfer-size thresholds for the strided and non-strided cases of SAT.
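A minimal sketch of what separate thresholds could look like is given below. The names contigThreshold, stridedThreshold, and useLoopForTransfer are hypothetical, and the numeric values are placeholders rather than measured or existing settings; the real SAT check in Chapel's internal modules may be structured quite differently.

```chapel
// Hypothetical thresholds, expressed in number of elements.
// The values are placeholders, not tuned or taken from Chapel itself.
config const contigThreshold = 64;
config const stridedThreshold = 1024;

// Returns true when a plain loop is expected to be cheaper than a
// bulk transfer, using a higher cutoff for strided transfers.
proc useLoopForTransfer(numElems: int, isStrided: bool): bool {
  const threshold = if isStrided then stridedThreshold
                                 else contigThreshold;
  return numElems < threshold;
}

writeln(useLoopForTransfer(32, false));   // true: small contiguous copy
writeln(useLoopForTransfer(512, false));  // false: large enough for bulk
writeln(useLoopForTransfer(512, true));   // true: strided, still below cutoff
```

Keeping the two cutoffs as separate config constants would let each be tuned independently from the command line while experimenting.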