Question on padding for qfunctions with blocked backends
Hello libceed developers,
Thank you for the library; it's very impressive and extremely enabling for Palace, where we're using it for almost all of our linear algebra at this stage. Given that, we've been investigating some data structure optimizations and have some questions about how the padded backends work:
- Where is the padding for a blocked backend defined, and how does the "padded data" get computed from the valid data, whilst not being accumulated/used for calculations in the output vector?
- How is this padding affected by the choice of stride pattern a user provides? Are users allowed to specify strides that aren't CEED_STRIDES_BACKEND in CeedElemRestrictionCreateStrided and still work with blocked backends?
- Is changing the data storage pattern for a passive vector from "all of the i-th component, then all of the (i+1)-th, etc." to "all components for quad point j, then all for quad point j+1" a bad idea for a non-cpu backend?
Long version:
We have QFunctions that use a passive vector where we have stored the geometric factor data, computed in advance, along with two additional pieces of scalar data. We do this as there are many kernels that reuse the geometric factor data, so we have traded storage for time. This data is stored as a QVector, given there are nelem * nquad_per_elem * n_component (= 11) pieces of data. An issue we've noticed through some cpu profiling, however, is that for the CEED_STRIDES_BACKEND the 11 components of the data are being stored with a stride of Q in the data vector, which for a large enough Q (88 for /cpu/self/ref/blocked for instance) is causing a cache miss for each value of the data drawn (this was all found using vtune, so we're moderately convinced it's a real effect). Namely all the 0-th component terms come first, then all the 1-th component, then all the 2-th etc, meaning we get ~10 L3 cache misses per quadrature point (resulting in a lot of DRAM calls).
Given this, we wanted to try changing the data storage to be coalesced by components, I'm not sure of the terminology for this within libCEED, but basically have all 11 components of data that are needed adjacent in the data vector, rather than all of the first components, then all of the second components etc. (thinking about the data vector as a matrix Q x 11, this amounts to row major vs column major storage). Thus a load for the first data component would almost certainly pull all the data for the given quadrature point into the cache in one go, and we'd no longer need a DRAM access (which is what has been observed with vtune).
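In code terms, the two access patterns I'm comparing look something like this (an illustrative sketch with made-up names, not our actual QFunction code):

#include <ceed/types.h>  // CeedInt, CeedScalar

// Illustrative sketch only (made-up names): reading the 11 qdata components for each
// quadrature point under the two layouts of the Q x 11 block.
// qd_comp_major: all of component 0, then all of component 1, ... (components Q apart)
// qd_qpt_major:  the 11 components of qpoint 0, then qpoint 1, ... (components adjacent)
static void read_qdata_layouts(CeedInt Q, const CeedScalar *qd_comp_major, const CeedScalar *qd_qpt_major,
                               CeedScalar *sums) {
  const CeedInt num_comp = 11;
  for (CeedInt i = 0; i < Q; i++) {
    CeedScalar sum = 0.0;
    for (CeedInt c = 0; c < num_comp; c++) {
      sum += qd_comp_major[c * Q + i];        // current layout: 11 loads spread Q apart -> ~11 cache lines
      sum += qd_qpt_major[i * num_comp + c];  // proposed layout: 11 adjacent loads -> 1-2 cache lines
    }
    sums[i] = sum;
  }
}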
Changing the strides from CEED_STRIDES_BACKEND to
CeedInt strides[3] = {
geom_data_size, // Stride between nodes
1, // Stride between components
geom_data_size * num_qpts // Stride between elements
};
and our writing into the storage vector gives the expected results for the non-blocked backends, for a range of polynomial orders and various different QFunctions which all store this data, so I am fairly convinced we are populating the qdata vector appropriately (though there's some question of what happens for blocking in the final block, which I'll get to).
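For concreteness, the restriction creation then looks roughly like this (a sketch with our sizes substituted in; the error-checking wrapper from mesh.cpp is omitted, and the call signature is from the public libCEED API):

#include <ceed.h>

// Sketch of the strided restriction for the qdata vector (sizes/names illustrative).
// With these strides the 11 components of each quadrature point are adjacent in memory.
static void CreateQDataRestriction(Ceed ceed, CeedInt num_elem, CeedInt num_qpts, CeedElemRestriction *restr) {
  const CeedInt geom_data_size = 11;  // 2 scalars + 9 entries of the 3x3 adjugate
  const CeedInt strides[3]     = {geom_data_size,              // stride between quadrature points
                                  1,                           // stride between components
                                  geom_data_size * num_qpts};  // stride between elements
  CeedElemRestrictionCreateStrided(ceed, num_elem, num_qpts, geom_data_size,
                                   (CeedSize)num_elem * num_qpts * geom_data_size, strides, restr);
}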
The mesh I am testing with has 14362 elements with 11 quad points per element (tets), making 157982 total quadrature points (so 157982 * 11 entries in the qvector). Printing the Q = 88 fed into the QFunction by /cpu/self/ref/blocked, for instance, rem(157982, 88) = 22, meaning that on the last block of 88 quadrature points my mental model is that there are only 22 valid quadrature points and the rest are padded with data that shouldn't be accumulated (I'm surmising this, I've not been able to disentangle it from the source so far). Presumably this is made possible by filling the active vector with zero for these quadrature points, making the padded data irrelevant at the cost of a few flops (but crucially no branch prediction required).
When I write out the contents of in[0] (the passive vector we are storing the data in) to the terminal as a Q x 11 matrix (each column being one of the 11 components of data I want), I see there are 22 rows with a valid 0-column entry (in this particular example it is easy to see whether these entries are valid: they should be 1.000e+00, while the other values are all smaller floating point values). The data for the first few rows looks like
[0-th row] 1.000e+00, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.066e+01, -2.607e+02, -2.607e+02,
...
[2-th row] 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, -4.300e+01, -2.118e+01, -2.118e+01, -2.118e+01,
...
[8-th row] 1.000e+00, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.066e+01, -2.607e+02,
...
[11-th row] 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00*, -4.300e+01, -2.118e+01, -2.118e+01, -2.118e+01,
I've skipped the rows that are less interesting, but 2 things jump out here: 1) 1.00 gets repeated a lot on the (8 * i + 2)-th rows, and 2) 2.642e-09 a lot on the (8 * i)-th rows. From this it seems like there's some copying of valid q point data to create "padded" q point data, but the copying of these into the block obeys a stride pattern that isn't the one provided when assembling the qvector originally, or in the element restriction used to access it later; I think it corresponds to the CEED_STRIDES_BACKEND default, which I believe maps to {1, geom_data_size, geom_data_size * num_qpts} in this case. If I write the data in the other stride pattern, there are whole rows with negative values repeated, things like:
-2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02,
-2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02,
-2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02,
does a negative value indicate the value was padded in?
How can we alter this padding behaviour to respect the stride pattern, or is this a fundamental limitation such that one can't work with a custom stride pattern for a data vector if blocked backends are desired? We've been looking through https://libceed.org/en/latest/ and the examples, but to no avail (searching things like "block"); is there any documentation on how the blocked backends transform relative to the non-blocked backends?
Appendices:
- https://github.com/awslabs/palace/blob/main/palace/fem/qfunctions/33/hcurl_33_qf.h an example QFunction; adjJt is the matrix variable.
- https://github.com/awslabs/palace/blob/main/palace/fem/qfunctions/33/geom_33_qf.h the existing writing of the quadrature point data to the vector for later access.
- https://github.com/awslabs/palace/blob/main/palace/fem/qfunctions/33/utils_33_qf.h#L39 unpacking the adjugate data into a stack variable; note the stride is Q, where we're changing it to effectively being 1 (and skipping this copying entirely).
- https://github.com/awslabs/palace/blob/main/palace/fem/mesh.cpp#L180-L189 calls to ceed with the default stride (wrapped in a helper for checking error codes, which can be ignored).
The CPU blocked backends pack 8 quadrature points together (the nth point from a batch of 8 separate elements) to facilitate vectorization of the instructions. If you are using the blocked backends, this pattern is deeply baked into how libCEED processes the quadrature points in calls to your QFunction code, so it cannot be bypassed. Currently, the only way to avoid this is to use the CPU serial backends instead.
In answer to question 1, if the final batch of 8 elements contains fewer than 8 elements, then the data from the final element is replicated to fill out the block. This is to prevent false NaNs being generated or other issues that may trip up debuggers.
In answer to question 2, this internal ordering is totally independent from any strides the user decides to use. CEED_STRIDES_BACKEND is the indication that the data stored in the CeedVector will only be used by libCEED and the backend is free to order the data however it wants. If the user provides strides, then the external representation of the data (outside of any calls to a CeedOperator or CeedQFunction) will be in the order given by the user but the internal representation (when libCEED calls the user QFunction code for example) will use the ordering the backend needs.
If any vector associated with a strided restriction is not stored with CEED_STRIDES_BACKEND, then when using this data the backend will make a copy of this vector for internal use in its preferred ordering. If you are using /cpu/self/[opt, avx, xsmm]/blocked, then this will be a batch of 8 elements interleaved, and if you are using /cpu/self/[opt, avx, xsmm]/serial, then this will be one element at a time. In either case though, the data coming into the quadrature point function is still expected to be interleaved with quadrature points as the fast index, to match how data comes out of the tensor contractions from the basis application.
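So from the QFunction's point of view, a strided input always looks like this (schematic sketch, not any specific QFunction):

#include <ceed/types.h>

// Schematic sketch: whatever strides the user chose for the CeedVector outside the operator,
// inside the QFunction each field arrives with the quadrature index i as the fast (unit-stride)
// index and the components separated by Q, i.e. component c of field f is in[f][c * Q + i].
CEED_QFUNCTION(apply_sketch)(void *ctx, CeedInt Q, const CeedScalar *const *in, CeedScalar *const *out) {
  const CeedScalar *qd = in[0];  // passive qdata with (say) 11 components
  const CeedScalar *u  = in[1];  // active input
  CeedScalar       *v  = out[0];

  CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
    v[i] = qd[0 * Q + i] * u[i];  // first qdata component for quadrature point i
  }
  return 0;
}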
Note - if performance is important, I would not use /cpu/self/ref/* backends and only use /cpu/self/[opt, avx, xsmm]/* backends. Ref uses more memory and isn't as aggressively optimized.
I think the answer to question 3, as libCEED is currently designed, is 'yes'?
Using a mix of data that is contiguous by quadrature point and contiguous by component would, I think, be worse for memory access patterns, so to do this properly, I think the way to go would be to make new CPU backends that totally restructure how data is organized and processed to make the data for individual quadrature points contiguous (perhaps /cpu/self/*/coalesced?). I don't know how much work that would take and it's not needed for any of our current funded projects. I'm not opposed to the idea at first look, but I have to respect the priorities of the projects that fund me.
I know how I would do this on the CPU side of the codebase, but I don't see an obvious way to do this on the GPU side of the codebase. On the GPU side, every thread is managing a single quadrature point at a time.
I'm not sure if all that made sense, so please let me know where I can clarify!
the 11 components of the data are being stored with a stride of Q in the data vector, which for a large enough Q (88 for /cpu/self/ref/blocked for instance), is causing a cache miss for each value of the data drawn (this was all found using vtune, so we're moderately convinced it's a real effect). Namely all the 0-th component terms come first, then all the 1-th component, then all the 2-th etc, meaning we get ~10 L3 cache misses per quadrature point (resulting in a lot of DRAM calls).
My first thought when reading this was that something isn't being optimized correctly for vectorization. The CPU (theoretically, in my mental model of vectorized instructions) shouldn't need to do that form of memory access if the memory is being used by SIMD instructions. That kind of memory access makes me think that either:
a. the compiler correctly emits SIMD instructions, but is repacking the data for some reason (this component-major ordering is what should be fed into a SIMD instruction), or
b. the compiler isn't turning the QFunction for loop into SIMD instructions, in which case having quadrature-point-major ordering for better cache coalescence will lead to better performance.
IDK if vtune has useful utilities for determining vectorization of the quadrature-point for loop, but that's something I would double check.
Edit: Maybe to fill in gaps, but properly vectorized code is going to be working on multiple quadrature points simultaneously. Thus, the vector registers will be filled with same-component-different-qpoint data. The CeedVectors are sorted so that no data repacking is necessary. The fact that the code is needing to grab data in the manner you describe doesn't align with how the "optimal" path should operate.
Hi @jeremylt,
Thank you for the rapid reply! I suspected this might be tangled into some ceed assumptions, hence opening the issue, and it sounds like my guess was correct. In general we are using /cpu/self/xsmm/blocked; I was only using the different ref and serial backends in attempting to understand what was going on underneath, hoping to remove the impacts of what I presumed were optimizations. I also didn't mean to suggest any extra work on your part or the ceed project, I was only hoping to improve my understanding. The profiling data that began this all came from the cpu libxsmm blocked backend.
On the data organization front, the data necessary at each quadrature point is a) the u_{0,1,2}, v_{0,1,2} of trial and test from the actual state vectors for which CEED_STRIDES_BACKEND feels absolutely correct (these are nedelec, raviart-thomas, H1 basis function coefficients, with lots of basis evaluation), and b) the 11 doubles representing our "passive vector" data, the two scalars and then the 9 components of the 3x3 matrix. The data pattern I'm thinking of is that the 11 doubles be adjacent in the Q vector, rather than separated by stride of Q in the data vector passed in. This second set of data requires no basis evaluation, and each entry is only in one element.
In the second flow diagram from here, this data is a bit like rho_E = rho_Q being a 3x3 matrix, and storing those components adjacent to each other, and additionally rho_L = rho_E because the field is like that from a Discontinuous Galerkin solution (no inter element sharing at all, all internal quadrature points).
In either case though, the data coming into the quadrature point function is still expected to be interleaved with quadrature points as the fast index, to match how data comes out of the tensor contractions from the basis application.
For this data there's no basis that's evaluated, and if I understand the restriction correctly, there's only 1 element per entry in the vector, as it's already quadrature point data. So per quadrature point, the data is pulled from 11 different locations in the data vector, separated by Q (x[i_quad:Q:end] in python-esque notation), or chosen from 11 contiguous locations with no separation (what I'm trying to achieve) (x[i_quad*n_comp:i_quad*n_comp+n_comp] in python-esque notation).
In answer to question 2, this internal ordering is totally independent from any strides the user decides to use.
CEED_STRIDES_BACKENDis the indication that the data stored in the CeedVector will only be used by libCEED and the backend is free to order the data however it wants. If the user provides strides, then the external representation of the data (outside of any calls to a CeedOperator or CeedQFunction) will be in the order given by the user but the internal representation (when libCEED calls the user QFunction code for example) will use the ordering the backend needs.
So for a vector that represents the data at the quadrature points already, namely nelem * ncomponents * nquadpoints in length, assembled from another QFunction, the striding is effectively irrelevant once within a kernel as it will be copied to another vector that always obeys CEED_STRIDES_BACKEND.
On the GPU side, every thread is managing a single quadrature point at a time.
The data stored in the data vector would be separated out by quadrature point, so each gpu thread would take 1 non-overlapping subset of the data vector to get its data. As opposed to right now taking 11 non-overlapping single entries in the data vector. I believe this data organization ought to improve a gpu kernel, as all the data for a given quadrature point is adjacent to each other, rather than spread out by a stride of Q through out the passive vector.
Thanks for the reply @jrwrigh,
My first thought when reading this was that something isn't being optimized correctly for vectorization. The CPU (theoretically, in my mental model of vectorized instructions) shouldn't need to do that form of memory access if the memory is being used by SIMD instructions. That kind of memory access makes me think that either: a. the compiler correctly emits SIMD instructions, but is repacking the data for some reason (this component-major ordering is what should be fed into a SIMD instruction), or b. the compiler isn't turning the QFunction for loop into SIMD instructions, in which case having quadrature-point-major ordering for better cache coalescence will lead to better performance.
The data from the Q strides is being copied into a stack variable, see MatUnpack33 on L21, which is defined here. If I understand you correctly, this coalescing into a stack variable might be preventing the SIMD instructions? I could, for instance, rewrite the matrix vector product that uses the 3x3 variable to use Q strides directly on the original data pointer (at the cost of making it a bit harder to read, but if that's going to boost performance I can definitely do it).
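Something along these lines, for instance (a rough sketch of the stride-Q variant, not the actual Palace helpers; it assumes the same packing MatUnpack33 reads, entry (j, k) at adj_Jt[i + Q*(j + 3*k)]):

#include <ceed/types.h>

// Rough sketch (not the actual Palace helpers): apply y = adj(J)^T x for quadrature point i,
// reading the 9 matrix entries in place with stride Q instead of unpacking them into a
// stack array first.
CEED_QFUNCTION_HELPER void MultAdjJtX33Strided(const CeedScalar *adj_Jt, const CeedInt Q, const CeedInt i,
                                               const CeedScalar x[3], CeedScalar y[3]) {
  for (CeedInt k = 0; k < 3; k++) {
    y[k] = 0.0;
    for (CeedInt j = 0; j < 3; j++) y[k] += adj_Jt[i + Q * (j + 3 * k)] * x[j];
  }
}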
@jrwrigh , this is what we've been seeing with VTune. This was built using -O2. Although compiling with -O3 yields nearly identical total execution times, it doesn't provide the detailed stack trace that we see here.
If I understand you correctly, this coalescing into a stack variable might be preventing the SIMD instructions?
Maybe, but there are a few things I'd try first. We do these operations ourselves in HONEE and Ratel, so this isn't an obvious "mistake" in my mind and is something I would also like to not have to work around. The CPU compiler should be able to see the vectorizable parts of the loop regardless of the meddling; we set up the CEED_QFUNCTION and CEED_QFUNCTION_HELPER macros to force aggressive inlining (see #1188 and #1048 respectively), at which point any compiler worth its salt should be able to see that the data reorganization doesn't impact the actual numerical operations. There are a few things I'd double check first:
- Verify that the CEED_QFUNCTION and CEED_QFUNCTION_HELPER macros are set correctly for the compiler you're using.
  - These are set in /include/ceed/types.h based on compiler macro definitions. They should work, but perhaps there's an edge case there (e.g. I'm not sure these have been tested on the newest clang-based intel compilers yet, perhaps they use a different macro definition and/or attribute flag).
- Perhaps more restrict qualifiers are a good idea to ease any correctness concerns the compiler may have (a sketch follows after this list).
  - Even if "any compiler worth its salt should be able to see that the data reorganization doesn't impact the actual numerical operations", it won't make the optimization if it can't guarantee correctness.
- Double check if the computation in the for loop is being vectorized properly.
  - If the unpacked data isn't used in any vectorized instructions, the compiler will then have to fall back to the "normal" for loop methodology, at which point the memory access pattern you're seeing will occur.
  - There are some extra flags beyond -O3 that can allow for vectorization that otherwise wouldn't occur (e.g. -march=native -ffp-contract=fast -fopenmp-simd -funsafe-math-optimizations, not including architecture-specific flags).
  - I generally don't recommend it, but -Ofast would at least be an interesting comparison (I don't recommend it normally as it breaks a ton of "normal" floating point behavior, such as isnan(), but it'll help remove the IEEE FP behavior that could prevent vectorization).
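As a sketch of the restrict suggestion (a hypothetical helper mirroring the shape of MatUnpack33, not copied from Palace):

#include <ceed/types.h>

// Hypothetical helper mirroring the shape of MatUnpack33: restrict tells the compiler the
// strided source and the stack-allocated destination cannot alias, which removes one
// excuse for refusing to vectorize the copy or the surrounding loop.
CEED_QFUNCTION_HELPER void MatUnpack33Restrict(const CeedScalar *restrict A, const CeedInt A_stride,
                                               CeedScalar *restrict A_loc) {
  CeedPragmaSIMD for (CeedInt k = 0; k < 9; k++) A_loc[k] = A[k * A_stride];
}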
@laylagi Oof, yeah. That's not good at all. If y'all can, pinpoint which QFunctions those MatUnpack33 calls are coming from and see if the assembly of those functions lacks any obvious vectorization; if they are fully vectorized (or close to it), then we're probably barking up the wrong tree.
Yea, I definitely understand that there is no basis involved. We heavily use passive vectors in a lot of our work.
The blocked backends are intended to encourage vector instructions in the compiler (worth checking if vectorization is failing, as that is a common cause of poor CPU performance for a new QFunction for us).
The user can specify how their own data is handled but cannot change internal ordering of the backend.
We don't want the QFunction code to have to use multiple data access indexing patterns, which is why we adopted this indexing pattern for QFunctions. I think it would break the abstraction of the QFunction being independent of compilation target if sometimes the data was contiguous and sometimes not.
Note, the data for each quadrature point is contiguous on the GPU gen backends (our recommended backends for best performance) as each thread only has its own data, but it's a very different internal design.
The CPU code is firmly designed to handle 8 quadrature points via SIMD instructions. I think we would need to significantly reorganize the internals of a CPU backend to get it to operate on only one quadrature point at a time with contiguous data in a way that doesn't break the abstraction between GPU and CPU. It's doable but I'd want to think about the design first to try to make sure we aren't abandoning SIMD performance gains.
@laylagi Oof, yeah. That's not good at all. If y'all can, pinpoint which QFunctions those MatUnpack33 calls are coming from and see if the assembly of those functions lacks any obvious vectorization; if they are fully vectorized (or close to it), then we're probably barking up the wrong tree.
The first two are called from this QFunction. Actually, none of the helper functions in this QFunction are vectorized. So, I think you are right. The culprit is probably here. I wonder if it's the way the input data is defined. In Ratel and HONEE each input variable is defined in a distinct field. But that shouldn't interfere with vectorization, right? Well, I can try accessing each of them inside the quadrature loop and see how that goes!
Double check if the computation in the for loop is being vectorized properly. If the unpacked data isn't used in any vectorized instructions, the compiler will then have to fall back to the "normal" for loop methodology, at which point the memory access pattern you're seeing will occur. There are some extra flags beyond -O3 that can allow for vectorization that otherwise wouldn't occur (e.g. -march=native -ffp-contract=fast -fopenmp-simd -funsafe-math-optimizations, not including architecture-specific flags)
I've included all of these flags, except for -funsafe-math-optimizations. I can try adding that too.
Also, adding __restrict__ in MatUnpack33 didn't help with the performance.
One "hack" that could also help this particular use case would be going from blocking by 8 elements to blocking by 4. @jeremylt is there any particular reason 8 was chosen? A reduction to, say, 4 would help alleviate the cache misses while still allowing for optimal memory access patterns if and when a compiler can properly vectorize code.
Modern CPUs have 8 lanes
I initially had that thought too, but I'm not sure that's what counts here? Like if we're talking about AVX-512, it can handle 8 doubles in each register. But, for the QFunction for loop, that would be 8 quadrature points per AVX register, not 8 elements. Is the idea just to keep the number of quadrature points divisible by 8?
Yes, we interlace 8 elements so the number of quadrature points is a multiple of 8 and we use all 8 lanes
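Schematically, that means something like the following (illustrative indexing only, my paraphrase rather than anything lifted from the backend source):

#include <ceed/types.h>

// Illustrative sketch of the blocked layout: a block interleaves block_size = 8 elements
// with the element index fastest, so for 11 qpts/element the QFunction sees Q = 8 * 11 = 88,
// and a load of 8 consecutive CeedScalars picks up the same (component, qpoint) across all 8 elements.
static CeedInt BlockedIndex(CeedInt comp, CeedInt qpt, CeedInt elem_in_block) {
  const CeedInt block_size = 8, num_qpts = 11, Q = block_size * num_qpts;  // Q = 88
  return comp * Q + qpt * block_size + elem_in_block;
}

For the 14362-element mesh above that works out to 1795 full blocks plus a final block with only 2 valid elements (22 valid quadrature points), consistent with the rem(157982, 88) = 22 in the original post.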
I can't run vtune profiling setup on my local mac laptop so can't show equivalent images to that from @laylagi above, but running with the various vectorization report directives from clang 21.1.5 (-march=native -ffp-contract=fast -fopenmp-simd -funsafe-math-optimizations -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -O3 -g) I get this:
hcurl_33_qf.h:17:18: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
17 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++)
| ^
hcurl_33_qf.h:17:18: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]
which I believe means vectorization is occurring (though with a width of 4, I guess there are 4 lanes for an m1?). There are variations on the same comment from every other kernel which uses the functions.
Running a gcc build on an arm instance with the vectorization reports, it vectorizes with variable length vectors (good), but does report some difficulties:
hcurl_33_qf.h:21:16: optimized: Inlining void MultAtBCx33(const CeedScalar*, const CeedScalar*, const CeedScalar*, const CeedScalar*, CeedScalar*)/9812 into int f_apply_hcurl_33(void*, CeedInt, const CeedScalar* const*, CeedScalar* const*)/9818 (always_inline).
hcurl_33_qf.h:20:16: optimized: Inlining void MatUnpack33(const CeedScalar*, CeedInt, CeedScalar*)/9810 into int f_apply_hcurl_33(void*, CeedInt, const CeedScalar* const*, CeedScalar* const*)/9818 (always_inline).
hcurl_33_qf.h:19:17: optimized: Inlining void CoeffUnpack3(const CeedIntScalar*, CeedInt, CeedScalar*)/9797 into int f_apply_hcurl_33(void*, CeedInt, const CeedScalar* const*, CeedScalar* const*)/9818 (always_inline).
hcurl_33_qf.h:15:36: optimized: unswitching loop 1 on ‘if’ with condition: _129 > 0
hcurl_33_qf.h:15:36: note: optimized sizes estimated to 0 (true) and 0 (false) from original size 137
hcurl_33_qf.h:15:36: optimized: loop vectorized using variable length vectors
hcurl_33_qf.h:15:36: optimized: loop vectorized using variable length vectors
hcurl_33_qf.h:9:1: note: vectorized 2 loops in function.
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_596.1602_633 = .MASK_GATHER_LOAD (_632, vect__131.1601_631, 1, { 0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_596.1602_633 = .MASK_GATHER_LOAD (_632, vect__131.1601_631, 1, { 0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_597.1605_638 = .MASK_GATHER_LOAD (_637, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_598.1606_641 = .MASK_GATHER_LOAD (_640, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_599.1607_644 = .MASK_GATHER_LOAD (_643, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_600.1608_647 = .MASK_GATHER_LOAD (_646, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_601.1609_650 = .MASK_GATHER_LOAD (_649, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_602.1610_653 = .MASK_GATHER_LOAD (_652, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_603.1611_656 = .MASK_GATHER_LOAD (_655, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_604.1612_659 = .MASK_GATHER_LOAD (_658, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: missed: statement clobbers memory: vect_patt_605.1613_662 = .MASK_GATHER_LOAD (_661, vect__507.1604_636, 8, { 0.0, ... }, loop_mask_610);
hcurl_33_qf.h:9:1: note: ***** Analysis failed with vector mode VNx2DI
hcurl_33_qf.h:9:1: note: ***** The result for vector mode VNx16QI would be the same
hcurl_33_qf.h:9:1: note: ***** Re-trying analysis with vector mode V16QI
Running with my guess on changing the stride on the passive vector though, it reports failure to vectorize even with variable length vectors. So missing the VNx2DI isn't ideal, but failing to get even variable-length vectorization is more problematic. The small amount I'm reading about VNx2DI suggests it's probably the mask_gather_load that's preventing it (which would be helped by contiguous memory rather than needing the strides), but I guess that's the price to pay for the generalization of the libceed qfunctions. Better to get vectorization across backends with only one kernel implementation than a proliferation of specializations.
My hypothesis on why contiguous data isn't good fits your description @jeremylt: because the stride wouldn't match the one used in the u and v accesses, the same instruction couldn't just be shifted across the data; the shift would be different for each data array coming in. As things stand, even though the data are all spaced out in the vector, because every access is strided equally between loop iterations, the compiler sees straight through it and vectorizes.
It's unfortunate that "statement clobbers memory", but maybe that's not really a problem, as ultimately it is vectorizing (in fact hoisting the particular statement that causes it out of the loop for a particular example doesn't seem to help anyway, maybe a ~3% gain, not worth the complexity). Given the overall structure of libceed's treatment of components, I think my original idea is just a bad one unfortunately, or at least would require a pretty significant alteration to how a vector field's basis is evaluated (i.e. array of structures rather than structure of arrays as currently), which is likely a very bad idea.
I think there may be a few sharp edges lying around that I'm not anticipating off the top of my head, but we may be able to easily enable the ability to do blocked with a block size of 4 instead of 8 to see if that helps the situation. (This just occurred to me)
enable the ability to do blocked with a blocksize of 4 instead of 8 to see if that helps the situation.
The decision on 4 vs 8 could also be made at compile time based on the system architecture. Or maybe set a default value based on system architecture and then have it as a runtime switch.
A runtime switch wouldn't quite work with how this is baked into libCEED. I think we'd probably want to compile both 4 and 8 (not that many functions, but they are very performance critical)
I can't run vtune profiling setup on my local mac laptop
Do you see the same poor performance on your laptop? I assume this is true, but just wanting to make sure. There could be differences between how GCC vs Clang vs Intel and ARM vs x86 at play here.
Yeah, I'd assume that the runtime switch would effectively be different backends, but transparent to the user.
Or it could be explicitly different backends and we use the same "most performant version is default" logic we normally use (i.e. /gpu/cuda uses /gpu/cuda/gen by default). So have /cpu/self/*/blocked/{4,8} and have the default for /cpu/self/*/blocked be determined at compile-time by architecture information.
But, for the QFunction for loop, that would be 8 quadrature points per AVX register, not 8 elements. Is the idea just to keep the number of quadrature points divisible by 8?
Padding the number of quadrature points wouldn't be that bad for qfunctions, but choosing 8 elements ensures that all tensor contractions will have an inner dimension that is a multiple of 8. (Indeed, the innermost dimension is over the elements for blocked backends.) As mentioned, we could reduce it to 4, I think trivially in blocked and without much work in avx. It might be worth testing to see if that helps anywhere before bothering to expose it in the interface.
We could also issue prefetch instructions that might better amortize those strides. The stride of 8*11 (704 B) is not large; it can be compared with 8*27 (1728 B) and 8*64 (4096 B) for quadratic and cubic hexes. But L2 cache is still big enough unless you have a very large number of components.
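A prefetch in the quadrature loop could look roughly like this (a sketch using the GCC/Clang __builtin_prefetch builtin, not something the backends emit today):

#include <ceed/types.h>

// Sketch only (GCC/Clang __builtin_prefetch): request the cache lines for the next quadrature
// point's strided qdata while working on the current one. Prefetch is only a hint, so a
// possibly past-the-end address at the last iteration is harmless in practice.
CEED_QFUNCTION_HELPER void PrefetchQData(const CeedScalar *qd, CeedInt Q, CeedInt i, CeedInt num_comp) {
  for (CeedInt c = 0; c < num_comp; c++) __builtin_prefetch(&qd[c * Q + i + 1], /* rw = read */ 0, /* locality */ 2);
}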
Maybe a little extra context for our usage: we are dominated by tetrahedra in practice, due to the difficulties of hex meshing complex geometries, so we rarely get the wins from tensor contractions. I'm not sure if that impacts your reasoning here. If there were a test we could run on first compiling to pick between 4 and 8 (assuming there is a measurable improvement), that's something we'd definitely perform, as in practice our users seem to be roughly divided between x86/arm64 cluster users (for whom 8 is probably great) and high end mac users (for whom it seems 4 might help).
Do you see the same poor performance on your laptop? I assume this is true, but just wanting to make sure. There could be differences between how GCC vs Clang vs Intel and ARM vs x86 at play here.
I didn't mean to give the impression that the performance is poor; libceed is doing great. The vtune results were gcc + x86, which suggested we had some opportunities and originally inspired the questions, but trying this on some other machines (including my mac) is making me think that vtune is intervening to make things worse somehow. My thinking at this stage is that mixing the data stride between sources is pretty poisonous for the performance: even though the Q stride is not ideal from a cache-locality perspective, having the same stride across all accesses, and it working with the basis evaluation, wins through.
A note for clarity - internally all of our basis operations are framed as tensor contractions, as mathematically a matrix vector product is a special case of a tensor contraction, so even the non-tensor elements use the TensorContraction object internally
There could be differences between how GCC vs Clang vs Intel and ARM vs x86 at play here.
With intel compilers and optimization flags -xHost -ffp-contract=fast -fopenmp-simd -funsafe-math-optimizations -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -O3 -g, vectorization is not forced. Is this expected?
hcurlhdiv_32_qf.h:37:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
37 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++)
^
It would depend on the loop body. For example, if it calls functions that don't vectorize (we did the series work with log1p because it doesn't vectorize in common implementations) or has branches that are unsafe/hard for the compiler to transform into masked vector arithmetic.
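For reference, the series trick is roughly this (a sketch from the identity log1p(x) = 2*atanh(x/(x+2)), not the exact Ratel/HONEE code):

#include <ceed/types.h>

// Sketch, not the exact Ratel/HONEE code: evaluate log1p via log1p(x) = 2*atanh(x/(x + 2))
// and a few terms of the atanh series. Accurate for small |x|, and it is straight-line
// multiply/add code, so it vectorizes where a call to the libm log1p would not.
CEED_QFUNCTION_HELPER CeedScalar Log1pSeriesSketch(CeedScalar x) {
  const CeedScalar y  = x / (2. + x);
  const CeedScalar y2 = y * y;
  return 2. * (y + y * y2 / 3. + y * y2 * y2 / 5.);
}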
I forgot to mention that these for-loops are being vectorized with GCC and Clang; that's why I'm confused. Am I missing an optimization flag?
Which Intel compiler? (i.e. the latest clang-based ones, or the "classic" old ones)
If you're determining whether or not code was vectorized just by the error message, note that the Intel compiler complains louder about non-vectorized loops when you ask for vectorization (e.g. CeedPragmaSIMD). So it could be that none of the compilers are fully vectorizing those loops, but Intel is the one that's actually telling you.
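For example, on a build where CeedPragmaSIMD ends up as an OpenMP SIMD pragma (an assumption on my part; the real mapping lives in include/ceed/types.h), the behaviour looks like:

// Assumed mapping only (check include/ceed/types.h for what libCEED actually uses):
// with -fopenmp-simd an "omp simd" pragma is a vectorization *request*, and the
// clang-based compilers emit a -Wpass-failed warning when they cannot honour it.
#include <stddef.h>

void scale(double *restrict y, const double *restrict x, double a, size_t n) {
#pragma omp simd
  for (size_t i = 0; i < n; i++) y[i] = a * x[i];  // icx warns here if this loop fails to vectorize
}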
Which Intel compiler? (i.e. the latest clang-based ones, or the "classic" old ones)
The classic ones!
Interesting! So, when gcc says:
hcurl_22_qf.h:16:36: optimized: loop vectorized using 32 byte vectors
hcurl_22_qf.h:16:36: optimized: loop vectorized using 16 byte vectors
hcurl_22_qf.h:16:36: optimized: loop vectorized using 32 byte vectors
hcurl_22_qf.h:16:36: optimized: loop vectorized using 16 byte vectors
hcurl_22_qf.h:10:1: note: vectorized 2 loops in function.
(Edit: hmm, there is only one quadrature loop in this function)
and intel says:
hcurl_22_qf.h:16:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
16 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++)
We should believe intel?
I think it can sometimes be more straightforward to detect what is causing vectorization issues by looking at the assembly and seeing which parts of the code did not generate vector instructions (prefixed with v)
In particular, the v*pd ("packed double") instructions, which I expect GCC is indeed emitting. I like to use Linux perf and then look at the hot instructions. If those are v*pd, then vectorization has been effective. What does upstream clang do on your kernels? Is there a reason you are using legacy icc?
Yes, I do see v*pd instructions with GCC.
Is there a reason you are using legacy icc?
Sorry, I misunderstood. I am definitely using icx and icpx.
I used the same optimization flags to compile both libCEED and Ratel with Intel compilers and am seeing similar warnings. One thing I notice in common between libCEED and Palace is that the 1D QFunctions are successfully vectorized.
libCEED with icx
$ CC=icx CXX=icpx make OPT="-xHost -ffp-contract=fast -fopenmp-simd -funsafe-math-optimizations -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -O3 -g" -B
make: 'lib' with optional backends: /cpu/self/memcheck/serial /cpu/self/memcheck/blocked
CC build/interface/ceed-basis.o
CC build/interface/ceed-config.o
CC build/interface/ceed-elemrestriction.o
CC build/interface/ceed-fortran.o
CC build/interface/ceed-jit-source-root-default.o
CC build/interface/ceed-jit-tools.o
CC build/interface/ceed-operator.o
CC build/interface/ceed-preconditioning.o
CC build/interface/ceed-qfunction-register.o
CC build/interface/ceed-qfunction.o
CC build/interface/ceed-qfunctioncontext.o
CC build/interface/ceed-register.o
CC build/interface/ceed-tensor.o
CC build/interface/ceed-types.o
CC build/interface/ceed-vector.o
CC build/interface/ceed.o
CC build/gallery/identity/ceed-identity.o
CC build/gallery/mass-vector/ceed-vectormassapply.o
CC build/gallery/mass/ceed-mass1dbuild.o
CC build/gallery/mass/ceed-mass2dbuild.o
CC build/gallery/mass/ceed-mass3dbuild.o
CC build/gallery/mass/ceed-massapply.o
CC build/gallery/poisson-vector/ceed-vectorpoisson1dapply.o
CC build/gallery/poisson-vector/ceed-vectorpoisson2dapply.o
In file included from /data/home/lgh/projects/libCEED/gallery/poisson-vector/ceed-vectorpoisson2dapply.c:10:
./include/ceed/jit-source/gallery/ceed-vectorpoisson2dapply.h:23:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
23 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
| ^
1 warning generated.
CC build/gallery/poisson-vector/ceed-vectorpoisson3dapply.o
In file included from /data/home/lgh/projects/libCEED/gallery/poisson-vector/ceed-vectorpoisson3dapply.c:10:
./include/ceed/jit-source/gallery/ceed-vectorpoisson3dapply.h:23:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
23 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
| ^
1 warning generated.
CC build/gallery/poisson/ceed-poisson1dapply.o
CC build/gallery/poisson/ceed-poisson1dbuild.o
CC build/gallery/poisson/ceed-poisson2dapply.o
In file included from /data/home/lgh/projects/libCEED/gallery/poisson/ceed-poisson2dapply.c:10:
./include/ceed/jit-source/gallery/ceed-poisson2dapply.h:23:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
23 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
| ^
1 warning generated.
CC build/gallery/poisson/ceed-poisson2dbuild.o
In file included from /data/home/lgh/projects/libCEED/gallery/poisson/ceed-poisson2dbuild.c:10:
./include/ceed/jit-source/gallery/ceed-poisson2dbuild.h:23:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
23 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
| ^
1 warning generated.
CC build/gallery/poisson/ceed-poisson3dapply.o
In file included from /data/home/lgh/projects/libCEED/gallery/poisson/ceed-poisson3dapply.c:10:
./include/ceed/jit-source/gallery/ceed-poisson3dapply.h:23:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
23 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
| ^
1 warning generated.
CC build/gallery/poisson/ceed-poisson3dbuild.o
In file included from /data/home/lgh/projects/libCEED/gallery/poisson/ceed-poisson3dbuild.c:10:
./include/ceed/jit-source/gallery/ceed-poisson3dbuild.h:24:18: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
24 | CeedPragmaSIMD for (CeedInt i = 0; i < Q; i++) {
| ^
1 warning generated.
CC build/gallery/scale/ceed-scale.o
CC build/backends/ref/ceed-ref-basis.o
CC build/backends/ref/ceed-ref-operator.o
CC build/backends/ref/ceed-ref-qfunction.o
CC build/backends/ref/ceed-ref-qfunctioncontext.o
CC build/backends/ref/ceed-ref-restriction.o
/data/home/lgh/projects/libCEED/backends/ref/ceed-ref-restriction.c:195:26: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
195 | CeedPragmaSIMD for (CeedSize j = 0; j < CeedIntMin(block_size, num_elem - e); j++) {
| ^
/data/home/lgh/projects/libCEED/backends/ref/ceed-ref-restriction.c:195:26: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/data/home/lgh/projects/libCEED/backends/ref/ceed-ref-restriction.c:209:26: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
209 | CeedPragmaSIMD for (CeedSize j = 0; j < CeedIntMin(block_size, num_elem - e); j++) {
| ^
/data/home/lgh/projects/libCEED/backends/ref/ceed-ref-restriction.c:209:26: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/data/home/lgh/projects/libCEED/backends/ref/ceed-ref-restriction.c:69:22: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
69 | CeedPragmaSIMD for (CeedSize i = 0; i < elem_size * block_size; i++) {
| ^
/data/home/lgh/projects/libCEED/backends/ref/ceed-ref-restriction.c:87:22: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
87 | CeedPragmaSIMD for (CeedSize i = 0; i < elem_size * block_size; i++) {
| ^
6 warnings generated.
CC build/backends/ref/ceed-ref-tensor.o
CC build/backends/ref/ceed-ref-vector.o
CC build/backends/ref/ceed-ref.o
CC build/backends/blocked/ceed-blocked-operator.o
CC build/backends/blocked/ceed-blocked.o
CC build/backends/opt/ceed-opt-blocked.o
CC build/backends/opt/ceed-opt-operator.o
CC build/backends/opt/ceed-opt-serial.o
CC build/backends/opt/ceed-opt-tensor.o
CC build/backends/memcheck/ceed-memcheck-blocked.o
CC build/backends/memcheck/ceed-memcheck-qfunction.o
CC build/backends/memcheck/ceed-memcheck-qfunctioncontext.o
CC build/backends/memcheck/ceed-memcheck-restriction.o
/data/home/lgh/projects/libCEED/backends/memcheck/ceed-memcheck-restriction.c:208:24: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
208 | CeedPragmaSIMD for (CeedSize j = 0; j < CeedIntMin(block_size, num_elem - e); j++) {
| ^
/data/home/lgh/projects/libCEED/backends/memcheck/ceed-memcheck-restriction.c:208:24: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/data/home/lgh/projects/libCEED/backends/memcheck/ceed-memcheck-restriction.c:82:22: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
82 | CeedPragmaSIMD for (CeedSize i = 0; i < elem_size * block_size; i++) {
| ^
/data/home/lgh/projects/libCEED/backends/memcheck/ceed-memcheck-restriction.c:100:22: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
100 | CeedPragmaSIMD for (CeedSize i = 0; i < elem_size * block_size; i++) {
| ^
4 warnings generated.
CC build/backends/memcheck/ceed-memcheck-serial.o
CC build/backends/memcheck/ceed-memcheck-vector.o
CC build/backends/ceed-backend-weak.o
CC build/gallery/ceed-gallery-weak.o
LINK lib/libceed.so
/include/ceed/jit-source/gallery/ceed-vectorpoisson1dapply.h compiles without any warnings, while the 2D and 3D cases emit the same "loop not vectorized" warning.
Edit: I've been building Palace with Clang on my mac laptop. I'll get back to you regarding Clang when I find a way to profile it with perf.