dynamorio
dynamorio copied to clipboard
Add encoder/decoder support for Arm's Scalable Vector Extension
Armv8-a's SVE is a big extension requiring a few changes to the AArch64 backend. At first, I think we need new register types for scalable vector (Z) and predicate (P) registers as well as support for the new SVE encodings.
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
Xref #2626
Currently we can handle 10 SVE opcodes (out of 314) in the codec. Now that we are preparing to implement the rest of SVE, the notion of a variable vector register length in DynamoRIO needs to be addressed. Specifically, how will clients and tools like DrMemory and drcachesim work correctly on SVE hardware with different SVE vector lengths?
Currently we have OPSZ_SCALABLE
and OPSZ_SCALABLE_PRED
for SVE vector and predicate registers returned by reg_get_size()
. This needs to be converted to an OPSZ_
representing the SVE vector and predicate register size on the currently executing AArch64 implementation, i.e. one of:
OPSZ_128 OPSZ_256 OPSZ_384 OPSZ_512 OPSZ_640 OPSZ_768 OPSZ_896 OPSZ_1024 OPSZ_1152 OPSZ_1280 OPSZ_1408 OPSZ_1536 OPSZ_1664 OPSZ_1792 OPSZ_1920 OPSZ_2048
One approach would be for DynamoRIO to execute the RDVL
instruction (https://developer.arm.com/documentation/ddi0602/2022-06/SVE-Instructions/RDVL--Read-multiple-of-vector-register-size-to-scalar-register-) on startup so that when reg_get_size()
is called for SVE vector or predicate registers, the appropriate OPSZ_
is returned.
@derekbruening is this enough to cover all of DynamoRIO's clients/tools requirements? Is there another approach?
Having the actual register size in the operand in the IR at runtime does seem necessary but maybe not sufficient. Tools like DrMemory which are shadowing registers may want to know the maximum register sizes at initialization time to optimize their shadow allocations. Maybe an API routine like dr_mcontext_zmm_fields_valid() where a client can ask which parts of the mcontext are live for the current hardware, with some way to refer to the different simd vector sizes?
What about dr_simd_t
which it looks like is currently just 128 bits? Does it need to be increased to handle 2048 bits -- err, is your list from OPSZ_128 to OPSZ_2048 supposed to be /8 to get to bytes?
Thanks for the steer @derekbruening. Some more thoughts and questions...
What about
dr_simd_t
which it looks like is currently just 128 bits? Does it need to be increased to handle 2048 bits...
Yes. The low 128 bits of each SVE Z register overlap the corresponding Advanced SIMD (a.k.a NEON) registers.
Can we set the size of dr_simd_t
to the SVE maximum, 2048 bits, even though fewer bytes will be accessed most of the time? If so, this means we'll have to replace occurrences of sizeof(dr_simd_t)
with a runtime get size function for AArch64.
Maybe an API routine like
dr_mcontext_zmm_fields_valid()
where a client can ask which parts of the mcontext are live for the current hardware,
Is this a case of creating a new dr_mcontext_sve_fields_valid()
function which behaves like dr_mcontext_zmm_fields_valid()
?
with some way to refer to the different simd vector sizes?
Does the size have to be in the mcontext as well, for direct access from the context? Is get_reg_size()
at translation time insufficient?
err, is your list from OPSZ_128 to OPSZ_2048 supposed to be /8 to get to bytes?
Yes, twas just a quick brain dump of bit lengths!
Other thoughts:
We don't need the SVE equivalent of e.g.
define ZMM_ENABLED() (proc_avx512_enabled())
because all SVE h/w will have full SVE support in the OS.
In terms of testing, are the existing dynamorio and drmemory unit tests enough to guard against regression for such a size change? As part of the first patch I'll update tests/api/opnd-a64.c
and tests/client-interface/reg_size_test.dll.c
for real SVE h/w.
What about
dr_simd_t
which it looks like is currently just 128 bits? Does it need to be increased to handle 2048 bits...Yes. The low 128 bits of each SVE Z register overlap the corresponding Advanced SIMD (a.k.a NEON) registers. Can we set the size of
dr_simd_t
to the SVE maximum, 2048 bits, even though fewer bytes will be accessed most of the time?
One concern is that DR puts the mcontext on the stack in a number of places, and has a small stack size. 2048 bits == 256 bytes * 32 registers is 8K which is a lot on DR's small stacks. We've thought about heap-allocating in the past but use in signal handling complicates things. For comparison the x86 512-bit SIMD occupies 2K. I think we may have to move to heap allocation and special heap usage for signals if we add 8K.
If so, this means we'll have to replace occurrences of
sizeof(dr_simd_t)
with a runtime get size function for AArch64.
I think most copy routines already have logic to copy only certain SIMD fields so this should be doable. E.g., the x86 dr_mcontext_to_priv_mcontext has logic to copy just the ymm parts of the zmm fields.
Maybe an API routine like
dr_mcontext_zmm_fields_valid()
where a client can ask which parts of the mcontext are live for the current hardware,Is this a case of creating a new
dr_mcontext_sve_fields_valid()
function which behaves likedr_mcontext_zmm_fields_valid()
?
But also returning the size that's valid?
with some way to refer to the different simd vector sizes?
Does the size have to be in the mcontext as well, for direct access from the context? Is
get_reg_size()
at translation time insufficient?
A separate size query is probably ok.
In terms of testing, are the existing dynamorio and drmemory unit tests enough to guard against regression for such a size change? As part of the first patch I'll update
tests/api/opnd-a64.c
andtests/client-interface/reg_size_test.dll.c
for real SVE h/w.
Users must set the total struct size in the size field, so DR code would use that to handle both the old 128-bit and new larger sizes. One-off tests against binary clients built against the old would be one way to add confidence; a true regression test would probably require a duplicated struct def in the test or sthg.
What about
dr_simd_t
which it looks like is currently just 128 bits? Does it need to be increased to handle 2048 bits...Yes. The low 128 bits of each SVE Z register overlap the corresponding Advanced SIMD (a.k.a NEON) registers. Can we set the size of
dr_simd_t
to the SVE maximum, 2048 bits, even though fewer bytes will be accessed most of the time?One concern is that DR puts the mcontext on the stack in a number of places, and has a small stack size. 2048 bits == 256 bytes * 32 registers is 8K which is a lot on DR's small stacks. We've thought about heap-allocating in the past but use in signal handling complicates things. For comparison the x86 512-bit SIMD occupies 2K. I think we may have to move to heap allocation and special heap usage for signals if we add 8K.
In addition to the 32 vector registers (Z0-Z31), there are 16 predicate registers (P0-P15) and a first-fault register (FFR).
Current SVE hardware supports 128, 256 and 512 bit vectors. I suggest that we only implement for 512 bits max to start with, avoiding increasing stack size now. This will require just over 3K for Z, P and FFR registers. I think it's more efficient to get SVE patches rolling into the trunk now for 512 bits and address the larger vector and stack sizes as a separate issue later. Is that ok?
Maybe an API routine like
dr_mcontext_zmm_fields_valid()
where a client can ask which parts of the mcontext are live for the current hardware,Is this a case of creating a new
dr_mcontext_sve_fields_valid()
function which behaves likedr_mcontext_zmm_fields_valid()
?But also returning the size that's valid?
Yes.
What about
dr_simd_t
which it looks like is currently just 128 bits? Does it need to be increased to handle 2048 bits...Yes. The low 128 bits of each SVE Z register overlap the corresponding Advanced SIMD (a.k.a NEON) registers. Can we set the size of
dr_simd_t
to the SVE maximum, 2048 bits, even though fewer bytes will be accessed most of the time?One concern is that DR puts the mcontext on the stack in a number of places, and has a small stack size. 2048 bits == 256 bytes * 32 registers is 8K which is a lot on DR's small stacks. We've thought about heap-allocating in the past but use in signal handling complicates things. For comparison the x86 512-bit SIMD occupies 2K. I think we may have to move to heap allocation and special heap usage for signals if we add 8K.
In addition to the 32 vector registers (Z0-Z31), there are 16 predicate registers (P0-P15) and a first-fault register (FFR).
Current SVE hardware supports 128, 256 and 512 bit vectors. I suggest that we only implement for 512 bits max to start with, avoiding increasing stack size now. This will require just over 3K for Z, P and FFR registers. I think it's more efficient to get SVE patches rolling into the trunk now for 512 bits and address the larger vector and stack sizes as a separate issue later. Is that ok?
Yes, this sounds like a good plan to me.
Hi @derekbruening, @abhinav92003, do you know why the pushed a commit
messages from Fri Dec 16 (https://github.com/DynamoRIO/dynamorio/commit/8cf4b37796082cbb288d573225b30cae1ddd0cb0) onwards include @dolanzhao as the committer?
Hi @derekbruening, @abhinav92003, do you know why the
pushed a commit
messages from Fri Dec 16 (8cf4b37) onwards include @dolanzhao as the committer?
Hi @AssadHashmi, sorry for making you feel confused. The commits info issues you mentioned were caused when I merged the master branch to my working branch, and they will not affect the master branch. And I will clean these errors today. GitHub may delay removing the wrong info, and you will not see them in the future.
For vector element sizes: see discussion in #5681 and xref #5638 on adding element sizes for x86