API for getting lane width of x86 SIMD operands

Open khuey opened this issue 3 years ago • 4 comments

I have an application that tracks the flow of data through instructions. It would be useful for this to know the lane width of SIMD operands. This data already exists in source form in DynamoRIO (e.g. Xvd vs Xvs entries for operand size in the decoder table) but gets compiled out (because Xvd and Xvs are #defined to the same thing). It's not obvious where to put this: adding variations of the operand size (e.g. OPSZ_16_vex32_ps and OPSZ_16_vex32_pd) would be kind of disruptive. Any thoughts?

khuey, Sep 01 '22 18:09

For data flow tools (taint tracking, etc.), the full semantics of each complex SIMD opcode is also needed. Dr. Memory has to use dedicated code for each opcode to mirror its data movements: https://github.com/DynamoRIO/drmemory/blob/master/drmemory/slowpath_x86.c#L1110

For the x86 DR opcodes, does the lane width vary for the same opcode? I remember for aarch32 we split the opcodes so you could tell the lane from the opcode. For aarch64 I'm not sure of the status (@AssadHashmi may know more): SVE is in-progress; not as familiar with NEON aarch64 in DR, but I do recall a number of cases where tools would prefer DR's aarch64 decoder to split opcodes up further than it does.

If the lane width is fixed for each opcode, and you'd still need to dispatch on opcode for the data movement semantics, is the width helping much? I'm not saying DR shouldn't provide it -- just trying to get a bigger picture for the use cases. Some kind of data-movement-by-opcode library might be the ultimate solution?

derekbruening, Sep 02 '22 17:09

I'm not aware of any x86 opcodes that have multiple lane widths, although I wouldn't rule out Intel having snuck something in in the depths of AVX-512 without doing a thorough search. There are some weird things like the broadcast and permute instructions but most SIMD instructions don't allow any crossover between lanes. You'll see there's no code for e.g. vaddps in that drmemory code you linked.

That said, simply encoding the lane widths along with the opcodes in my application is something I've considered. It's certainly not hard to do.

khuey, Sep 02 '22 17:09

One thing that is annoying about doing this separately is that there are opcodes where the lane widths differ between operands (e.g. cvtps2pd).

khuey, Sep 02 '22 23:09
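
Editorial sketch of the application-side approach discussed above: a table keyed by opcode that records a lane width per operand, so that an opcode like cvtps2pd can carry a 4-byte source lane and an 8-byte destination lane. Everything here is an assumption for illustration: the `my_opcode_t` ids, `lane_info_t`, `lookup_lanes`, and `lanes_in_vector` are hypothetical names, not DynamoRIO API. A real tool would key on DynamoRIO's own opcode constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode ids for illustration only. */
typedef enum {
    MY_OP_ADDPS,
    MY_OP_ADDPD,
    MY_OP_CVTPS2PD,
    MY_OP_COUNT
} my_opcode_t;

/* Per-operand lane (element) widths, in bytes. */
typedef struct {
    uint8_t dst_lane_bytes;
    uint8_t src_lane_bytes;
} lane_info_t;

static const lane_info_t lane_table[MY_OP_COUNT] = {
    [MY_OP_ADDPS]    = { 4, 4 }, /* packed single -> packed single */
    [MY_OP_ADDPD]    = { 8, 8 }, /* packed double -> packed double */
    [MY_OP_CVTPS2PD] = { 8, 4 }, /* single sources widen to double results */
};

/* Returns the table entry for op, or NULL if op is out of range. */
static const lane_info_t *lookup_lanes(my_opcode_t op)
{
    return ((unsigned)op < MY_OP_COUNT) ? &lane_table[op] : NULL;
}

/* Number of whole lanes of lane_bytes that fit in vec_bytes. */
static size_t lanes_in_vector(size_t vec_bytes, size_t lane_bytes)
{
    assert(lane_bytes != 0 && vec_bytes % lane_bytes == 0);
    return vec_bytes / lane_bytes;
}
```

Keeping the widths per operand rather than per opcode is what lets the mixed-width conversions fit in the same table as the uniform ones.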

Not sure I have any IR solution suggestions: a smaller operand size than the full SIMD register size means something else in DR's IR (xref opnd_create_reg_partial) -- it means the rest of the register is untouched, and so the size still refers to the total area affected by the operation. As you said, adding to the size would require another dimension squished into it.

Adding a new field in opnd_t for a lane width with setters and getters might be a possibility? If SIMD registers never use union aux (if dr_opnd_flags_t never apply to SIMD) it could be put in there with no size increase, or in union value next to reg_id_t reg which is a short in a pointer-sized slot, if we want to avoid increasing the opnd_size_t size.

Otherwise it would have to be an externally-queried attribute, part of some SIMD library that maybe had dataflow semantics too as suggested.

derekbruening, Sep 07 '22 17:09
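
Editorial sketch of the "no size increase" idea above: if the register id is a short living in a pointer-sized union slot, an element-width byte can sit beside it without growing the operand. This is not DynamoRIO's actual opnd_t layout; `my_opnd_t`, `my_reg_id_t`, and the setter and getter are hypothetical stand-ins that only illustrate the packing argument.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t my_reg_id_t; /* stand-in for a short register id */

/* Hypothetical operand: NOT DynamoRIO's real opnd_t. */
typedef struct {
    uint8_t kind;
    uint8_t size;
    uint16_t aux;
    union {
        void *addr; /* pointer-sized slot shared by all operand kinds */
        struct {
            my_reg_id_t reg;
            uint8_t elem_bytes; /* lane width in bytes; 0 means unknown */
        } reg_and_elem;
    } value;
} my_opnd_t;

/* The extra byte fits in padding that the pointer-sized slot already has. */
_Static_assert(sizeof(((my_opnd_t *)0)->value) == sizeof(void *),
               "element width must not grow the value union");

static void my_opnd_set_elem_bytes(my_opnd_t *opnd, uint8_t elem_bytes)
{
    opnd->value.reg_and_elem.elem_bytes = elem_bytes;
}

static uint8_t my_opnd_get_elem_bytes(const my_opnd_t *opnd)
{
    return opnd->value.reg_and_elem.elem_bytes;
}
```

The static assertion is the point of the sketch: it fails at compile time if the added field ever pushes the union past one pointer.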

I stumbled across a related issue: there doesn't seem to be anything in DR that knows that the 12 bytes vcvtsi2ss takes from the first source operand are xmmN[4:16] and not xmmN[0:12]. Does that sound correct? Do none of the included analysis tools care about that?

khuey, Oct 18 '22 01:09

For the lane or element width: we were just discussing how to do this for AArch64 at https://github.com/DynamoRIO/dynamorio/pull/5681

derekbruening, Oct 18 '22 05:10

> I stumbled across a related issue: there doesn't seem to be anything in DR that knows that the 12 bytes vcvtsi2ss takes from the first source operand are xmmN[4:16] and not xmmN[0:12]. Does that sound correct? Do none of the included analysis tools care about that?

I think that is correct, and I think analysis tools are having to special-case that: sub-registers have no offset in the IR...

derekbruening, Oct 18 '22 05:10
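
Editorial sketch of the special-casing described above, written as byte-level taint propagation for the 128-bit form of vcvtsi2ss: the low 4 bytes of the destination come from the converted integer source, and bytes [4:16] are copied from the same byte positions of the first (xmm) source. The function name, the per-byte `bool` taint representation, and the constants are assumptions for illustration, not code from DynamoRIO or Dr. Memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { XMM_BYTES = 16, SS_BYTES = 4 };

/* Propagates per-byte taint for "vcvtsi2ss dst, src_xmm, src_int".
 * dst may alias src_xmm, so the result is built in a temporary. */
static void taint_vcvtsi2ss(bool dst[XMM_BYTES], const bool src_xmm[XMM_BYTES],
                            const bool *src_int, size_t src_int_bytes)
{
    bool tmp[XMM_BYTES];
    bool int_tainted = false;

    /* The conversion mixes all integer bytes into the float result. */
    for (size_t i = 0; i < src_int_bytes; i++) {
        if (src_int[i])
            int_tainted = true;
    }
    for (size_t i = 0; i < SS_BYTES; i++)
        tmp[i] = int_tainted;

    /* Bytes [4:16] keep their positions; they are not shifted down to [0:12]. */
    memcpy(tmp + SS_BYTES, src_xmm + SS_BYTES, XMM_BYTES - SS_BYTES);
    memcpy(dst, tmp, sizeof(tmp));
}
```

A tool that assumed the 12 bytes came from xmmN[0:12] would wrongly carry taint from byte 0 of the first source and drop taint from bytes 12 through 15.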

> For the lane or element width: we were just discussing how to do this for AArch64 at #5681

PR #5681 is adding a size field in opnd_t next to the reg value to hold the element size. It will be filled in only for AArch64 for now; this issue covers filling it in for x86. We should do AArch32 too.

derekbruening, Oct 18 '22 16:10

Great, I'll wait for that to land and then work something up. Is AArch64 handling the element type at all? I'm not familiar with the AArch64 architecture, but x86 has both floating point and integer instructions of course.

khuey, Oct 18 '22 16:10

> Great, I'll wait for that to land and then work something up. Is AArch64 handling the element type at all? I'm not familiar with the AArch64 architecture but x86 has both floating point and integer instructions of course.

Asking the folks working on AArch64: @joshua-warburton @AssadHashmi

derekbruening, Oct 18 '22 16:10

> > Great, I'll wait for that to land and then work something up. Is AArch64 handling the element type at all? I'm not familiar with the AArch64 architecture but x86 has both floating point and integer instructions of course.
>
> Asking the folks working on AArch64: @joshua-warburton @AssadHashmi

AFAIK we can't establish element type without looking at the full instruction context.

AssadHashmi, Oct 18 '22 19:10

I'm looking at this now that the AArch64 work has landed. There's still not a great place to put this in the x86 decoder table, unfortunately. instr_info_t provides two bytes per operand, and only 3 bits (2 type bits and 1 operand size bit) are still free (and TYPE_BEYOND_LAST_ENUM is pretty close to rolling over into one of those still-free bits).

If we simply added new OPSZ types, we could do just ps/pd with 6 values, I believe (OPSZ_16, OPSZ_16_vex32, and OPSZ_16_vex32_evex64, each with new ps/pd variants). Doing ss/sd and the various integer instructions would be a lot more work.

khuey, Oct 24 '22 02:10
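
Editorial sketch of the "6 values" count above: three existing packed sizes, each split into a ps and a pd variant, with the lane width recoverable from the variant alone. The `MY_OPSZ_*` names and `my_opsz_lane_bytes` are hypothetical and exist only for this illustration; no such constants are claimed to exist in DynamoRIO.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ps/pd variants of three packed operand sizes. */
typedef enum {
    MY_OPSZ_16_ps,
    MY_OPSZ_16_pd,
    MY_OPSZ_16_vex32_ps,
    MY_OPSZ_16_vex32_pd,
    MY_OPSZ_16_vex32_evex64_ps,
    MY_OPSZ_16_vex32_evex64_pd,
    MY_OPSZ_VARIANT_COUNT /* 3 sizes x 2 element types = 6 */
} my_opsz_variant_t;

/* Lane width in bytes implied by a variant: 4 for ps, 8 for pd. */
static uint8_t my_opsz_lane_bytes(my_opsz_variant_t v)
{
    switch (v) {
    case MY_OPSZ_16_ps:
    case MY_OPSZ_16_vex32_ps:
    case MY_OPSZ_16_vex32_evex64_ps:
        return 4;
    case MY_OPSZ_16_pd:
    case MY_OPSZ_16_vex32_pd:
    case MY_OPSZ_16_vex32_evex64_pd:
        return 8;
    default:
        assert(0 && "unknown variant");
        return 0;
    }
}
```

Covering ss/sd and the integer element widths the same way would multiply the number of variants, which is the extra work the comment refers to.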

Per https://github.com/DynamoRIO/dynamorio/issues/5638#issuecomment-1235754158 we could use a single value for the whole table entry, but per-operand seems much nicer. Widening the type byte could be on the table. Are you saying we're almost out of space in the OPSZ_ enum, or that there's a full bit, meaning 128 values, available?

derekbruening, Oct 24 '22 23:10