Extracting all lanes from a v128
At the moment, as far as I can see, the only way to extract more than one lane from a v128 value is the following:
```wat
(local $t v128)
...
local.tee $t
f32x4.extract_lane 0
local.get $t
f32x4.extract_lane 1
local.get $t
f32x4.extract_lane 2
local.get $t
f32x4.extract_lane 3
...
```
This seems quite inefficient: it needs an extra local plus a `local.get`/`extract_lane` pair for every lane, so the code-size footprint is large. I'm wondering whether there is a need for `[f32x4,f64x2,...].extract_all`.
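To make the idea concrete, here is a minimal sketch of the semantics such an instruction might have for f32x4, written as an ordinary function using today's instructions. The name `$extract_all_f32x4` is hypothetical and not part of any proposal text. The sketch assumes the multi-value feature is available, so the four lanes can be returned together in lane order.

```wat
(module
  ;; Hypothetical helper: pushes all four f32 lanes of $v, lane 0 first.
  ;; Assumes the SIMD and multi-value features are enabled.
  (func $extract_all_f32x4 (param $v v128) (result f32 f32 f32 f32)
    local.get $v
    f32x4.extract_lane 0
    local.get $v
    f32x4.extract_lane 1
    local.get $v
    f32x4.extract_lane 2
    local.get $v
    f32x4.extract_lane 3)
  (export "extract_all_f32x4" (func $extract_all_f32x4)))
```

A dedicated instruction would presumably have the same stack effect (one v128 in, four f32 values out) without the repeated `local.get`.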
Some things that would be useful to motivate this issue are:
- How often does this sequence occur in realistic applications?
- What CPU architectures have SIMD instructions that could be used to optimize `extract_all` better than 4 separate extracts?