Integrating with SB-SIMD
Since we now have packed arrays of structures, it should perhaps be possible to use SIMD to operate on these for going maximum brrrrrr. SB-SIMD became a contrib a while ago thanks to @marcoheisig; it provides an interface for SIMD instructions on SBCL. Will it be possible to somehow leverage it for use with unboxables?
One obstacle is that what we have here are not SIMD packs or CL specialized vectors, but raw pointers with some external type information and possibly heterogeneous data inside. Can SB-SIMD somehow be used with these?
Right now, sb-simd only has instructions for loading data from vectors, not from raw memory. But adding these instructions is a piece of cake.
As usual, the biggest challenge is naming things. How do you like sap-ref-X.Y for the function that loads a pack of type X.Y from a supplied foreign pointer and offset?
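For readers following along: SBCL already has scalar raw-memory accessors such as sb-sys:sap-ref-single and sb-sys:sap-ref-double, so the proposal extends that convention from scalars to whole packs. Here is a minimal, purely hypothetical sketch of how such a loader might be used; the name sap-ref-f64.4 is only the naming proposal above, and only the sb-sys calls actually exist:

(let ((data (make-array 8 :element-type 'double-float
                          :initial-contents '(1d0 2d0 3d0 4d0 5d0 6d0 7d0 8d0))))
  (sb-sys:with-pinned-objects (data)
    (let ((sap (sb-sys:vector-sap data)))
      ;; Today: scalar loads from raw memory, one double at a time.
      (print (sb-sys:sap-ref-double sap 0))   ; => 1.0d0
      (print (sb-sys:sap-ref-double sap 8))   ; => 2.0d0
      ;; Proposed (hypothetical, not in sb-simd yet): load a whole 256-bit
      ;; f64.4 pack from SAP + byte offset in one go.
      #+(or) (sap-ref-f64.4 sap 0))))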
Glad to hear the first part. I'm no good at naming, and I'm not a specialist in SB-SIMD, so I won't be able to say anything meaningful about naming, though.
OK, I'll try to get this into sbcl 2.3.5.
The C world would be happy with untyped SIMD packs, but those wouldn't fit well with Lisp.
Will we generate a distinct sap-ref-X.Y for every unboxable X and pack-fitting Y? It doesn't sound too bad. But does that also call for a distinct X.Y-values and make-X.Y for each case? We could go with a cffi:mem-ref-like convention that takes in a type as an argument, but the details need more working out.
Also, is there a particular fixed set of operations (looping while providing access) that we intend to provide? If so, we could design the primitives centered around those operations; otherwise, more general primitives do seem necessary.
Will we generate a distinct sap-ref-X.Y for every unboxable X and pack-fitting Y?
The X.Y would be for each element type abbreviation X (f32, f64, u8, ..., s64), and for each corresponding pack width Y. The rule is that the number of bits in X, multiplied by Y, must be the size of a SIMD register (either 128 or 256, and maybe 512 one day). With ten different element types X, two kinds of Y each, both loads and stores, and both regular and non-temporal variants, this makes for 10 x 2 x 2 x 2 = 80 functions. Plus another 40 functions (with 128 bit only) for SSE. Luckily, sb-simd can generate all these variants from a table :)
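To make the table concrete, here are a few of the rows it would cover; the sap-ref-* spellings below are only the naming proposal from earlier in the thread, not existing sb-simd exports:

;; X = f64 (64 bits):  Y = 2 (SSE, 128 bit) -> sap-ref-f64.2
;;                     Y = 4 (AVX, 256 bit) -> sap-ref-f64.4
;; X = f32 (32 bits):  Y = 4 (SSE)          -> sap-ref-f32.4
;;                     Y = 8 (AVX)          -> sap-ref-f32.8
;; X = u8  (8 bits):   Y = 16 (SSE)         -> sap-ref-u8.16
;;                     Y = 32 (AVX)         -> sap-ref-u8.32
;; ... and for each of these, a (setf ...) store plus non-temporal variants.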
Also, is there a particular fixed set of operations (looping while providing access) that we intend to provide?
I started working on a library for this some time ago: https://github.com/marcoheisig/Loopus
I think what users want is to write the non-SIMD loop only, and then use an auto-vectorizing macro. Unfortunately, I never finished Loopus. Maybe this is something I could work on for next year's SBCL developer meeting :thinking:
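In other words, the loop a user would prefer to write is plain scalar CL like the sketch below (a hypothetical example, not Loopus syntax); the auto-vectorizer's job would be to rewrite it into the pack-at-a-time operations that sb-simd exposes:

(defun scale! (vec factor)
  ;; Plain scalar loop; an auto-vectorizing macro would be expected to turn
  ;; this into SIMD pack operations behind the scenes.
  (declare (type (simple-array double-float (*)) vec)
           (type double-float factor)
           (optimize speed))
  (loop for i below (length vec)
        do (setf (aref vec i) (* (aref vec i) factor)))
  vec)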
A naïve non-CL question from a non-SIMD developer: let's assume that we have a structure composed of an f32 and a ub64, so heterogeneous data, 96 bits in total. Is it somehow possible to efficiently operate on only the float parts via some sort of masking?
@phoe Sure, sb-simd has masking operations. But if you only have a single structure instance with a single value, SIMD won't do much for you.
Things get interesting once you have more than one instance, though. In that case, everything depends on the memory layout. If you have an array-of-structures layout, things are somewhat awkward, but if you have a structure-of-arrays layout, you'll have one array containing only the values of the first slot, followed by one array of the values of the second slot, and so on. In the latter case, SIMD works really well.
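To illustrate the structure-of-arrays case with the f32 + ub64 structure from the question: the f32 slots go into one specialized vector and the ub64 slots into another, and the float half can then be processed a pack at a time. A sketch assuming that sb-simd-avx2 exports f32.8-aref and f32.8+ analogously to the f64.4 operations, and that the vectors' lengths are multiples of 8:

(defun add-float-slots! (a b result)
  ;; A, B and RESULT are the f32 slot vectors of three structure-of-arrays
  ;; collections; the corresponding ub64 slot vectors live elsewhere and are
  ;; untouched here.  Lengths are assumed to be multiples of 8.
  (declare (type (simple-array single-float (*)) a b result)
           (optimize speed))
  (loop for i below (length result) by 8
        do (setf (sb-simd-avx2:f32.8-aref result i)
                 (sb-simd-avx2:f32.8+ (sb-simd-avx2:f32.8-aref a i)
                                      (sb-simd-avx2:f32.8-aref b i))))
  result)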
Welp, apologies; rather than a structure of arrays (which is the trivial case), I meant an array of 96-bit f32 + ub64 structures.
This example would mean that a 256-bit register needs to operate only on the 1st, 4th, and 7th 32-bit chunks, then only on the 2nd, 5th, and 8th, then only on the 3rd and 6th before the pattern wraps around. I assume that this is possible with masking, and that it still gives some sort of non-negligible speedup over simple iteration?
Even without masking, there are gather/scatter instructions that allow loading from / storing to offsets from a base address. I'm not sure about the speedup numbers though.
The X.Y would be for each element type abbreviation X (f32, f64, u8, ..., s64), and for each corresponding pack width Y.
Ah, wait, so was the plan to generate these instructions for cases where X can be, say, point or triplet, or are there no such plans?
I started working on a library for this some time ago: https://github.com/marcoheisig/Loopus
Oh yes, you had mentioned it in your paper the previous year. Unfortunately I haven't had the time to dive into it, and I'm stuck in that "20% of the things provide 80% of the functionality" local optimum that can be a bane for Lisp.
Ah, wait, so was the plan to generate these instructions for cases where X can be, say, point or triplet, or are there no such plans?
I don't know if a point or triplet can be SIMD-settable. I assume it would be the job of some sort of unboxables-simd system to compile some higher-level operations on user-defined types into sb-simd calls on the primitive constituents of that type.
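As a purely hypothetical illustration of that idea: for a point made of three f64 slots reached through a raw pointer, a higher-level point+ could be expanded into operations on the individual slots, and eventually (once the proposed pack loads exist) into a single masked f64.4 operation. Everything below except sb-sys:sap-ref-double is invented:

(defun point+/expanded (a b c)
  ;; Hypothetical expansion of (point+ a b) storing into C, where each point is
  ;; three unboxed f64 slots (24 bytes) behind a raw SAP.  A SIMD-aware
  ;; expansion would instead use the proposed pack loads, masking or ignoring
  ;; the unused fourth lane.
  (declare (type sb-sys:system-area-pointer a b c))
  (loop for offset from 0 below 24 by 8
        do (setf (sb-sys:sap-ref-double c offset)
                 (+ (sb-sys:sap-ref-double a offset)
                    (sb-sys:sap-ref-double b offset))))
  c)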
Oh, okay! I think I misunderstood something above.
Yup, okay, seems like the issue can be resolved by (i) having SB-SIMD functions that work on foreign pointers, and (ii) providing a LOOPUS or equivalent functionality.
A more general idea on the wishlist is for SIMD support in CFFI itself.
Even without masking, there are gather/scatter instructions that allow loading from / storing to offsets from a base address.
I haven't yet added gather/scatter instructions to sb-simd, because they are also not much faster than just emitting one load per element and then packing those values. The resulting CPU microcode is probably almost identical.
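Concretely, for the array of 96-bit f32 + ub64 structures discussed above, "one load per element and then packing those values" would look roughly like the sketch below. sb-sys:sap-ref-single is a real SBCL accessor and make-f32.4 follows the make-X.Y constructor convention mentioned earlier; the 12-byte stride and float-at-offset-0 layout are assumptions:

(defun load-four-float-slots (sap first-index)
  ;; Pack the f32 slots of four consecutive 12-byte structures, assuming each
  ;; structure starts with its f32 slot, into one 128-bit f32.4 pack.
  (declare (type sb-sys:system-area-pointer sap)
           (type fixnum first-index))
  (flet ((float-slot (i) (sb-sys:sap-ref-single sap (* 12 i))))
    (sb-simd-sse:make-f32.4 (float-slot (+ first-index 0))
                            (float-slot (+ first-index 1))
                            (float-slot (+ first-index 2))
                            (float-slot (+ first-index 3)))))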
Yup, okay, seems like the issue can be resolved by (i) having SB-SIMD functions that work on foreign pointers, and (ii) providing a LOOPUS or equivalent functionality.
If you, or anyone else, want to help with Loopus or start a related project, don't hesitate to contact me. I also wrote https://github.com/marcoheisig/cl-isl so that one can use the same secret sauce as GCC or Clang when it comes to optimizing loop nests.
A more general idea on the wishlist is for SIMD support in CFFI itself.
My original plan was to write a library called cl-simd to offer SIMD support for all (well, most) implementations. But given the daunting challenge of adding all those 1000+ instructions to half a dozen implementations, I decided to go for SBCL first. My hope is that this sets a de-facto standard that will be adopted by other implementations, so that a cl-simd portability library will emerge eventually. Once there is such a portability library, it would be straightforward to make CFFI use it. I really wouldn't want some #+sb-simd statements directly in CFFI, out of fear that I'd have to take on some of that maintenance burden.