Support SIMD
Hello. Is there support for atomics and SIMD in the Crystal language?
Hi. Not yet, but it's very likely we'll have these in the future
Changed title because Atomics are already implemented
Is anyone currently working on this? I would be happy to help if I can. Could you tell me what probably needs to be done and how complex it would be?
SIMD is supported at the optimization level (usually in loops) by LLVM natively. For explicit "I want to guarantee this will execute the exact SIMD calls I specify" support, this will have to be supported in the codegen and language. LLVM represents this using vector types, and support for these will have to be added to Crystal for basic mathematical operations on multiple values at once.
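To illustrate the "optimization level" part: a plain element-wise loop in today's Crystal can already be turned into SIMD instructions by LLVM's auto-vectorizer, without any language changes. A minimal sketch (whether vectorization actually kicks in depends on the target and the optimization pipeline):

```crystal
# Plain Crystal, compiles today. Built with --release, LLVM's loop
# auto-vectorizer may lower this loop to paddd/vpaddd on x86-64.
def add_all(a : Slice(Int32), b : Slice(Int32), dst : Slice(Int32))
  a.size.times do |i|
    dst[i] = a[i] + b[i]
  end
end
```

The point of explicit vector types is to *guarantee* this lowering instead of hoping the optimizer finds it.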
For more specialized architecture-specific SIMD instructions, you can create a shard wrapping inline assembler calls.
Thanks for the quick explanation! I'm pretty sure adding SIMD types to the language shouldn't be that hard, because their implementation is probably very similar to those of the existing primitives (like Int32 or Float64). In fact, basic operations such as add work exactly the same way for vector and scalar types in LLVM.
I have no clue about assembly but I don't think it's a good idea to write native code for a language based on LLVM (which apparently has proper SIMD support).
After experimenting a bit I realized that LLVM is actually very smart when it comes to SIMD. For example, I can create two vectors of 32 32-bit integers and add them together like this:
%vp1 = alloca <32 x i32>
%v1 = load <32 x i32>, <32 x i32>* %vp1
%vp2 = alloca <32 x i32>
%v2 = load <32 x i32>, <32 x i32>* %vp2
%sum = add <32 x i32> %v1, %v2
Of course, such a big vector can't fit in any single register, so LLVM splits it across multiple registers (at least I think so):
vmovdqa 256(%rsp), %ymm0
vmovdqa 288(%rsp), %ymm1
vmovdqa 320(%rsp), %ymm2
vmovdqa 352(%rsp), %ymm3
vpaddd 160(%rsp), %ymm1, %ymm1
vpaddd 192(%rsp), %ymm2, %ymm2
vpaddd 224(%rsp), %ymm3, %ymm3
vpaddd 128(%rsp), %ymm0, %ymm0
Please don't ask me how any of this works; I just see several AVX registers being used one after another. The size of the vector doesn't even have to be a power of two. We probably don't need to do any complicated checks, because LLVM handles pretty much anything for us.
If you want, you can try experimenting with this. The implementation seems similar to the one for StaticArray, but instead of an LLVM array you must use an LLVM vector. Searching for StaticArray in the compiler's source code could give you a path for doing it. It also seems like something fun to experiment with.
I actually think vectors could be implemented mostly the same way as integers/floats/booleans. I mean, most IR instructions which work for scalars also work for vectors. For example, you can icmp two integer vectors and get a boolean vector of the same size containing the results of the individual comparisons. Arithmetic operations and casts work the same way, too.
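As a sketch of what that could look like from the Crystal side (a hypothetical `Vector(T, N)` API, not in the language; the constructor shape is made up):

```crystal
# Hypothetical API mirroring LLVM's semantics: element-wise ops on a
# vector compile to a single vector instruction, and comparisons on an
# integer vector yield a boolean vector of the same length.
a = Vector(Int32, 4).new(1, 5, 3, 7)
b = Vector(Int32, 4).new(4, 2, 3, 8)

sum  = a + b  # {5, 7, 6, 15}                 one `add <4 x i32>`
mask = a < b  # {true, false, false, true}    one `icmp slt <4 x i32>`
```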
Quick update: I wrote a very early implementation of a Vector(T, N) type (https://github.com/malte-v/crystal/tree/simd-support). Currently you can only get and set the individual elements, but it does use the SIMD registers. I'll add more functionality soon.
I'm really not sure what the "front end" of the Vector type should look like. Currently it's like StaticArray, where you specify the element type and number of elements in the type arguments. However, this doesn't work well with the semantics a Vector should have.
This is because the arithmetic you can do with a vector depends on what its element type is. For example, you may want to AND two integer vectors, but that wouldn't make any sense for a vector of floats. It would be great to be able to do "generic specializations", as discussed in #3298.
A workaround I used at first was to define macros like
macro check_int
{% raise "Method #{@def.name}#{@def.args} is only supported for vectors of integers" unless [Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128].includes? T %}
end
and then call them in methods that should only be available to some vectors. But this doesn't work with @[Primitive(...)] methods because these can't have a body.
Any ideas how to overcome this? I personally would like to keep the generic Vector(T, N) type because it seems logical and natural to me.
Maybe make the primitive methods private and add the check for public methods?
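Something like this, perhaps (a sketch only; the primitive name is hypothetical, and the integer-type list mirrors the macro above):

```crystal
struct Vector(T, N)
  # @[Primitive(...)] methods can't have a body, so the unchecked
  # primitive stays private (primitive name here is made up):
  @[Primitive(:vector_and)]
  private def unsafe_and(other : Vector(T, N)) : Vector(T, N)
  end

  # ...and the element-type guard lives in the public wrapper.
  def &(other : Vector(T, N)) : Vector(T, N)
    {% raise "Vector#& is only supported for vectors of integers" unless [Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128].includes? T %}
    unsafe_and(other)
  end
end
```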
@asterite Thanks, weird that I didn't come up with this myself.
What about specialized structs like Int32::Vector(N) and Float64::Vector(N)? With simple macro constructors like:
Int32.vector(1, 2, 3) => Int32::Vector(3)
It would avoid the issue of manually raising (let the compiler do its job) and allow a nicer API for int or float specific calls, for example Int32::Vector(N)#check instead of Vector(T, N)#check_int.
@ysbaddaden I really like that approach. It lets us omit all the "hacky" parts of the aforementioned implementation while also emphasizing the element type. I'll change the implementation to this.
I need some more help with the implementation. I'm trying to add the vector structs to the built-in types in program.cr. The key part is:
types["Bool::Vector"] = vec_bool = @vec_bool = BoolVectorType.new self, bool, "Vector", value, ["N"], bool
vec_bool.struct = true
vec_bool.can_be_stored = false
The same goes for the integer and float types. The arguments to the BoolVectorType constructor are the program, the namespace, the type name, the base class, the generic type arguments, and the element type (which here is the same type as the namespace). From my understanding this should match:
# <top level>
struct SomeVectorElementType
struct Vector(N)
...
end
end
But this is not the case. Unless I'm explicitly requiring vectors.cr, I get undefined constant Int32::Vector. How should I add these sub-types to the Program?
You need to do:
types["Bool"].types["Vector"] = vec_bool = ...
Types contain other types. Like in code.
Or just bool.types["Vector"] = ... if bool was defined in a previous line.
Though namespacing things like this makes little sense. Probably IntVector, BoolVector, FloatVector is better.
In any case, without a concrete use case I don't know why we'd like to eventually merge this.
Thanks, works fine now.
I'd use vectors for CG, but SIMD is mostly useful when performing the same arithmetic on lots of data over and over again. Of course you can cross your fingers and hope the optimizer does its job, but I like to be sure my code is properly optimized. Either way, I think vectors are nice to have just for the sake of convenience (no need to loop over arrays).
SIMD is useful whenever there are operations to repeat over a set of values. Instead of manually operating on each value, you can use a single instruction to operate on many at once, supposedly speeding things up (at least since SSE).
It can be useful in geographical operations (k-means) over 2d or 3d points, to project polygons on a map, matrix transformations, applying transformations to rgba pixels... It can also be used in PRNG and digest calculations, see for example https://github.com/lemire/simdpcg
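The RGBA case shows the shape well: the same arithmetic repeated across every channel of every pixel. A hypothetical sketch (the `Vector` API doesn't exist in the language; the method is made up for illustration):

```crystal
# Hypothetical: halve the brightness of 4 RGBA pixels (16 channels)
# with one vector shift instead of 16 scalar operations.
def darken(pixels : Vector(UInt8, 16)) : Vector(UInt8, 16)
  pixels >> 1 # element-wise shift across all 16 lanes at once
end
```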
FYI, vector arithmetic and conversions are now working fine. I was able to reuse all the codegen logic from the scalar operations. However, boolean vectors don't work at all, every element gets printed as false. Still figuring out what's wrong there.
I'm not sure how the ElementType::Vector(N) architecture would work with generic types using Vector. For example:
class Matrix(T, N, M) # T = element type; N = no. of rows; M = no. of columns
@rows : T::Vector(M)[N] # undefined constant T::Vector
@rows : Vector(T, M)[N] # works
end
I'd like to hear some thoughts on https://github.com/crystal-lang/crystal/issues/3298#issuecomment-500363243
@malte-v Given that Int::Vector and Float::Vector have different operations, I can imagine there being Int::Matrix and Float::Matrix because of the same reason...
@asterite I don't think having 12 different matrix classes makes sense because no one would ever AND two matrices or use some other int-specific operation. Using macro loops here will produce duplicate code.
edit: Wording
I am curious, what is the status on this?
I wrote a Vector struct to use in raylib once, and I was inspired by how Crystal's stdlib for Complex numbers extends the Number classes to allow writing complex numbers like 2 + 1.i.
So I wrote it so you could do something like 1.x + 1.y to write Vector[1, 1], similarly to how you would use basis vectors in math. Here is a sample with the class and some operations for up to 4-dimensional vectors:
https://play.crystal-lang.org/#/r/ehgt
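A minimal, self-contained version of that idea (reduced to 2D here; the type name and fields are simplified from the playground sample):

```crystal
struct Vec2
  getter x : Float64
  getter y : Float64

  def initialize(@x : Float64, @y : Float64)
  end

  def +(other : Vec2) : Vec2
    Vec2.new(@x + other.x, @y + other.y)
  end
end

# Reopen Number so every numeric literal gets basis-vector sugar,
# in the same spirit as Complex's Number#i in the stdlib.
struct Number
  def x : Vec2
    Vec2.new(to_f, 0.0)
  end

  def y : Vec2
    Vec2.new(0.0, to_f)
  end
end

v = 1.x + 1.y # a Vec2 with x = 1.0, y = 1.0
```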
I have been working on this myself to the point where I could port lemire/fastvalidate-utf-8 to Crystal natively (note: simdutf is the production-ready choice that supports way more CPUs). You could take a look at the proof-of-concept branch here: https://github.com/HertzDevil/crystal/tree/feature/vector
A brief overview:
- `SIMD::IntVector(T, N)`, `SIMD::FloatVector(T, N)`, `SIMD::BoolVector(T, N)`, and `SIMD::PointerVector(T, N)` are the actual vector types and a direct correspondence to the underlying LLVM vector types. `T` is the element type, which must be `Bool` for `BoolVector`, for consistency. `N` is a positive integer, but otherwise there are no restrictions. LLVM may or may not break on any given platform where a vector type doesn't correspond to an actual register, e.g. `SIMD::FloatVector(Float32, 16)` on x86-64 without AVX512, but if a correspondence exists, that vector type is guaranteed to abide by the ABI.
- `SIMD::Vector(T, N)` is the common module shared by those vector types. Following the recent work on `ReferenceStorage`, none of these type names are hardcoded in the compiler.
- `.literal` is the constructor for vectors. Arguments must have identical types, no autocasting happens, and generic type arguments cannot be supplied yet. If all arguments are literals, the resulting value is also a single LLVM vector literal. (Fun fact: the implementation for `Slice.literal` originated from this.)
- `#unsafe_fetch`, `#unsafe_copy_with`, and `#unsafe_shuffle` correspond to LLVM's `extractelement`, `insertelement`, and `shufflevector` instructions. For `shufflevector` the mask must be a constant literal; non-constant masks are only possible with platform-specific intrinsics, such as this proof-of-concept's `Intrinsics::X86.mm_shuffle_epi8`.
- `.cast` performs bitcasts between vector types with the same total bit sizes, e.g. from `SIMD::IntVector(Int8, 32)` to `SIMD::FloatVector(Float32, 8)` or `SIMD::BoolVector(Bool, 256)`. This is similar to `#unsafe_as`, but no pointers are involved because the cast may take place entirely inside a register. Bitcasts between pointer and non-pointer vectors are deliberately left undefined because pointers do not have a platform-independent fixed size.
- `.unaligned_load` and `#unaligned_store` are analogs of `Pointer(T)#value` and `#value=`, except the former pair assumes byte alignment, whereas the latter pair follows that of `T`. They are necessary since e.g. x86-64 `MOVAPS` segfaults on misaligned addresses, but they are not exclusive to vectors per se; any overaligned type will benefit from this.
- For simplicity, all vector-vector operations are implemented with a new `vector_zip` primitive, which requires the operand types to be identical. Many can be made to use `binary` instead with some effort; others map to LLVM intrinsics.
- The `vector_reduce` primitive is for reduction intrinsics defined by LLVM, like `BoolVector#all?`.
- `samples/simd/intrinsics/**` define the Intel intrinsics in terms of the vector types. They are a mixture of vector operations, target-specific LLVM intrinsics, and Clang compiler built-ins, all ported from Clang's `x86intrin.h` family. Most notably, none of them require inline assembly. Although not strictly necessary, they facilitate porting SIMD code in C that invokes the Intel intrinsics directly (which is the majority of them).
- Presence of a specific CPU feature must be explicitly declared. This proof-of-concept currently handles `x86_has_sse41` and `x86_has_avx2`. #14524 is supposed to take care of this in the compiler itself.
- `SIMD::Vector` deliberately does not include `Indexable`, even though vectors are seemingly compatible with immutable `StaticArray`s. If your vectors are used like that, chances are the right tools are not being used for the right tasks, and your code won't benefit from data parallelism.
- Notable features absent from the proof-of-concept: scalable vectors (e.g. for AArch64 SVE), predicated intrinsics (e.g. for AVX512), type-independent `#==` between vectors (this requires changing `vector_zip` to `binary` as mentioned above), overflow-checked arithmetic, correctly aligned allocations, C ABI support.
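Put together, using the proof-of-concept might look roughly like this (a sketch against the branch above; I only use names from the overview, and exact signatures may differ):

```crystal
# Requires the feature/vector proof-of-concept branch, not stock Crystal.
a = SIMD::IntVector(Int32, 4).literal(1, 2, 3, 4)
b = SIMD::IntVector(Int32, 4).literal(10, 20, 30, 40)

sum = a + b         # vector_zip primitive: operand types must be identical
sum.unsafe_fetch(2) # LLVM extractelement; third lane of {11, 22, 33, 44}
```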
`samples/simd/utf8_valid_encoding.cr` is the actual benchmark code for different implementations of `String#valid_encoding?`, parsing 1000 random 65536-byte valid UTF-8 strings. Run it with `samples/simd/run_benchmarks.sh` on an x86-64 machine with AVX2 support to see the speedup in action:
Baseline:
old 4.46 (224.14ms) (± 0.17%) 0.0B/op 8.25× slower
scalar 36.79 ( 27.18ms) (± 0.12%) 0.0B/op fastest
x86-64-v2:
old 4.51 (221.85ms) (± 0.18%) 0.0B/op 18.23× slower
scalar 36.79 ( 27.18ms) (± 0.14%) 0.0B/op 2.23× slower
sse4.1 82.16 ( 12.17ms) (± 1.44%) 0.0B/op fastest
x86-64-v3:
old 4.35 (229.86ms) (± 0.55%) 0.0B/op 26.51× slower
scalar 56.95 ( 17.56ms) (± 0.27%) 0.0B/op 2.03× slower
sse4.1 83.78 ( 11.94ms) (± 0.58%) 0.0B/op 1.38× slower
avx2 115.33 ( 8.67ms) (± 1.40%) 0.0B/op fastest
`scalar` is the stdlib implementation since #12145, `old` is the one before. The current implementation actually relies on the BMI2 extension from x86-64-v3 for the best performance. The AVX2 version in turn is 2-3 times as fast as that. (This is insufficient for #11873, as we most likely also want a faster `#each_char`.)
I would like to hear some initial feedback first; if we agree this is the general direction we want to take, I'll submit an RFC later and then we could sort out all the details.
I love it, awesome work @HertzDevil :heart_hands:
Starting an RFC sounds like a good idea.