Support SIMD
Hello. Is there support for atomics and SIMD in the Crystal language?
Hi. Not yet, but it's very likely we'll have these in the future
Changed title because Atomics are already implemented
Is anyone currently working on this? I would be happy to help if I can. Could you tell me what probably needs to be done and how complex it would be?
SIMD is supported at the optimization level (usually in loops) by LLVM natively. For explicit "I want to guarantee this will execute the exact SIMD calls I specify" support, this will have to be supported in the codegen and language. LLVM represents this using vector types, and support for these will have to be added to Crystal for basic mathematical operations on multiple values at once.
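To illustrate the "optimization level" part: a plain element-wise loop in today's Crystal can already be turned into SIMD instructions by LLVM's auto-vectorizer, without any language changes. A minimal sketch (whether vectorization actually kicks in depends on the target and the optimization pipeline):

```crystal
# Plain Crystal, compiles today. Built with --release, LLVM's loop
# auto-vectorizer may lower this loop to paddd/vpaddd on x86-64.
def add_all(a : Slice(Int32), b : Slice(Int32), dst : Slice(Int32))
  a.size.times do |i|
    dst[i] = a[i] + b[i]
  end
end
```

The point of explicit vector types is to *guarantee* this lowering instead of hoping the optimizer finds it.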
For more specialized architecture-specific SIMD instructions, you can create a shard wrapping inline assembler calls.
Thanks for the quick explanation! I'm pretty sure adding SIMD types to the language shouldn't be that hard, because their implementation is probably very similar to those of the existing primitives (like Int32 or Float64). In fact, basic operations such as add work exactly the same way for vector and scalar types in LLVM.
I have no clue about assembly but I don't think it's a good idea to write native code for a language based on LLVM (which apparently has proper SIMD support).
After experimenting a bit I realized that LLVM is actually very smart when it comes to SIMD. For example, I can create two vectors of 32 32-bit integers and add them together like this:
%vp1 = alloca <32 x i32>
%v1 = load <32 x i32>, <32 x i32>* %vp1
%vp2 = alloca <32 x i32>
%v2 = load <32 x i32>, <32 x i32>* %vp2
%sum = add <32 x i32> %v1, %v2
Of course, such a big vector can't fit in any single register, so LLVM splits it across multiple registers (at least I think so):
vmovdqa 256(%rsp), %ymm0
vmovdqa 288(%rsp), %ymm1
vmovdqa 320(%rsp), %ymm2
vmovdqa 352(%rsp), %ymm3
vpaddd 160(%rsp), %ymm1, %ymm1
vpaddd 192(%rsp), %ymm2, %ymm2
vpaddd 224(%rsp), %ymm3, %ymm3
vpaddd 128(%rsp), %ymm0, %ymm0
Please don't ask me how any of this works; I just see several AVX registers being used one after another. The size of the vector doesn't even have to be a power of two. We probably don't need to do any complicated checks, because LLVM handles pretty much anything for us.
If you want, you can try experimenting with this. The implementation seems similar to the one for StaticArray, but instead of an LLVM array you must use an LLVM vector. Searching for StaticArray in the compiler's source code could give you a path for doing it. It also seems like something fun to experiment with.
I actually think vectors could be implemented mostly the same way as integers/floats/booleans. I mean, most IR instructions which work for scalars also work for vectors. For example, you can icmp two integer vectors and get a boolean vector of the same size containing the results of the individual comparisons. Arithmetic operations and casts work the same way, too.
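As a sketch of what that could look like from the Crystal side (a hypothetical `Vector(T, N)` API, not in the language; the constructor shape is made up):

```crystal
# Hypothetical API mirroring LLVM's semantics: element-wise ops on a
# vector compile to a single vector instruction, and comparisons on an
# integer vector yield a boolean vector of the same length.
a = Vector(Int32, 4).new(1, 5, 3, 7)
b = Vector(Int32, 4).new(4, 2, 3, 8)

sum  = a + b  # {5, 7, 6, 15}                 one `add <4 x i32>`
mask = a < b  # {true, false, false, true}    one `icmp slt <4 x i32>`
```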
Quick update: I wrote a very early implementation of a Vector(T, N) type (https://github.com/malte-v/crystal/tree/simd-support). Currently you can only get and set the individual elements, but it does use the SIMD registers. I'll add more functionality soon.
I'm really not sure what the "front end" of the Vector type should look like. Currently it's like StaticArray, where you specify the element type and number of elements in the type arguments. However, this doesn't work well with the semantics a Vector should have.
This is because the arithmetic you can do with a vector depends on what its element type is. For example, you may want to AND two integer vectors, but that wouldn't make any sense for a vector of floats. It would be great to be able to do "generic specializations", as discussed in #3298.
A workaround I used at first was to define macros like
macro check_int
{% raise "Method #{@def.name}#{@def.args} is only supported for vectors of integers" unless [Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128].includes? T %}
end
and then call them in methods that should only be available to some vectors. But this doesn't work with @[Primitive(...)] methods because these can't have a body.
Any ideas how to overcome this? I personally would like to keep the generic Vector(T, N) type because it seems logical and natural to me.
Maybe make the primitive methods private and add the check for public methods?
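Something like this, perhaps (a sketch only; the primitive name is hypothetical, and the integer-type list mirrors the macro above):

```crystal
struct Vector(T, N)
  # @[Primitive(...)] methods can't have a body, so the unchecked
  # primitive stays private (primitive name here is made up):
  @[Primitive(:vector_and)]
  private def unsafe_and(other : Vector(T, N)) : Vector(T, N)
  end

  # ...and the element-type guard lives in the public wrapper.
  def &(other : Vector(T, N)) : Vector(T, N)
    {% raise "Vector#& is only supported for vectors of integers" unless [Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128].includes? T %}
    unsafe_and(other)
  end
end
```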
@asterite Thanks, weird that I didn't come up with this myself.
What about specialized structs like Int32::Vector(N) and Float64::Vector(N)? With simple macro constructors like:
Int32.vector(1, 2, 3) => Int32::Vector(3)
It would avoid the issue of manually raising (let the compiler do its job) and allow a nicer API for int or float specific calls, for example Int32::Vector(N)#check instead of Vector(T, N)#check_int.
@ysbaddaden I really like that approach. It lets us omit all the "hacky" parts of the aforementioned implementation while also emphasizing the element type. I'll change the implementation to this.
I need some more help with the implementation. I'm trying to add the vector structs to the built-in types in program.cr. The key part is:
types["Bool::Vector"] = vec_bool = @vec_bool = BoolVectorType.new self, bool, "Vector", value, ["N"], bool
vec_bool.struct = true
vec_bool.can_be_stored = false
The same goes for the integer and float types. The arguments to the BoolVectorType constructor are the program, the namespace, the type name, the base class, the generic type arguments, and the element type (which here is the same type as the namespace). From my understanding this should match:
# <top level>
struct SomeVectorElementType
struct Vector(N)
...
end
end
But this is not the case. Unless I'm explicitly requiring vectors.cr, I get undefined constant Int32::Vector. How should I add these sub-types to the Program?
You need to do:
types["Bool"].types["Vector"] = vec_bool = ...
Types contain other types. Like in code.
Or just bool.types["Vector"] = ... if bool was defined in a previous line.
Though namespacing things like this makes little sense. Probably IntVector, BoolVector, FloatVector is better.
In any case, without a concrete use case I don't know why we'd like to eventually merge this.
Thanks, works fine now.
I'd use vectors for CG, but SIMD is mostly useful when performing the same arithmetic on lots of data over and over again. Of course you can cross your fingers and hope the optimizer does its job, but I like to be sure my code is properly optimized. Either way, I think vectors are nice to have just for the sake of convenience (no need to loop over arrays).
SIMD is useful whenever there are operations to repeat over a set of values. Instead of manually operating on each value, you can use a single instruction to operate on many at once, supposedly speeding things up (at least since SSE).
It can be useful in geographical operations (k-means) over 2d or 3d points, to project polygons on a map, matrix transformations, applying transformations to rgba pixels... It can also be used in PRNG and digest calculations, see for example https://github.com/lemire/simdpcg
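The RGBA case shows the shape well: the same arithmetic repeated across every channel of every pixel. A hypothetical sketch (the `Vector` API doesn't exist in the language; the method is made up for illustration):

```crystal
# Hypothetical: halve the brightness of 4 RGBA pixels (16 channels)
# with one vector shift instead of 16 scalar operations.
def darken(pixels : Vector(UInt8, 16)) : Vector(UInt8, 16)
  pixels >> 1 # element-wise shift across all 16 lanes at once
end
```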
FYI, vector arithmetic and conversions are now working fine. I was able to reuse all the codegen logic from the scalar operations. However, boolean vectors don't work at all, every element gets printed as false. Still figuring out what's wrong there.
I'm not sure how the ElementType::Vector(N) architecture would work with generic types using Vector. For example:
class Matrix(T, N, M) # T = element type; N = no. of rows; M = no. of columns
@rows : T::Vector(M)[N] # undefined constant T::Vector
@rows : Vector(T, M)[N] # works
end
I'd like to hear some thoughts on https://github.com/crystal-lang/crystal/issues/3298#issuecomment-500363243
@malte-v Given that Int::Vector and Float::Vector have different operations, I can imagine there being Int::Matrix and Float::Matrix because of the same reason...
@asterite I don't think having 12 different matrix classes makes sense because no one would ever AND two matrices or use some other int-specific operation. Using macro loops here will produce duplicate code.
edit: Wording
I am curious, what is the status on this?
I wrote a Vector struct to use in raylib once, and I was inspired by how Crystal's stdlib for Complex numbers extends the Number classes to allow writing complex numbers like 2 + 1.i.
So I wrote it so you could do something like 1.x + 1.y to write Vector[1, 1], similarly to how you would use basis vectors in math. Here is a sample with the class and some operations for up to 4-dimensional vectors:
https://play.crystal-lang.org/#/r/ehgt
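A minimal, self-contained version of that idea (reduced to 2D here; the type name and fields are simplified from the playground sample):

```crystal
struct Vec2
  getter x : Float64
  getter y : Float64

  def initialize(@x : Float64, @y : Float64)
  end

  def +(other : Vec2) : Vec2
    Vec2.new(@x + other.x, @y + other.y)
  end
end

# Reopen Number so every numeric literal gets basis-vector sugar,
# in the same spirit as Complex's Number#i in the stdlib.
struct Number
  def x : Vec2
    Vec2.new(to_f, 0.0)
  end

  def y : Vec2
    Vec2.new(0.0, to_f)
  end
end

v = 1.x + 1.y # a Vec2 with x = 1.0, y = 1.0
```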
I have been working on this myself to the point where I could port lemire/fastvalidate-utf-8 to Crystal natively (note: simdutf is the production-ready choice that supports way more CPUs). You could take a look at the proof-of-concept branch here: https://github.com/HertzDevil/crystal/tree/feature/vector
A brief overview:
- `SIMD::IntVector(T, N)`, `SIMD::FloatVector(T, N)`, `SIMD::BoolVector(T, N)`, and `SIMD::PointerVector(T, N)` are the actual vector types and a direct correspondence to the underlying LLVM vector types. `T` is the element type, which must be `Bool` for `BoolVector`, for consistency. `N` is a positive integer, but otherwise there are no restrictions. LLVM may or may not break on any given platform where a vector type doesn't correspond to an actual register, e.g. `SIMD::FloatVector(Float32, 16)` on x86-64 without AVX512, but if a correspondence exists, that vector type is guaranteed to abide by the ABI.
- `SIMD::Vector(T, N)` is the common module shared by those vector types. Following the recent work on `ReferenceStorage`, none of these type names are hardcoded in the compiler.
- `.literal` is the constructor for vectors. Arguments must have identical types, no autocasting happens, and generic type arguments cannot be supplied yet. If all arguments are literals, the resulting value is also a single LLVM vector literal. (Fun fact: the implementation for `Slice.literal` originated from this.)
- `#unsafe_fetch`, `#unsafe_copy_with`, and `#unsafe_shuffle` correspond to LLVM's `extractelement`, `insertelement`, and `shufflevector` instructions. For `shufflevector` the mask must be a constant literal; non-constant masks are only possible with platform-specific intrinsics, such as this proof-of-concept's `Intrinsics::X86.mm_shuffle_epi8`.
- `.cast` performs bitcasts between vector types with the same total bit sizes, e.g. from `SIMD::IntVector(Int8, 32)` to `SIMD::FloatVector(Float32, 8)` or `SIMD::BoolVector(Bool, 256)`. This is similar to `#unsafe_as`, but no pointers are involved because the cast may take place entirely inside a register. Bitcasts between pointer and non-pointer vectors are deliberately left undefined because pointers do not have a platform-independent fixed size.
- `.unaligned_load` and `#unaligned_store` are analogs of `Pointer(T)#value` and `#value=`, except the former pair assumes byte alignment, whereas the latter pair follows that of `T`. They are necessary since e.g. x86-64 `MOVAPS` segfaults on misaligned addresses, but they are not exclusive to vectors per se; any overaligned type will benefit from this.
- For simplicity, all vector-vector operations are implemented with a new `vector_zip` primitive, which requires the operand types to be identical. Many can be made to use `binary` instead with some effort; others map to LLVM intrinsics.
- The `vector_reduce` primitive is for reduction intrinsics defined by LLVM, like `BoolVector#all?`.
- `samples/simd/intrinsics/**` define the Intel intrinsics in terms of the vector types. They are a mixture of vector operations, target-specific LLVM intrinsics, and Clang compiler built-ins, all ported from Clang's `x86intrin.h` family. Most notably, none of them require inline assembly. Although not strictly necessary, they facilitate porting SIMD code in C that invokes the Intel intrinsics directly (which is the majority of them).
- Presence of a specific CPU feature must be explicitly declared. This proof-of-concept currently handles `x86_has_sse41` and `x86_has_avx2`. #14524 is supposed to take care of this in the compiler itself.
- `SIMD::Vector` deliberately does not include `Indexable`, even though vectors are seemingly compatible with immutable `StaticArray`s. If your vectors are used like that, chances are the right tools are not being used for the right tasks, and your code won't benefit from data parallelism.
- Notable features absent from the proof-of-concept: scalable vectors (e.g. for AArch64 SVE), predicated intrinsics (e.g. for AVX512), type-independent `#==` between vectors (this requires changing `vector_zip` to `binary` as mentioned above), overflow-checked arithmetic, correctly aligned allocations, C ABI support.
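Put together, using the proof-of-concept might look roughly like this (a sketch against the branch above; I only use names from the overview, and exact signatures may differ):

```crystal
# Requires the feature/vector proof-of-concept branch, not stock Crystal.
a = SIMD::IntVector(Int32, 4).literal(1, 2, 3, 4)
b = SIMD::IntVector(Int32, 4).literal(10, 20, 30, 40)

sum = a + b         # vector_zip primitive: operand types must be identical
sum.unsafe_fetch(2) # LLVM extractelement; third lane of {11, 22, 33, 44}
```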
`samples/simd/utf8_valid_encoding.cr` is the actual benchmark code for different implementations of `String#valid_encoding?`, parsing 1000 random 65536-byte valid UTF-8 strings. Run it with `samples/simd/run_benchmarks.sh` on an x86-64 machine with AVX2 support to see the speedup in action:
Baseline:
old 4.46 (224.14ms) (± 0.17%) 0.0B/op 8.25× slower
scalar 36.79 ( 27.18ms) (± 0.12%) 0.0B/op fastest
x86-64-v2:
old 4.51 (221.85ms) (± 0.18%) 0.0B/op 18.23× slower
scalar 36.79 ( 27.18ms) (± 0.14%) 0.0B/op 2.23× slower
sse4.1 82.16 ( 12.17ms) (± 1.44%) 0.0B/op fastest
x86-64-v3:
old 4.35 (229.86ms) (± 0.55%) 0.0B/op 26.51× slower
scalar 56.95 ( 17.56ms) (± 0.27%) 0.0B/op 2.03× slower
sse4.1 83.78 ( 11.94ms) (± 0.58%) 0.0B/op 1.38× slower
avx2 115.33 ( 8.67ms) (± 1.40%) 0.0B/op fastest
`scalar` is the stdlib implementation since #12145, `old` is the one before. The current implementation actually relies on the BMI2 extension from x86-64-v3 for the best performance. The AVX2 version in turn is 2-3 times as fast as that. (This is insufficient for #11873, as we most likely also want a faster `#each_char`.)
I would like to hear some initial feedback first; if we agree this is the general direction we want to take, I'll submit an RFC later and then we could sort out all the details.
I love it, awesome work @HertzDevil :heart_hands:
Starting an RFC sounds like a good idea.