crystal Optimize non-atomic memory allocation

Closes #14677 Closes #14678

This PR adds two new well-known functions used in the compiler:

__crystal_calloc64
__crystal_calloc_atomic64

These functions are analogues of __crystal_malloc64 and __crystal_malloc_atomic64, but they guarantee that any memory allocated using them is cleared. This works as an optimization because Crystal doesn't need to clear this memory itself, as it must for memory allocated using the malloc versions.
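To make the difference concrete, here is a minimal sketch of the "allocate, then clear" path that a clearing allocator makes redundant. The helper name alloc_then_clear is made up for illustration and is not part of this PR; it only mirrors the idea using existing stdlib calls.

```crystal
# Illustration only: `alloc_then_clear` is an invented helper, not the PR's code.
# It mirrors the malloc path: take memory whose contents are unspecified,
# then zero it explicitly afterwards.
def alloc_then_clear(count : Int32) : Pointer(UInt64)
  ptr = GC.malloc_atomic(count * sizeof(UInt64)).as(Pointer(UInt64))
  ptr.clear(count) # the explicit zero-fill that a clearing allocator would make unnecessary
  ptr
end

# Pointer.malloc promises zeroed memory, whichever path is used underneath.
builtin = Pointer(UInt64).malloc(1024)
manual = alloc_then_clear(1024)

puts builtin.to_slice(1024).all?(&.zero?)
puts manual.to_slice(1024).all?(&.zero?)
```

Both buffers should read as all zeroes. The point of the calloc variants is that the second step can be dropped whenever the allocator already hands out cleared memory.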

If the __crystal_calloc* functions cannot be found, the old behaviour is used.

Additionally, two new GC methods calloc and calloc_atomic have been added with the same behaviour.

The description of the GC method malloc (which clears memory in bdwgc and doesn't clear memory with no GC) has been updated to reflect that it does not always clear the memory. Unless the underlying GC is changed, this is not a breaking change.

In the case of bdwgc, only non-atomic memory allocations got faster.

Code:

require "benchmark"

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc") { Pointer(String).malloc(1) }
end

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc") { Pointer(String).malloc(2 ** 10) }
end

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc") { Pointer(String).malloc(2 ** 24) }
end
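For reference, the element counts used in these benchmarks map to allocation sizes as sketched below. This assumes a 64-bit target, where a String reference is 8 bytes wide.

```crystal
# String is a reference type, so each slot in a Pointer(String) buffer
# holds one pointer. On a 64-bit target (assumed here) that is 8 bytes.
element_size = sizeof(String)

# Element counts passed to Pointer(String).malloc in the benchmarks.
sizes = [1, 2 ** 10, 2 ** 24].map { |count| count.to_u64 * element_size }
sizes.each { |bytes| puts bytes }
```

Under that assumption the three sizes come out as 8 bytes, 8KiB and 128MiB.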

Results:

Bytesize	Before	After
8B	8.02ns	7.35ns
8KiB	824.02ns	746.20ns
128MiB	23.44ms	11.84ms

As these results show, the 128MiB allocation takes about half the time (23.44ms down to 11.84ms), while the 8B and 8KiB allocations improve by only about 8 to 9%. It may also be interesting to see how often LLVM can remove the memset completely.

More advanced benchmarks must still be done.

Jun 09 '24 21:06 BlobCodes

A logical follow-up to this PR would be to expose a non-clearing variant of Pointer.malloc in the stdlib (ex. as Pointer.malloc_unsafe) to speed up collection types without inner pointers.
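A rough sketch of what such a non-clearing variant could look like. The helper name malloc_uncleared is invented for illustration and is not a stdlib method. The sketch relies on GC.malloc_atomic, whose memory is not guaranteed to be zeroed, so it is only suitable for element types without inner pointers.

```crystal
# Hypothetical helper, not part of the stdlib or of this PR.
# Returns a buffer for `count` values of T without zeroing it first.
# Only safe for types that contain no pointers the GC needs to trace.
def malloc_uncleared(type : T.class, count : Int) : Pointer(T) forall T
  GC.malloc_atomic(count.to_u64 * sizeof(T)).as(Pointer(T))
end

buf = malloc_uncleared(UInt8, 16)

# The contents are unspecified until written, so fill every slot before reading.
16.times { |i| buf[i] = i.to_u8 }
total = buf.to_slice(16).sum(0)
puts total
```

A collection type that always writes each element before reading it could use such a helper to skip the zero-fill entirely.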

Jun 09 '24 21:06 BlobCodes

Nice speedup!

Though, I'm not sure about naming. The C calloc function involves two traits: allocate an array of n elements of size bytes then memory is set to zero, but we'd skip the main trait here.

I'd prefer to expose something more explicit, for example just GC.malloc(size, clear: true) and __crystal_malloc(size, clear: true) and same for the atomic versions.

Jun 10 '24 07:06 ysbaddaden

> I'd prefer to expose something more explicit, for example just GC.malloc(size, clear: true) and __crystal_malloc(size, clear: true) and same for the atomic versions.

The __crystal_malloc* functions are funs, not defs, so we don't have named args. Adding a second arg is a breaking change since the new compiler couldn't use older stdlibs anymore.

Also, I don't really see why clear should be a param instead of a function invariant while the same isn't true for atomic (ex. GC.malloc(20, atomic: true)).

> Though, I'm not sure about naming. The C calloc function involves two traits: allocate an array of n elements of size bytes then memory is set to zero, but we'd skip the main trait here.

The calloc function doesn't necessarily involve manually clearing (memset-ing) the allocated memory. For example, the Unix mmap syscall used for allocating large memory regions hands out contiguous 4KiB memory pages, which are only committed to a program on first access and are always cleared by the kernel itself on commit. Since calloc can always assume this is true, large allocations don't need to be cleared, and therefore don't need to be committed up front.

The main invariant of calloc (the memory is cleared) is implemented - however that may be.
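The mmap behaviour described above can be observed with a small sketch. It assumes a Unix-like target whose LibC bindings expose mmap and munmap, and a 4KiB page size. It is not how any particular calloc is implemented.

```crystal
# Unix-only sketch. Assumes LibC.mmap/LibC.munmap are bound for the target.
page_size = 4096 # assumed page size
page_count = 16_384
size = page_size * page_count # 64MiB

# Anonymous private mapping: the kernel supplies zero-filled pages on first access.
ptr = LibC.mmap(nil, size, LibC::PROT_READ | LibC::PROT_WRITE,
  LibC::MAP_PRIVATE | LibC::MAP_ANON, -1, 0)
raise RuntimeError.from_errno("mmap") if ptr == LibC::MAP_FAILED

bytes = Slice.new(ptr.as(UInt8*), size)

# Sample one byte per page. No memset has been issued by this program.
zeroed = (0...page_count).all? { |page| bytes[page * page_size] == 0 }
puts zeroed

LibC.munmap(ptr, size)
```

Every sampled byte should read as zero even though the program never cleared the region, which is why a calloc backed by fresh mappings can skip the clearing step.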

Jun 10 '24 14:06 BlobCodes