eio
Decide on cstruct vs bytes
Performing IO operations with the kernel requires giving the kernel the address of a buffer that the GC won't move during the operation. To ensure that, we use Cstruct.t everywhere. However, there are rumours (https://github.com/mirage/ocaml-cohttp/pull/819#discussion_r780305510) that regular strings in OCaml 5 are sure to stay put, at least once they're in the major heap.
- Using regular strings/bytes is likely to be slightly faster than a cstruct (which wraps a bigarray).
- It would integrate better with other OCaml APIs.
- We would have to ensure that a buffer is on the major heap (and move it if not) when doing IO. This would involve extra copying if not careful.
- There are severe length restrictions on strings on 32-bit platforms, as the top bits of the length field are used for other things.
- Only cstructs work with memory-mapped data.
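For concreteness, here is a minimal sketch, assuming nothing about the eventual API (the `slice` type and all function names are purely illustrative), of what a Cstruct-like view backed by plain `Bytes.t` could look like, keeping cheap, copy-free sub-slicing:

```ocaml
(* Hypothetical sketch: a Cstruct-style slice record backed by Bytes.t
   instead of a bigarray.  All names here are illustrative. *)
type slice = {
  buffer : Bytes.t;  (* underlying storage *)
  off : int;         (* start of the visible window *)
  len : int;         (* length of the visible window *)
}

let of_bytes ?(off = 0) ?len buffer =
  let len = match len with
    | Some l -> l
    | None -> Bytes.length buffer - off
  in
  assert (off >= 0 && len >= 0 && off + len <= Bytes.length buffer);
  { buffer; off; len }

(* Cheap sub-slicing: no copying, just arithmetic on a small record. *)
let sub t ~off ~len =
  assert (off >= 0 && len >= 0 && off + len <= t.len);
  { t with off = t.off + off; len }

let get t i =
  assert (i >= 0 && i < t.len);
  Bytes.get t.buffer (t.off + i)
```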
Tasks:
- [ ] Confirm that OCaml 5 guarantees that strings will not be moved.
- [ ] Measure performance difference of strings vs cstructs.
/cc @avsm @kayceesrk
The non-relocating guarantee of strings in multicore OCaml is implementation-defined, and not part of the language spec. So they aren't guaranteed to not move in the future, but for now there is no compactor in 5.0+trunk and no plans for one to show up...
Also another important point is that only values over a certain size are malloc-allocated and not stuck in the minor heap (i.e. similar to the direct major heap allocation in ocaml 4.x). So APIs will need to be careful of that if they use string.
What @avsm says is correct except for a small clarification:
malloc-allocated and not stuck in the minor heap
Any object larger than Config.max_young_wosize gets allocated in the major heap, which is not the same as malloc-allocated. Major allocator allocates small objects in its domain-local pools, and only allocates large objects using malloc; large in the major heap is > 128 words.
If you can ensure that your object gets allocated in the major heap, then in the current implementation, irrespective of its size, the object won't move.
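As a sketch of the care needed, assuming the default 256-word `Max_young_wosize` described above (the runtime doesn't expose this constant, so it is hard-coded here as an assumption): small buffers may start in the minor heap and move, but forcing a minor collection promotes them to the non-moving major heap before an IO operation:

```ocaml
(* Assumption: blocks larger than Max_young_wosize (256 words by default)
   are allocated directly on the major heap; smaller ones start in the
   minor heap and may move until promoted. *)
let max_young_bytes = 256 * (Sys.word_size / 8)  (* 2048 bytes on 64-bit *)

(* Ensure [b] is on the major heap before passing its address to the
   kernel.  Gc.minor () runs a minor collection, which promotes all live
   minor-heap values to the major heap. *)
let ensure_major (b : Bytes.t) =
  if Bytes.length b <= max_young_bytes then Gc.minor ();
  b
```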
Thanks for the clarifications. Thinking more about this, we'd still want something like Cstruct to select a slice over some bytes, so maybe Cstruct itself should change to be backed by bytes?
It's a shame there isn't a way to make bytes point to external memory. Then we could use a single representation for everything, including memory-mapped IO.
@stedolan points me at an experiment he did that might help: https://gist.github.com/stedolan/318f87db9f59f1acea771e7f4dd59cd4
To quote direct communication:
the Pinbuf module in that file has an alloc function that returns Bigstrings that can be used and GC'd as normal, but if you happen to call Pinbuf.release when you're done with a Bigstring it'll perform better (even if you only do it for some of the bigstrings you alloc) on my laptop:
$ ocamlopt -g unix.cmxa pinbuf.ml pinbuf_stubs.c -o pb && ./pb
                    time   buffers   minor GCs
bigstring:          1.9s   -1        19531
pinbuf (gc):        1.4s   1039      249
pinbuf (half-gc):   1.3s   545       246
pinbuf (manual):    1.1s   1         192
That's including a full memset of the buffer to simulate some amount of work. Without that, and just doing raw alloc / release, it does:
                    time   buffers   minor GCs
bigstring:          0.8s   -1        19531
pinbuf (gc):        0.1s   1039      249
pinbuf (half-gc):   0.1s   545       246
pinbuf (manual):    0.1s   1         192
(The tradeoff here is that memory allocated to pinbufs is cached forever and can't be reused for other things, or even for other pinbuf sizes, unlike bigstrings.)
So this is a pool of bigarrays, but if you forget to return one then the system will notice after a while and return it anyway?
It looks like the drawback is that when something says it's returning a buffer you can't be sure it really isn't going to use it again, which could provide for some fun debugging / data leaks. Also, everything has to be the same size. So it would work within a module for private data, but it's not a general purpose solution.
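For reference, the core pooling idea can be sketched in a few lines (stripped of Pinbuf's GC fallback and C stubs; this sketch keeps a free list of one fixed buffer size, and all names are illustrative):

```ocaml
(* Illustrative sketch of a fixed-size bigarray pool: [alloc] reuses a
   released buffer when one is available, [release] pushes it back.
   Unlike Pinbuf there is no GC fallback: a forgotten [release] simply
   leaves the buffer to the GC. *)
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

let pool_size = 4096
let free : bigstring list ref = ref []

let alloc () : bigstring =
  match !free with
  | b :: rest -> free := rest; b
  | [] -> Bigarray.Array1.create Bigarray.char Bigarray.c_layout pool_size

let release b =
  assert (Bigarray.Array1.dim b = pool_size);
  free := b :: !free
```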
While trying to count the number of characters in a file using eio, I stumbled on buffer access times. So here I am checking the performance difference (in time) between Bytes, Cstruct and Bigarray using this code. In this experiment, run on my machine, I created a buffer and filled it with the same character, then accessed each character in the buffer and measured the time taken.
The results show that, across the different buffer sizes, Bytes performs much better than the other types in every iteration, followed by Cstruct and then Bigarray. The same trend holds across multiple runs of this experiment.
$ ./_build/default/bytesCstructBigarray.exe 4096
Time Difference for Bytes 0.000019
Time Difference for Cstruct_str 0.000065
Time Difference for Cstruct_ba 0.000065
Time Difference for bigarray 0.000108
$ ./_build/default/bytesCstructBigarray.exe 8196
Time Difference for Bytes 0.000042
Time Difference for Cstruct_str 0.000142
Time Difference for Cstruct_ba 0.000141
Time Difference for bigarray 0.000239
$ ./_build/default/bytesCstructBigarray.exe 16384
Time Difference for Bytes 0.000103
Time Difference for Cstruct_str 0.000355
Time Difference for Cstruct_ba 0.000354
Time Difference for bigarray 0.000596
$ ./_build/default/bytesCstructBigarray.exe 32768
Time Difference for Bytes 0.000206
Time Difference for Cstruct_str 0.000711
Time Difference for Cstruct_ba 0.000689
Time Difference for bigarray 0.001415
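The benchmark loop under discussion is roughly of this shape (a simplified sketch, not the linked code; `Sys.time` stands in for whatever timer the original uses):

```ocaml
(* Sketch of the microbenchmark: fill a buffer with one character, then
   time a pass that classifies every byte. *)
let time_pass name f n =
  let t0 = Sys.time () in
  let count = f n in
  Printf.printf "Time Difference for %s %f (count=%d)\n"
    name (Sys.time () -. t0) count

let bytes_pass n =
  let buf = Bytes.make n 'a' in
  let c = ref 0 in
  for i = 0 to n - 1 do
    if Char.code (Bytes.unsafe_get buf i) < 128 then incr c
  done;
  !c

let bigarray_pass n =
  let buf = Bigarray.Array1.create Bigarray.char Bigarray.c_layout n in
  Bigarray.Array1.fill buf 'a';
  let c = ref 0 in
  for i = 0 to n - 1 do
    if Char.code (Bigarray.Array1.unsafe_get buf i) < 128 then incr c
  done;
  !c

let () =
  time_pass "Bytes" bytes_pass 4096;
  time_pass "bigarray" bigarray_pass 4096
```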
Thanks @deepali2806 - this is very interesting! I see the same thing on my machine. What's really odd is that cstruct uses bigarray internally, and in the same way, so it's very surprising that it's faster!
I tried a few things. This version is fast:
let rec countInBufferCstruct buffer noOfBytes init =
  if noOfBytes = 0
  then init + 1
  else
    let b = Char.code (Bigarray.Array1.unsafe_get buffer.Cstruct.buffer noOfBytes) in
    if b < 128 && b >= 0
    then countInBufferCstruct buffer (noOfBytes - 1) (init + 1)
    else countInBufferCstruct buffer (noOfBytes - 1) init
But this version (which is the same except it only looks up buffer.buffer once instead of for every character) is as slow as the original bigarray test!
let rec countInBufferCstruct buffer noOfBytes init =
  if noOfBytes = 0
  then init + 1
  else
    let b = Char.code (Bigarray.Array1.unsafe_get buffer noOfBytes) in
    if b < 128 && b >= 0
    then countInBufferCstruct buffer (noOfBytes - 1) (init + 1)
    else countInBufferCstruct buffer (noOfBytes - 1) init

let countInBufferCstruct buffer noOfBytes init =
  countInBufferCstruct buffer.Cstruct.buffer noOfBytes init
Yeah... even I thought Cstruct and Bigarray would have approximately equal times, since both access the same bigarray structure (I am not aware of the internal memory allocations, though). Also, in the second version you provided, since the lookup is done only once, shouldn't it be faster?
On a side note, in a similar eio application counting characters, I also checked whether I could replace Uring.region.to_cstruct with Uring.region.to_string for a 4 GB file. It turns out the application with to_string was 1.5x faster than the other.
It might be worth checking the assembler output (or pasting the examples into https://godbolt.org/ with OCaml selected). It could also be an alignment issue - I think @ctk21 mentioned getting performance improvements by inserting no-ops in places!
So I took the example over to https://godbolt.org/ and I think I've discovered the problem (though I don't know enough about the compiler internals to say where the optimisation comes from): the slow version doesn't know the layout of the bigarray itself.
Compare the assembly output from https://godbolt.org/z/EsbW3nEfK with https://godbolt.org/z/59Pj71bTT, where the only difference is adding the type annotation (b : buffer) to the function f. The annotation avoids the (presumably expensive) C call, where the generic code has to deal with the possibility that the buffer has a fortran_layout or something.
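In sketch form, the difference being described is just this (here `buffer` matches the definition of `Cstruct.buffer`; the function names are illustrative):

```ocaml
type buffer =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Slow: the element type and layout are left polymorphic, so each
   access compiles to a generic C call into the runtime. *)
let f_generic b i = Bigarray.Array1.get b i

(* Fast: the annotation fixes the element type and layout at compile
   time, so the access compiles to a direct load. *)
let f_annotated (b : buffer) i = Bigarray.Array1.get b i
```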
Annotating the buffer function in @deepali2806's code with Cstruct.buffer produces the following with OCaml 4.14:
dune exec -- ./main.exe 12871268
Time Difference for Bytes 0.019615
Time Difference for Cstruct_str 0.046756
Time Difference for Cstruct_ba 0.044039
Time Difference for bigarray 0.015194
------------------------------
(and probably go faster with unsafe_get)
Only half-joking here: maybe neither is the good option?
Using an n-dimensional array abstraction which supports floats, etc. (bigarray) is problematic because, if you forget type annotations, the overhead is high. And cstruct is a patch on top of bigarray to make slicing cheaper. Using bytes might be ok in OCaml 5.0 but, like @avsm said, it's implementation-specific, so not very future-proof.
Soooo… why isn't there an equivalent of Cstruct.t that is a small OCaml record (cheap copy/slicing) backed by a malloc'd array of bytes + refcount? With the guarantee that it's always a uint8_t * behind, and no flexibility in layout or anything like that? It should be more lightweight and more predictable than Cstruct.t, and more future-proof than relying on bytes being pinned.
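A rough sketch of that proposal, with `Bytes.t` standing in for the malloc'd storage (a real version would wrap a `uint8_t *` via C stubs and free it when the count reaches zero; all names are illustrative):

```ocaml
(* Illustrative refcounted buffer: a small record gives cheap slicing
   over a shared backing store with an explicit reference count. *)
type backing = { data : Bytes.t; mutable refs : int }
type t = { backing : backing; off : int; len : int }

let create len =
  { backing = { data = Bytes.create len; refs = 1 }; off = 0; len }

(* Slicing bumps the refcount; the slice shares the same storage. *)
let sub t ~off ~len =
  assert (off >= 0 && len >= 0 && off + len <= t.len);
  t.backing.refs <- t.backing.refs + 1;
  { t with off = t.off + off; len }

let release t =
  t.backing.refs <- t.backing.refs - 1;
  if t.backing.refs = 0 then
    (* a malloc-backed version would free the storage here *)
    ()
```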
The only reason we don't have a specific malloc-backed IO data type is that the compiler primitives for efficiently marshalling/unmarshalling uint8/16/24/32 with BE/LE support only work on either Bigarray or bytes at the moment.
I think it's just fine to assume non-moving strings in OCaml 5+ in the near term to unblock us. Especially if they are 1KB+ in size, since even the future introduction of a compactor is highly unlikely to move larger word size values around (they are slab allocated via malloc in the 5.0.0 GC).
Moving isn't the only problem. As mentioned at the top of the issue:
- There are severe length restrictions on strings on 32-bit platforms, as the top bits of the length field are used for other things.
- Only cstructs work with memory-mapped data.
I think we can make non-moving bytes work with mmap -- over a certain size, they are slab allocated (and therefore can be suitably aligned), and under a certain size there isn't much use going through the overhead of an mmap.
As for 32-bit, I think we need to make a design decision as to how much that should affect the eio architecture. Upstream OCaml is moving firmly towards only being performant on 64-bit, with 32-bit architectures all bytecode only. I'd be happy with an eio that worked on 32-bit, but with limitations on the size of data it can address.
I think we can make non-moving bytes work with mmap -- over a certain size, they are slab allocated (and therefore can be suitably aligned), and under a certain size there isn't much use going through the overhead of an mmap.
The problem is that the byte block's header needs to go just before the start of the mmap area.
The problem is that the byte block's header needs to go just before the start of the mmap area
Right, but we've had this working before (in MirageOS 1) with either a larger contiguous string which we chunk in a 4K page-aligned way (burning the first page, which holds the OCaml header), or by just allocating a 4K string and adjusting the offset of writes to not overwrite the header.
Both strategies should be compatible with pretty much any reasonable IO backend, since they all support offsets. And cstruct already layers an OCaml record which records the offset/length of the writeable/readable portion of the buffer.
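The chunking arithmetic behind the first strategy can be sketched as follows (addresses modelled as plain ints; the helper names are illustrative):

```ocaml
let page_size = 4096

(* Smallest page-aligned address at or after [addr]. *)
let align_up addr = (addr + page_size - 1) / page_size * page_size

(* Given a buffer starting at [addr] with [len] bytes, return the offset
   of its first page-aligned byte (burning the partial page that holds
   the OCaml header) and the number of whole 4K chunks that fit. *)
let aligned_region ~addr ~len =
  let off = align_up addr - addr in
  let chunks = (len - off) / page_size in
  (off, chunks)
```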