Idris2-Erlang
Improve IORef/Buffer implementations
Data.IORef, Data.IOArray (built on top of Data.IORef) and Data.Buffer are all abstractions that provide mutability. Implementing them in Erlang turns out to be not so easy (to me, at least).
I recommend not using the Data.IORef, Data.IOArray and Data.Buffer modules in your code if you plan to generate Erlang, as these primitives currently leak memory.
Solutions considered | Problems encountered |
---|---|
ETS | Leaks memory because the run-time does not know when to remove the values. |
Custom gen_server | Leaks memory because the run-time does not know when to remove the values. |
Process dictionary | Leaks memory because the run-time does not know when to remove the values. |
Atomics | Didn't find a way to resize the Atomic after it is created. |
NIF | Read more below. |
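The first three rows share one failure mode: the value lives in a table and the "reference" is only a key, so nothing tells the table when the key is no longer reachable. As a rough illustration only, here is that pattern in Rust with the standard library. `Registry`, `new_ref` and `leaked_entries` are hypothetical names invented for this sketch; it is not the code generator's actual implementation, and it models ETS or the process dictionary only loosely.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for ETS or the process dictionary: values live in a
/// table, and the "reference" handed to user code is just a key.
struct Registry {
    next_key: u64,
    values: HashMap<u64, String>,
}

impl Registry {
    fn new() -> Self {
        Registry { next_key: 0, values: HashMap::new() }
    }

    /// Stores a value and returns the key that plays the role of the IORef.
    fn new_ref(&mut self, value: String) -> u64 {
        let key = self.next_key;
        self.next_key += 1;
        self.values.insert(key, value);
        key
    }
}

/// Creates three keyed values, forgets every key, and reports how many
/// entries the table still holds.
fn leaked_entries() -> usize {
    let mut registry = Registry::new();
    for i in 0..3 {
        let _key = registry.new_ref(format!("value {}", i));
        // `_key` goes out of scope here, but nothing notifies the registry.
    }
    registry.values.len()
}
```

Calling `leaked_entries()` shows that all three entries are still stored even though no key survives, which is the leak described in the table.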
NIF
I implemented both the IORef and Buffer types as a NIF: https://github.com/chrrasmussen/mutable_storage [Rust]
The Buffer implementation works as expected, and the IORef implementation mostly works. The blocker is that IORef can't store Erlang references, or rather, if it does, the Erlang reference gets serialised and the value pointed to by the reference is potentially garbage collected. When reading back the Erlang reference, it might not point to anything, which would lead to a run-time error if accessed.
Using a NIF also makes the project harder to distribute.
Possible solutions
It might be possible to solve the issue in the Idris 2 compiler (or the code generator):
- By applying some form of reference counting in the generated code. This would cause some extra run-time overhead.
- By using linear types in some way. Currently, information about linear types is not exposed to the code generators.
Or a long shot:
- The Erlang VM could provide a mutable box type (same functionality as IORef)
  - Similar to Atomics, except it can store any Erlang term and keep other Erlang references alive.
  - Also similar to ETS, except the values would be garbage collected.
  - With a mutable box type, it should be possible to implement all of the primitives above (IORef, IOArray and Buffer)
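To make the wished-for semantics concrete, here is a minimal sketch of such a mutable box in Rust, using only the standard library. `MutBox` and `mutable_box_demo` are hypothetical names; nothing like this exists in the Erlang VM today. The point is only the behaviour: every handle sees the latest write, the stored value can change size freely, and the contents stay alive for as long as any handle does.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical mutable box: a shared, replaceable slot for one value.
/// Cloning the box clones the handle, not the contents.
#[derive(Clone)]
struct MutBox<T> {
    slot: Arc<Mutex<T>>,
}

impl<T: Clone> MutBox<T> {
    fn new(value: T) -> Self {
        MutBox { slot: Arc::new(Mutex::new(value)) }
    }

    /// Returns a copy of the current contents.
    fn read(&self) -> T {
        self.slot.lock().unwrap().clone()
    }

    /// Replaces the contents; the previous value is dropped.
    fn write(&self, value: T) {
        *self.slot.lock().unwrap() = value;
    }
}

/// Writes through one handle and reads through another.
fn mutable_box_demo() -> (String, String) {
    let cell = MutBox::new("small string".to_string());
    let alias = cell.clone();
    let before = alias.read();
    cell.write("very long string. very long string".to_string());
    (before, alias.read())
}
```

`mutable_box_demo()` reads the old value and then the new, longer value through the second handle, mirroring what `readIORef` and `writeIORef` are expected to do.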
Current implementation
The current implementations of Data.IORef and Data.Buffer use the process dictionary (ETS would probably work as well), which means they leak memory.
One 'solution' could be to implement IORef as a NIF that stores the value entirely inside the Rust struct (using e.g. serde_rustler to convert BEAM types to Rust structs). The IORef as a whole can then safely be handed to the BEAM runtime, which can GC it when necessary and, at that point, GC the contents of the IORef as well.
The main disadvantage over storing Erlang references is obviously that we need to copy datatypes when putting them in the IORef rather than being able to rely on Erlang's builtin reference counting here. However, it should not leak memory as the IORefs themselves are now GC'd correctly.
Interestingly, Data.Buffer could probably be implemented very easily using the process dictionary, ETS or one of the other choices above, since as far as I can see Idris2 currently requires buffers to be freed manually using the freeBuffer primitive.
Another question: Why is it necessary to be able to resize the internals of an IORef? If this is not actually necessary you might be able to use :atomics after all.
Thanks for all your suggestions! 😃 I need to look more into them. For now, I will try to answer your questions with my current understanding.
Data.IORef
> The blocker is that IORef can't store Erlang references, or rather, if it does, the Erlang reference gets serialised and the value pointed to by the reference is potentially garbage collected. When reading back the Erlang reference, it might not point to anything, which would lead to a run-time error if accessed.
To illustrate this problem in terms of Erlang: I am using atomics in this example, but it applies to any reference. The following example works as one would expect (returning the value 0). Run it in the Erlang REPL:

    Ref = atomics:new(1, []).
    atomics:get(Ref, 1).

The problem is that once the reference is serialised, no live references remain, and the atomics array gets garbage collected.

    SerialisedRef = term_to_binary(atomics:new(1, [])).
    atomics:get(binary_to_term(SerialisedRef), 1).

Results in an error:

    ** exception error: bad argument
         in function  atomics:get/2
            called as atomics:get(#Ref<0.534261318.1692008460.78940>,1)

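The same effect can be sketched outside the BEAM. The following is an analogy only, not a description of how Erlang serialisation works: a Rust `Weak` handle plays the role of the serialised reference, because it names a value without keeping it alive. `dangling_lookup` is a hypothetical name used for this illustration.

```rust
use std::sync::{Arc, Weak};

/// Looks a value up through a non-owning handle twice: once while an owner
/// exists, and once after the only owner has been dropped.
fn dangling_lookup() -> (Option<u64>, Option<u64>) {
    // The owner plays the role of the live Erlang reference.
    let owner = Arc::new(0u64);
    // The weak handle plays the role of the serialised reference.
    let handle: Weak<u64> = Arc::downgrade(&owner);

    let while_owned = handle.upgrade().map(|value| *value);

    // Dropping the last owner frees the value, as the garbage collector
    // does once no live references remain.
    drop(owner);
    let after_drop = handle.upgrade().map(|value| *value);

    (while_owned, after_drop)
}
```

The first lookup yields `Some(0)` and the second yields `None`. Rust reports the missing value explicitly, whereas the Erlang example fails with `bad argument`.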
Resizing IORef
The way Data.IORef works is that the value it contains can be changed at any later point. The size of the value stored in the IORef might vary wildly. There is no upper bound on how much data can be stored in the IORef, which means the underlying storage needs to be resized if it is too small.

    import Data.IORef

    main : IO ()
    main = do
      ref <- newIORef "small string"
      smallStr <- readIORef ref
      putStrLn smallStr
      writeIORef ref "very long string. very long string"
      longStr <- readIORef ref
      putStrLn longStr

Prints:

    small string
    very long string. very long string

Both strings are written to the same IORef reference. I was thinking that maybe it is possible to create a new atomics, but that would also lead to a new Erlang reference.
Data.Buffer
A quick search in the Idris 2 source code indicates that the freeBuffer function is not used. If I remember correctly, the Buffer implementation was changed to use a C implementation at one point (instead of the Scheme implementation), but it was later changed back to use the Scheme implementation. I think freeBuffer is a remnant from that change.
> There is no upper bound on how much data can be stored in the IORef, which means it needs to be resized if it is too small.
I see. For some reason I thought that it would always itself contain a reference, but in hindsight that does not make any sense. That is indeed a clear reason why :atomics would not work.
Also thank you for more information about Data.Buffer.
I expect that the easiest way forward would then be to implement it as a NIF.
After looking thoroughly at the documentation of Erlang's NIFs I found out that it is possible to pass a destructor function when creating a custom resource type.
Looking deeper inside Rustler, it seems that we can create our own cross-process reference-counted box for arbitrary data by using Rustler's existing ResourceArc. It probably already does what we need:
As far as I can see, ResourceArc will not serialize what is to be stored inside, but instead increment the reference count of the thing it contains on construction and decrement it on destruction. Its own destruction is of course triggered iff all of the references to the ResourceArc are themselves GC'd by the Erlang VM.
Thanks again! If it is possible to avoid serializing the data from Erlang, that might work! My current implementation of IORef is built on top of Buffer, which means the data from Erlang is serialized.
I added a small test file that reproduces the error (in branch ioref-test).
My Rust skills are not really up to par. If you would like to give it a try, I would be very happy 😊
I had to jump through some hoops in order for the :atomics reference to be deallocated. You can run it using: mix run ioref_test.exs
Author of Rustler here, just dropping in to give some context/clarifications.
What NIF resources do is allow you to opaquely store data inside a handle that is managed and garbage collected by the Erlang VM. They pretty much only wrap a pointer, and do not allow storing terms inside by themselves.
When using ResourceArc (our wrapper for resources) in Rustler, you can simply implement Drop for your inner type, and do whatever you need to do when the type inside the ResourceArc is dropped. Drop is a standard Rust thing; we just call the drop implementation when the BEAM calls the destructor.
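As a small illustration of the Drop mechanism described here, the sketch below uses `std::sync::Arc` as a stand-in for ResourceArc, so it does not call Rustler at all; the assumption is only that both run the inner type's `drop` once the last handle is gone. `Payload` and `drop_counts` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical payload standing in for the data behind an IORef.
struct Payload {
    value: String,
    /// Shared counter so the caller can observe that `drop` ran.
    drops: Arc<AtomicUsize>,
}

impl Drop for Payload {
    fn drop(&mut self) {
        // Cleanup hook: runs once, when the last owning handle goes away.
        self.drops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Creates two handles to one payload, drops them one at a time, and returns
/// the value read through the second handle plus the drop count after each step.
fn drop_counts() -> (String, usize, usize) {
    let drops = Arc::new(AtomicUsize::new(0));
    let first = Arc::new(Payload {
        value: "small string".to_string(),
        drops: Arc::clone(&drops),
    });
    let second = Arc::clone(&first);

    drop(first);
    // The payload is still reachable through `second`.
    let seen = second.value.clone();
    let after_first = drops.load(Ordering::SeqCst);

    drop(second);
    let after_second = drops.load(Ordering::SeqCst);

    (seen, after_first, after_second)
}
```

The drop count stays at 0 while one handle remains and becomes 1 only after the last handle is dropped. In a NIF the "last handle" would be released by the BEAM's garbage collector rather than by an explicit `drop` call.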
Interestingly, there is actually a way to own and store terms in native data structures in NIFs, and that's owned environments. However, the caveat here is that this requires copying terms into and out of that owned environment whenever you want to pass it to/from the process the NIF runs in. There is also no way to deallocate individual terms from the owned env without clearing the whole thing.
If you simply need to serialize a term as a binary, the most performant and simple way would be to use Term::to_binary, which will encode the term in the ETF format. This supports all term types, but you would still have a problem with references getting GCd.
Thank you for the insights, @hansihe! ❤️
Also thanks for making Rustler! It was very easy to get it going, even for someone who was completely new to Rust.
By the way, IORef and Buffer are very low-level primitives which Idris2 uses for increased efficiency. However, I think that it might very well be possible that a couple of datatypes which Idris2 builds on top of IORefs/Buffers would actually be more efficient in Erlang when implemented in their more 'natural' way, since Erlang itself already performs a lot of efficiency-optimizations for e.g. small binaries vs large binaries, small maps vs large maps, iolists, etc.
@Qqwy That's true.
I think the biggest reason for supporting IORef and Buffer is that they might be used in some Idris 2 libraries. Looking at the modules in the Idris 2 libraries (prelude, base, contrib) I found the following usages:
- IORef
  - Control.App (base)
  - Control.Monad.ST (base)
  - Data.Ref (base)
- Buffer
  - Only used in the Idris 2 compiler
Another reason is that IORef and Buffer are used in the Idris 2 compiler. The Idris 2 compiler is already running on the BEAM, but it would be even nicer if the Erlang version came close to the performance of the Chez Scheme version, and if it did not leak memory. With that said, there might be other ways to achieve this, for example by rewriting the parts of the compiler that use IORef and Buffer. Rewriting just these parts might not be sufficient though.
In general, I would say that IORef and Buffer are not needed. When writing Idris 2 code that is intended for Erlang there are also other options, like using ETS, GenServer etc.