Idris2
Idris2 copied to clipboard
RefC Backend + makeString memory problems
The way strings are handled as (char *) poses problems. I don't see an easy way out. Here is the deal:
Problem
- The function
Value* makeString(char*)creates a copy of the input (including malloc) to create the string object. - Functions, such as
fastConcatalso allocate new memory (which is not freed anywhere but should be) - Functions, such as
idris2_strerror(int)return a pointer to an error message (which should not/can not be freed) - The RefC backend indiscriminately creates the following code
{
// ffi call to _charPointer_returning_function_
char * retVal = _charPointer_returning_function(...);
return (Value*)makeString(retVal);
}
Thus, invoking functions such as fastPack, or fastConcat, but also idris2_currentDirectory leak memory.
As an example, I compiled Idris itself with the RefC backend (with minor changes along the way to make it compilable), and started it. Just displaying the REFL and immediately exiting, without any actual computation leaks about 1MB already.
Potential solutions
I can see three different approaches, I don't know which ones are feasible. I probably also miss an obvious solution, though.
1. changing the datatype of String from char* to something like a struct with freeing information
typedef struct {
char* c;
int memory_managed_externally;
}Idris2_CString;
Functions like fastConcat would then return a struct with the flag set to 0, so makeString can free the pointer, whereas other functions may set it to 1.
2. Create a convention that all FFI functions that return char* have to allocate new memory.
This new memory is then consistently freed by makeString (resp. the garbage collector whenever the String object goes out of scope).
Both of these options might break existing code (resp. FFI calls) that interacts with Idris.
3. Use different primitives in the Idris code to indicate whether or not the char* should be kept or discarded.
This, however, seems to break the code the most. CFTypes would need to distinguish between StringFreeable and StringNonFreeable. Consider:
%foreign "C:idris2_getString, libidris2_support, idris_support.h"
"javascript:lambda:x=>x"
export
prim__getString : Ptr String -> String
used in functions such as fGetLine:
fGetLine : HasIO io => (h : File) -> io (Either FileError String)
fGetLine (FHandle f)
= do res <- primIO (prim__readLine f)
if prim__nullPtr res /= 0
then returnError
else ok (prim__getString res)
now prim__readLine returns a char * (which is then as pointer, not as string) passed to prim__getString, which converts it into a char* (very nice), but then calls makeString which allocates new memory (bad) => memory leak. This would need to change to
%foreign "C:idris2_getString, libidris2_support, idris_support.h"
"javascript:lambda:x=>x"
export
prim__getStringFreeable : Ptr String -> StringFreeable
Very ugly, and maybe not even possible.
Personally, I prefer the second option, but I don't have a strong opinion.There are not (yet) a lot of C libraries that interact with Idris2, and the changes seem minimal.
FWIW, I agree the second option looks the most natural. First option might be more flexible, but maybe it could lead to unnecessary complexity in the future?
I guess the main point that @vfrinken makes is this: the current code is quite substantially broken, so someone does need to go through all the C FFI functions (in this code base, but also in external code bases) one by one and implement whichever solution we choose.
So the decision is not really whether to make any breaking changes, but how to make the function-by-function fixes straightforward, possibly by following a simple decision procedure.
I think (2) describes roughly what the FFI documentation (undoubtedly written mostly or entirely with the Scheme backend in mind) describes currently, if I am understanding correctly.
Snippet from the Idris documentation:
A char* returned by a C function will be copied to the Idris heap, and the Idris run time immediately calls free with the returned char*.
Yes, @ohad, that is exactly the point I was trying to make. I went through the CFFI functions used in the Idris code itself and made the changes (only the direct call to getenv(...) needed a workaround) and was able to get a code base that was free of memory leaks in one afternoon/evening, so that is at least proof-of-concept that it is doable. (https://github.com/vfrinken/Idris2-RefC-Coronado) That code also implements the PR about the GC change.
The Idris documentation is quite clear, but the Idris code itself doesn't follow the guideline. :-(
Anyway, I'm going to argue in favor of 1. now, because I like it, too:
- Having a struct with meta-information about a string is more flexible, particularly if we want to support different encodings in the future. The struct could have some information how the input is encoded.
- A fundamental break in how strings are imported/exported forces a user to explicitly think about the memory issue, instead of some implicit convention. To write bug-free code, maybe this is necessary. Also, the compiler would complain if you overlooked a function without changing
I won't have time to code anything in the next few days, maybe even until after the Developer Meeting. If by that time the consensus is on 2. I'm happy to turn my code into a PR.
A char* returned by a C function will be copied to the Idris heap, and the Idris run time immediately calls free with the returned char*.
This is exactly what's problematic. You can't return char* allocated by another allocator. Maybe the data is on the stack? Maybe the data is in .data or .text?
I didn't see this until after I filed #2552, but the scheme backends have the same problem. They appear to be leaking a lot of memory (e.g. 100k leaks from fGetLine when loading parts of the compiler source in the REPL / LSP).