NuProcess
Don't depend on the GC to free native memory on Linux
JNA frees the 192 KiB of per-NuProcess native buffers only on GC finalizer / reference-queue processing. In cases with very low GC activity, this is essentially a memory leak, potentially causing native memory issues.
- out/err/in buffers are freed on onExit,
- the event needed for Epoll #registerProcess and #queueWrite is allocated and freed within the method,
- the IntByReference is reused for the duration of the epoll/kqueue processor,
- updated the JNA library to 5.13.0 to get #close on the Memory object (added in 5.12.0); a sketch of the pattern follows.
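For context, this is the deterministic-free pattern Memory#close enables (a minimal sketch; the variable names are illustrative, not NuProcess internals):

```java
import com.sun.jna.Memory;
import java.nio.ByteBuffer;

// Allocate a native buffer and release it explicitly when the process exits,
// instead of waiting for JNA's finalizer/reference-queue processing.
Memory outMemory = new Memory(65536); // one 64 KiB buffer; out/err/in together = 192 KiB per process
ByteBuffer outBuffer = outMemory.getByteBuffer(0, outMemory.size());

// ... pump process output through outBuffer while the process runs ...

outMemory.close(); // JNA >= 5.12: frees the native allocation immediately
```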
@bturner What are your thoughts on explicitly freeing the per process EpollEvent on call to onExit()? The complication is that some synchronization guards are needed to avoid any API misuse that can cause a SegFault.
E.g. calling process.start().waitFor(0, TimeUnit.DAYS); and then process.wantWrite(); can cause a SegFault in a rare case (easy to reproduce) if the EpollEvent native memory is freed onExit and no additional guards are put in place in queueWrite (see the guard sketch below).
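For illustration, the kind of guard that would be needed (a rough sketch; all names are hypothetical, not the actual NuProcess code):

```java
// Hypothetical guard inside the epoll processor, if the per-process
// EpollEvent were freed in onExit().
private final Object eventLock = new Object();
private EpollEvent epollEvent; // freed in onExit() while holding eventLock
private boolean eventFreed;

void queueWrite(int stdinFd) {
    synchronized (eventLock) {
        if (eventFreed) {
            // Process already exited and the event's native memory was
            // released; touching it now would be a use-after-free.
            return;
        }
        epollEvent.setEvents(LibEpoll.EPOLLOUT | LibEpoll.EPOLLONESHOT);
        epollEvent.setFileDescriptor(stdinFd);
        // ... epoll_ctl(...) ...
    }
}
```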
Also, what are your thoughts on making the buffer configurable? Like a global setting?
What are your thoughts on explicitly freeing the per process EpollEvent on call to onExit()?
For this review, I'd leave it out. Synchronization brings in its own costs that need to be balanced. I'm not saying we shouldn't do it, but rather that it's better to separate "contentious" changes from the simple ones that can be readily merged. That way I can accept this pull request and do a new release with some improvements, and then we can have a more focused discussion about further improvements and the trade-offs they entail.
Also, what are your thoughts on making the buffer configurable? Like a global setting?
I'm not sure what buffer you're referring to. The size for the stderr, stdout and stdin buffers? Something else?
Ok, agreed. I'll put the explicit close of the per-process EpollEvent in another pull request.
Yeah, I saw the offset handling for the EPoll struct. Wow. Didn't imagine JNA would be this inefficient, to the point of meriting doing it yourself with offsets. Not very convenient then.
fwiw: doing some simple testing, I'm also not seeing the benefit of polling 20 events instead of 1. I do consistently see 2 events queried. But anyway, for now I left the poll alone.
For the configuration I meant the setting of NuProcess#BUFFER_CAPACITY. For my use case, I think 4K would work just fine.
I'll clean up the code and let you know when it is ready to be reviewed again.
fwiw: doing some simple testing, I'm also not seeing the benefit of polling 20 events instead of 1. I do consistently see 2 events queried. But anyway, for now I left the poll alone.
That matches what I saw before. I could see >1 event being returned, but performance and profiling characteristics were identical either way. (Polling for more can reduce the number of syscalls, but since handler processing dominates the performance so strongly, taking away milliseconds (or, more likely, nanoseconds) of syscall execution simply doesn't move the needle.) Based on that, I think leaving it at single poll is probably fine (and keeps the code simpler).
For the configuration I meant the setting of NuProcess#BUFFER_CAPACITY. For my use case, I think 4K would work just fine.
Brett and I had previously at least touched on the idea of making that configurable (see here). A system property approach would be pretty straightforward, so it might be a good place to start. I've thought before about what might be involved in making it more readily runtime-configurable (e.g. so that, when you start a process, you can request a buffer size as part of that, because not all processes have the same I/O needs), but haven't actually sat down to try and do it.
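A sketch of what the system-property approach could look like (the property name is an assumption, not a settled API; 65536 is the current default):

```java
// In NuProcess: allow -Dcom.zaxxer.nuprocess.bufferCapacity=4096 to override
// the 64 KiB default used for the stdin/stdout/stderr native buffers.
public static final int BUFFER_CAPACITY =
        Integer.getInteger("com.zaxxer.nuprocess.bufferCapacity", 65536);
```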
I have the same thoughts about the setting of the out/err/in buffer sizes. I'll prepare another pull request for the system property approach; I also think it is a good place to start. For more fine-grained control, the process builder is probably the place to add it. But maybe that's unnecessary.
@bturner I think this is it. Updated JNA to get the #close method, the out/err/in buffers are closed, and I added the tweak to reuse the IntByReference pointer in the epoll/kqueue processors.
Btw: wow, the Kqueue is using the Struct KEvent; indeed it is heavy. I expected some bespoke code generation from JNA. Not even close.
I still have my eyes on the per-process Event allocation that isn't freed right away. I'll think of something.
For the per-process event: I know you'd prefer to do this in a different PR, but I think I came up with a solution that is simple enough. The Event is needed by registerProcess(), and by wantWrite() and writeStdin(ByteBuffer).
For the registration, we can allocate the event and free it right away inside the registration method. This way the per-process event is not needed; if you don't do any stdin work, no further allocations are necessary. (Sketch below.)
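Roughly like this (a sketch; the exact epoll_ctl binding signature and EpollEvent's getPointer()/close() methods are assumptions here):

```java
// Registration: the event only lives for the duration of this method,
// so nothing per-process is retained for processes that never write.
EpollEvent event = new EpollEvent();
try {
    event.setEvents(LibEpoll.EPOLLIN);
    event.setFileDescriptor(stdoutFd);
    LibEpoll.epoll_ctl(epollFd, LibEpoll.EPOLL_CTL_ADD, stdoutFd, event.getPointer());
    // ... same for stderr ...
} finally {
    event.close(); // free the native memory before returning
}
```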
For stdin handling, I prefer on-demand init of the event that is then used by calling wantWrite and writeStdin. Technically, multiple threads can call wantWrite and writeStdin, but the race is not an issue since all threads are setting the same values, which are atomic (on x86 at least):
event.setEvents(LibEpoll.EPOLLOUT | LibEpoll.EPOLLONESHOT | LibEpoll.EPOLLRDHUP | LibEpoll.EPOLLHUP);
event.setFileDescriptor(stdin);
If 2 threads were calling writeStdin and the fields were 64 bits, a tear could occur. But this is effectively thread safe (on platforms where 32-bit writes are atomic). I see you use a ConcurrentLinkedQueue<ByteBuffer>, so I assume concurrent access to writeStdin is expected. (See the init sketch below.)
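Roughly the on-demand init I have in mind (field and method names are illustrative):

```java
private volatile EpollEvent stdinEvent; // created on first wantWrite/writeStdin

private EpollEvent stdinEvent(int stdinFd) {
    EpollEvent event = stdinEvent;
    if (event == null) {
        synchronized (this) {   // guard the allocation so we never leak a duplicate
            event = stdinEvent; // re-check under the lock
            if (event == null) {
                stdinEvent = event = new EpollEvent();
            }
        }
    }
    // Benign race: concurrent callers all store identical 32-bit values,
    // so unsynchronized writes here cannot produce a torn read.
    event.setEvents(LibEpoll.EPOLLOUT | LibEpoll.EPOLLONESHOT | LibEpoll.EPOLLRDHUP | LibEpoll.EPOLLHUP);
    event.setFileDescriptor(stdinFd);
    return event;
}
```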
Apart from the on-demand per-process init of the Event when doing wantWrite and writeStdin, I also added closeStdinEpollEvent for immediate release (if you don't want to wait on the GC). This way control of the Event is on the caller, as we cannot be sure no further wantWrite and writeStdin calls will be made on an exited process, so we cannot use onExit to close the stdin event.
Let me know what you think. Like I mentioned, I want all of the malloc'd stuff to be explicitly freed when the process exits. With these changes this is achieved for start() and also run.

Aside: I expected run to return an int.
That approach doesn't really make sense to me, unfortunately.
- It requires more allocations, since the EpollEvent for registration is no longer reused for wantWrite; anything that will write will now result in the creation of (minimum) 2 EpollEvents where the existing code would allocate 1
- It leaks the implementation details above the API, because the caller is now made aware that a) there's an EpollEvent that b) needs to be manually released
For point 1, if we're not massively concerned about allocating a few EpollEvents (and, so long as they're explicitly released, I'm not sure we need to be), then I'm not sure why you wouldn't make a similar change in ProcessEpoll::queueWrite to what you have made in registerProcess and "just" have queueWrite create a short-lived EpollEvent, use it to register and then close it.
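i.e., roughly (a sketch; error handling and the real binding signature omitted):

```java
// queueWrite with a short-lived event: allocate, register, close.
void queueWrite(int stdinFd) {
    EpollEvent event = new EpollEvent();
    try {
        event.setEvents(LibEpoll.EPOLLOUT | LibEpoll.EPOLLONESHOT | LibEpoll.EPOLLRDHUP | LibEpoll.EPOLLHUP);
        event.setFileDescriptor(stdinFd);
        LibEpoll.epoll_ctl(epollFd, LibEpoll.EPOLL_CTL_MOD, stdinFd, event.getPointer());
    } finally {
        event.close(); // released immediately; nothing retained per process
    }
}
```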
For point 2, if we don't want to allocate an EpollEvent in each queueWrite call, I would revert your changes to registerProcess, so it uses the single EpollEvent, which would continue to be allocated during LinuxProcess creation, for both registration and writes, and I'd replace closeStdinEpollEvent with a more generic releaseNative or cleanup. This leads to no more allocations than the code currently has, while still giving callers a mechanism to "force" cleanup rather than waiting.
If we're going to add a new method, I don't think it's "correct" to add it just to LinuxProcess. Callers shouldn't be casting to the specific process classes; they should be using NuProcess. Based on that, I'd add whatever method we add for this to that interface. (It can't have a default implementation since we still support Java 7, but the existing classes can add no-op implementations.)
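For example (the method name is illustrative, echoing the releaseNative/cleanup naming above):

```java
public interface NuProcess {
    // ... existing methods ...

    /**
     * Releases any native resources still held for this process. Optional:
     * if never called, JNA reclaims the memory when its references are processed.
     */
    void releaseNative(); // no default method on Java 7; existing impls can be no-ops
}
```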
Personally, at this point, I'm inclined to go with "just" allocating EpollEvents in queueWrite and closing them, or just leaving the code as-is. Adding new "cleanup" handling through NuProcess feels like a really heavy solution to a "problem" I'm still not sure I understand. (Read: Optimizing for this feels like a bit of a waste of time, to me. JNA will free the native memory; it's just not on the sort of deterministic schedule you personally would like it to be on. At 12 bytes per EpollEvent, you'd need to have allocated tens of thousands of them for the memory usage to add up to anything significant--and you'd need JNA to have not freed any of them on top of that. In multiple years of working on Bitbucket Server with it using NuProcess under the hood, we never once experienced internally or received a report externally of an OOM error related to JNA's native memory handling--and that's with the 192KB of buffers not being explicitly freed on top of the 12 bytes for an EpollEvent, despite forking millions of git processes over the JVM's lifetime.)
@bturner If you are cool with allocating/freeing per queueWrite, this simplifies things greatly. I think JNA and also JNI do some small allocations per method call anyway, so it's probably not a big deal if we do as well.
I've updated the title to be less generic and updated the first comment of this PR.
I've updated the code to reflect the latest discussion.
@bturner I left the run use case alone. I could also add an explicit free for the run use case, but I would have to add a method, probably releaseNative, to the processor. I actually added this for Linux, but then it makes more sense to add it at the top level and implement it for all 3 OSes. So I reverted the change.
Leaving the run use case as-is. Allocate and free per method for register and queueWrite works for me. We can leave the freeing of the per-processor JNA fields alone, as they don't cause GC-related issues for the start use case; they are per classloader for the start use case. This works fine for me.
To reiterate the problem I am solving: in some GC setups, it might take hours between the GC cycles that trigger reference processing. In that time, a non-negligible amount of native memory accumulates, which causes native memory issues.
See: https://mail.openjdk.org/pipermail/zgc-dev/2023-August/001260.html