flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

replace libev with libuv

Open garlick opened this issue 1 year ago • 4 comments

Edit: the primary motivation for this is libuv's ability to be embedded in another event loop via uv_backend_fd(). If flux could make a backend fd available for its event loop, then it would be easier to use Flux in applications that already have an event loop, and to create language bindings that use a native event loop rather than requiring Flux's be used. See also: test-embed.c. Additional benefits may accrue such as streams and thread pool work scheduling.

After a discussion in our meeting last week, I did an experiment to convert flux-core to use libuv.

First, I did some refactoring to localize all the libev usage to reactor.c. This involved reimplementing the buffer watchers, handle watcher, and zeromq watcher using the flux API instead of libev. Then I began systematically converting reactor.c to libuv. Although, as discussed, libuv derives from liibev and thus many of its idioms carry over, I ran into the following problems:

  • no equivalent to ev_child
  • no equivalent to ev_periodic
  • ~~no equivalent to ev_stat~~ (oops, I missed uv_fs_event_t)
  • no equivalent to ev_set_priority()
  • no equivalent to ev_ref()
  • memory for handle (libuv name for watcher) objects cannot be freed in flux_watcher_destroy(). One calls uv_close() on the handle (registering a callback), the reactor runs, and someday you get a close callback, after which memory can be freed. Grr. Because Windows I guess.

It's a fair bit of work to reimplement the missing watchers though probably not a huge deal. I'm not sure what to do about the destructor issue though. It doesn't map well onto our API.

This is on my port_libuv branch if anyone wants to take a look. libflux compiles with the missing functionality. The reactor unit test runs successfully, but when run under valgrind ~~a bunch of invalid reads show up because watcher memory is freed too soon.~~ memory is leaked. There is also an unexplained segfault in the composite watcher test.

garlick avatar Dec 09 '24 06:12 garlick

I hadn't had time in a while to look at flux, but from my perspective this would "fix" #3969 since the two event-loops that were fighting were libuv and libev (with some Julia idiosyncracies on top).

vchuravy avatar Dec 10 '24 05:12 vchuravy

I did come across this just now, which lists some of the caveats to converting to libuv I mentioned above plus more. (I am not sure how up to date this is). Anyway, for reference

https://www.gevent.org/loop_impls.html#limitations-and-differences

garlick avatar Dec 13 '24 17:12 garlick

I pushed an update to my branch referenced above, rebased on current master after we dropped child watchers and added watcher references. The major bits missing/broken are now watcher priorities and periodic watchers. The code builds as normal with libev now, but supports a --with-libuv configure option so, for now, support for both reactors can coexist. This makes it a bit easier to develop since any non-uv changes can be easily tested with libev without shuffling commits around.

The segfault in test/composite_future.t mentioned above appears to be a use-after-free, where a future initialization frees data passed as an argument and then gets called again (with different reactor pointer - so maybe "now" vs "then" contexts) and dereferences it. Note to self: with some debug added, the test goes like this:

good (libev):

ok 82 - flux_future_push success
# future_timeout: alloc dptr=0x578a0ddc4710
ok 83 - flux_future_push timeout success
# timeout_init: r=0x578a0ddc4550 dptr=0x578a0ddc4710
# timeout_init: free dptr=0x578a0ddc4710
ok 84 - flux_future_then worked
ok 85 - flux_future_is_ready returns false
ok 86 - async: composite future is ready
ok 87 - async: retrieved handle child future
ok 88 - async: flux_future_get on child successful
ok 89 - async: retrieved handle to timeout future
ok 90 - async: timeout future fulfilled
ok 91 - flux_reactor_run returned
ok 92 - asynchronous callback called

bad (libuv)

ok 82 - flux_future_push success
# future_timeout: alloc dptr=0x62adeaf2ba60
ok 83 - flux_future_push timeout success
# timeout_init: r=0x62adeaf2e330 dptr=0x62adeaf2ba60
# timeout_init: free dptr=0x62adeaf2ba60
ok 84 - flux_future_then worked
ok 85 - flux_future_is_ready returns false
ok 86 - async: composite future is ready
ok 87 - async: retrieved handle child future
ok 88 - async: flux_future_get on child successful
ok 89 - async: retrieved handle to timeout future
# timeout_init: r=0x62adeaf2fa50 dptr=0x62adeaf2ba60
free(): double free detected in tcache 2
Aborted (core dumped)

Anyway wanted to provide a checkpoint since other tasks beckon.

garlick avatar Jan 10 '25 20:01 garlick

I don't recall seeing this when it was being done, but it was mentioned in one of our recent meetings so I came looking. There are at least some updates here, in that uv_ref/unref now exist at least, and the child handling could be handled one of a couple of ways depending on the requirements (either uv_process_t if we can, or uv_signal_t if not). The others I think are still at least different, and the close item is vaguely interesting, I would think we'd be able to tolerate that even though it is a bit annoying. It's probably safer for io_uring style setups as well as windows, but I'm not 100% sure about that.

trws avatar Jun 16 '25 11:06 trws