CCF icon indicating copy to clipboard operation
CCF copied to clipboard

Potential async cleanup problems

Open wintersteiger opened this issue 3 years ago • 5 comments

Follow-up to #3491/#3497. There are a few other parts of the code that look very suspicious to me. For instance:

https://github.com/microsoft/CCF/blob/affe30ff49ceb8f09ddf13ea1513d847601ad5b0/src/host/timer.h#L43

Can the timer object be deleted while a callback is in flight?

https://github.com/microsoft/CCF/blob/affe30ff49ceb8f09ddf13ea1513d847601ad5b0/src/host/ledger.h#L1140

Could this (especially this->to_enclave) be gone by the time this is executed?

https://github.com/microsoft/CCF/blob/affe30ff49ceb8f09ddf13ea1513d847601ad5b0/src/host/process_launcher.h#L96

Is there a way for the ProcessLauncher to get deleted before the callback runs?

CC @eddyashton @jumaffre

wintersteiger avatar Feb 02 '22 15:02 wintersteiger

Searching for ->data) gets me a few more similar pieces of code. src/http_parser.h feels like it might be possible to dereference an invalid parser->data there. src/host/signal.h is very similar to src/host/timer.h.

wintersteiger avatar Feb 02 '22 15:02 wintersteiger

@wintersteiger I believe most of these are safe, and handled by the uv proxy smart pointer. For the Timers for instance, there's a few derived instances constructed in host/main.cpp:

{
  asynchost::Ticker ticker(...);
  asynchost::TimeUpdater time_updater(...);
  asynchost::HandleRingbuffer handle_ringbuffer(...);
}

Those go out of scope on line 611. That only destroys the proxy_ptr/close_ptr, which calls uv_close on the uv_timer_t (indicating that future on_timer calls should not be sent), but doesn't actually delete the underlying object until uv calls with_uv_handle::on_close (which is libuv saying "ok, I agree not to send you any more callbacks, you can clean up your state now").

ProcessLauncher doesn't follow quite this pattern. Instead it's scoped to main, which means it should outlive any callbacks (due to the uv flushing loop at https://github.com/microsoft/CCF/blob/affe30ff49ceb8f09ddf13ea1513d847601ad5b0/src/host/main.cpp#L615-L623).

HTTPParser is inside the enclave, and isn't actually async - it's all executed within some parent call of parser.execute, so the parser will remain alive.

Signal is the same as Timer - a scoped instance within main.cpp, but cleaned up by a close_ptr after uv_close.

That leaves ledger.h. I'll take a look at that.

eddyashton avatar Feb 02 '22 15:02 eddyashton

Awesome, thanks for the quick analysis!

wintersteiger avatar Feb 02 '22 15:02 wintersteiger

I think the Ledger is also cleaned up sufficiently well at the end of main. There are a couple of references to it passed around (e.g. asynchost::NodeConnections), but I think those are safe too.

wintersteiger avatar Feb 02 '22 16:02 wintersteiger

After looking at this again, I think the use of uv_work_t in ledger.h is actually unsafe. The reading task, and the final callback on the main thread, could both happen after the Ledger instance has gone out-of-scope. This is largely fixed by moving the Ledger ledger instance in main.cpp out of the local scope, and keeping it alive for the lifetime of main (ie - constructing it next to ProcessLauncher process_launcher - as in this branch). This at least prevents it accessing a deleted Ledger. However, there's still the risk that these uv_work_ts are leaked - they're not cancelled, and in fact libuv offers no way to cancel them if they're mid-execution. Separately, there's concurrency risks from these accessing Ledger at all while the main thread may still be. I think the on_ledger_get_async worker should be re-written to be purely functional and independent of Ledger (ie - not using its caches at all). Ledger can own a collection of these workers, such that they're cancelled if Ledger is ever deleted, and each can avoid calling back in to Ledger if their complete callback is called with UV_ECANCELED. This resolves most of the issues, though maybe still offers a leak on shutdown if we continue to cap the flushing that libuv does.

Note that these are only issues on shutdown. They shouldn't pose any risk to a running node, but could be responsible for the UV issues we sometimes see on shutdown.

eddyashton avatar Jun 24 '22 14:06 eddyashton