forest icon indicating copy to clipboard operation
forest copied to clipboard

Fix task cancelation

Open elmattic opened this issue 2 years ago • 1 comments

Issue summary

In Forest we use in a couple of places the Task::spawn method to spin up new tasks.

Unfortunately for all of them expect the ones in daemon.rs we don’t implement their cancellation making it possible that orphaned child tasks outlive parent tasks. Any value moved to a future/ async block will never be dropped by the task unless the task has cleanly terminated or we are calling explicitly the cancel method on the task handle.

This can result in memory leaks over the lifetime of the client and also hindering us to shutdown gracefully the node (rocksdb Rust wrapper do not always manage to drop itself).

Therefore we should:

  • [ ] Use the new heaptrack doc to create a report on current memory usage/ leaks (https://github.com/ChainSafe/forest/pull/1819)
  • [ ] Conduct any research that could solve the issue, ie:
    • Tasks “structured concurrency"
    • Use arbortable futures
    • Other solution?
  • [ ] Validate that forest can now be gracefully shutdown (https://github.com/ChainSafe/forest/issues/1587)
  • [ ] Perform a second heaptrack analyse and observe if some of the leaks are gone (https://github.com/ChainSafe/forest/issues/1546)

Other information and links

https://docs.rs/async-std/latest/async_std/task/index.html# https://docs.rs/futures/latest/futures/future/struct.Abortable.html https://blog.yoshuawuyts.com/async-cancellation-1 (for “structured concurrency”)

Relevant code where we use spawn without canceling

  • chain_muxer
  • tipset_syncer
  • https://github.com/ChainSafe/forest/blob/main/blockchain/consensus/fil_cns/src/validation.rs#L125
  • https://github.com/ChainSafe/forest/blob/main/node/forest_libp2p/src/service.rs#L260
  • state_manager
  • https://github.com/ChainSafe/forest/blob/main/node/forest_libp2p/src/service.rs#L260

elmattic avatar Aug 22 '22 13:08 elmattic

Could you add links to where we use spawn without canceling or joining the task?

lemmih avatar Aug 22 '22 14:08 lemmih

I've been testing in a standalone sample the futures::future::Abortable type and it looks like it could solve our issue:

  • When the AbortHandle is triggered this drops properly all the values moved to the future.
  • It's runtime independent (async-std, tokio) since it works with futures.
  • Could be used to abort long running futures as well, like this one.

What we would need is some registry (could be part of the state_manager) that can collect for each spawn future its associated handle. When SIGINT is catched, we just call abort() on all handles.

elmattic avatar Sep 26 '22 14:09 elmattic

Note that we should move to tokio tasks and in this case using JoinSet for handling proper task cancelation (as mentionned here).

Once done I think this issue can be repurposed.

elmattic avatar Oct 03 '22 12:10 elmattic