datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Non cancel-safe tokio spawns

Open DDtKey opened this issue 1 year ago • 4 comments

Describe the bug

I see a lot of cases of using tokio::spawn without wrapping JoinHandles correctly. So if we stop execution (e.g tokio::select!) - we can find some tasks that are still alive. We can even see panic after canceling an execution (at least I did)

Examples:

To Reproduce

Easiest way, is to create a long-running query which probably may even fail, and track the tokio tasks with tokio console after cancelling the execution. I.e wrap it to select or timeout and check that spawned tasks are still working after cancelling by another branch.

Expected behavior

I believe execution should be cancel-safe, and I think we need to care more about spawned tasks. We already had such issues before, e.g: https://github.com/apache/arrow-datafusion/issues/5178

But I think even AbortOnDropMany doesn't solve all edge-cases, for example there are places when we do return Vec<JoinHandle> or just JoinHandle, but actually the execution can be canceled before wrapping into AbortOnDropMany (if there are any async calls in between). So I'd say https://github.com/apache/arrow-datafusion/issues/6513 is more relevant now, I guess even single task should be spawned with JoinSet

Additional context

How can we protect of introducing this? I think we need to disallow tokio::spawn or pay attention to any spawn during review at least

DDtKey avatar Feb 22 '24 21:02 DDtKey

take

DDtKey avatar Feb 22 '24 22:02 DDtKey

Thank you for flagging and working on this. I'm happy to help if you run into any issues moving to JoinSet as I wrote most of these instances of tokio::spawn.

Another potential cancelation safety issue to consider in the referenced code is use of https://docs.rs/tokio/latest/tokio/io/trait.AsyncWriteExt.html#method.write_all which is not cancel safe.

I don't think the consequences will be too bad canceling in the middle of a multipart upload since I think the ObjectStore itself should provide some protection.

devinjdangelo avatar Feb 22 '24 23:02 devinjdangelo

It works now in #9318 For ReparitionExec I had to wrap each task into JoinSet and collect them to Vec<> to guarantee the order

We still can refactor this somehow later, but I think as a first step it's ok. At least we will have guarantees

DDtKey avatar Feb 22 '24 23:02 DDtKey

Another potential cancelation safety issue to consider in the referenced code is use of https://docs.rs/tokio/latest/tokio/io/trait.AsyncWriteExt.html#method.write_all which is not cancel safe.

I don't think the consequences will be too bad canceling in the middle of a multipart upload since I think the ObjectStore itself should provide some protection.

Yeah, sounds reasonable to me

DDtKey avatar Feb 23 '24 00:02 DDtKey