librustzcash
librustzcash copied to clipboard
introduce ability to cancel `scan_cached_blocks`
pub fn scan_cached_blocks<ParamsT, DbT, BlockSourceT>(
params: &ParamsT,
block_source: &BlockSourceT,
data_db: &mut DbT,
limit: Option<u32>,
) -> Result<(), Error<DbT::Error, BlockSourceT::Error, DbT::NoteRef>>
Problem:
scanning a full block can take many seconds on mobile (some times up to 30 seconds) and can't be stopped by callers if needed.
Mobile applications must be able to cancel ongoing work when the operating system requests to or be terminated (SIG_ABORT or SIG_KILL sent from OS). This has many side effects.
When Running background tasks
Comment from Michal Fousek:
I am working on those long running background tasks. And I made it work. I am just not sure if it’s usable in our case. And maybe it would be better to suggest to developers to not do this. Problem is that execution of background tasks is up to system completely. You don’t know when and if the task will run. And it runs just for a few minutes. Which is not a much time. And when it expires there isn’t much time to stop the work. And stopping of SDK can take lot of time sometimes. Like when SDK is scanning block we are not able to interrupt that scanning. I am not sure if the few minutes in background is worth the troubles that can be caused by unexpected states.
Stopping the scanning process at will
Comment from Michal Fousek:
And about the short and simple background task - https://github.com/zcash/ZcashLightClientKit/pull/626. In that PR I implemented just stopping of the SDK when app goes to background. Here you mentioned that scan of block can take up to 24 seconds. Now batch size is 5 above 1_600_000 block height. Which means that one scan call in rust can take up to 120 seconds before we are able to interrupt it. iOS won’t probably give us that long for a background task. And SDK isn’t currently great at resuming of sync process. So the question I am asking if we shouldn’t do any background work at all at this point. And simply let app suspend. It will be easier to implement for developers. And maybe it will result in less troubles.
Proposed solution
Make scan_cached_blocks
cancellable through the FFI. So that callers can halt scanning work on the spot and there's no more delays than a few milliseconds.
Part of the intent of the DAGSync design is to enable precisely this kind of cancellation logic. Futures do nothing unless polled, so we can control the entire DAGSync engine by controlling its runtime; shut down the runtime, and everything else stops. The mobile SDK would call some Rust function, that starts its own thread for the runtime and returns a handle to it. The handle would then let the mobile SDK stop the reactor at any time, and any Rust-created background worker threads doing e.g. trial decryption can just be abandoned (and they will either finish and drop their results, or get killed).
scan_cached_blocks
is not designed to be cancellable: it is a synchronous function intended to be run in a blocking thread. The introduction of a limit
parameter enabled its runtime to be bounded, but as your comments above indicate, limit
is not linear in wall-clock time.
Now, we can potentially modify scan_cached_blocks
to enable something similar to what we plan to do with DAGSync, but it would require scan_cached_blocks
to be used very differently by mobile SDKs (to the point where it is almost a different function). Currently the workflow is the following in a single thread (Android code, iOS code):
- While there are blocks left to scan:
- Call
scan_cached_blocks
to scan up tolimit
blocks. - Update the UI with progress.
- Call
With cancellation, mobile SDKs would instead need to do the following multi-threaded process:
- While there are blocks left to scan:
- Call some new Rust API to get a "cancellation channel" (
send/recv
pair). - Create a new thread that:
- Calls
scan_cached_blocks
with therecv
side of the channel andlimit
. - Notifies the parent thread that it is complete.
- Calls
- Wait for either the child thread to notify that it is complete, or something to happen that requires cancellation.
- If cancellation is required, call a new Rust API with the
send
side of the channel to invoke cancellation. - Wait for the child thread to finish.
- Update the UI with progress or in response to cancellation.
- Call some new Rust API to get a "cancellation channel" (
All of the above is just the new logic that would need to be implemented in both the Android and iOS mobile SDKs. The two new Rust APIs are relatively straightforward, but we would additionally need to thread the cancellation logic through every point in the scanning code where we want to be able to abort. And even then, there will be some boundary that we cannot cancel through (as there are minimum units of work being done).
Given all of the above, I think that instead of introducing the ability to cancel scan_cached_blocks
, it would make more sense to move the mobile SDKs towards the scanning interface that DAGSync will expose, and have that interface internally call scan_cached_blocks
with the above logic. This would move the multi-threaded process out of Android / iOS and into Rust (halving the amount of implementation work), and reduce the amount of work needed to move from this to full DAGSync (more precisely, it is front-loading some of that mobile SDK work in a way that can also be used for DAGSync).
In any case, this issue does imply doing work that would be thrown away once we have DAGSync, but as long as it isn't too much work then the trade-offs probably make sense (in that it helps the mobile UX in the short term).
I'm curious how @str4d's response applies today, a year later. I think DAGSync has been implemented by now, and I'm also interested in cancellation.