Acknowledge when ZIM is never going to arrive
When, for some reason, the Zimfarm fails to notify WP1 that the ZIM is ready, the selection can never be ZIMed again: the system seems to wait forever. It is impossible to cancel the ZIM request and impossible to request it again; we are indefinitely presented with the spinner.
I suspect this is linked to a recent change.
I feel like continuously polling the Zimfarm to update task status would be a bad idea, but checking task status on the Zimfarm once in a while to recover seems pretty much mandatory. We could, for instance, check the status every time a user refreshes their "My selection" screen, meaning no requests when users don't care, but fresh data when they do. This is what we do on zimit-frontend (zimit.kiwix.org).
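To make the "check on refresh" idea concrete, here is a minimal sketch, not WP1's actual code. The `ZimRequest` class, the status names, and the `fetch_status` callable (standing in for whatever call asks the Zimfarm for a task's status) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical status values; the real WP1 / Zimfarm names may differ.
PENDING = "REQUESTED"
TERMINAL = {"succeeded", "failed", "canceled"}


@dataclass
class ZimRequest:
    selection_id: str
    task_id: str
    status: str = PENDING


def refresh_on_view(
    requests: Iterable[ZimRequest],
    fetch_status: Callable[[str], Optional[str]],
) -> int:
    """Re-check only the requests still waiting for a webhook.

    Meant to run when the user loads the selections page.
    Returns the number of requests whose status changed.
    """
    changed = 0
    for req in requests:
        if req.status != PENDING:
            continue  # already settled: no call to the Zimfarm needed
        remote = fetch_status(req.task_id)  # assumed to return None if unknown
        if remote in TERMINAL:
            req.status = remote
            changed += 1
    return changed
```

Only pending requests trigger a call, so a user with nothing in flight generates no Zimfarm traffic at all.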
The way Zimfarm informs WP1 that a ZIM is ready is an HTTP webhook, without any guarantee of delivery. It will fail to be delivered from time to time, due to the nature of networks and systems, also known as "chaos". We need a system capable of recovering from this.
Yes, and reading this issue makes me think the hook management in Zimfarm is not robust enough either!?
"reading this issue makes me think the hook management in Zimfarm is not robust enough either!?"
I think Zimfarm hook management follows the "state of the art" (i.e. what most players are doing nowadays); feel free to open an issue in the Zimfarm repo explaining why you consider this is not the case and why it is worth investing effort in this, if you are not aligned.
@benoit74 Not having the hook successfully acknowledged by WP1 or any hooked third-party system (at least an HTTP 200) should lead to a Zimfarm failure (or a strong warning/trace which could be tracked and investigated later). Do we have this?
I think there is an intrinsic "not a reliable messaging system" problem with HTTP. Zimfarm can send me a POST, I can send an ACK, but how do I know that Zimfarm got the ACK? So Zimfarm sends an ACK-ACK, but how does it know I got that, etc.
To be clear, WP1 does send a 204 when it has successfully processed the webhook: https://github.com/openzim/wp1/blob/main/wp1/web/builders.py#L268
I feel like it is in the nature of webhooks to be "fire and forget".
At least I don't know of any SaaS providing webhooks that has an acknowledgement system.
We do have logs when notification fails on Zimfarm side (any failure like a bad HTTP response code, no response at all, ...):
We even have a tile in the dashboard we use in weekly infra routine:
But nobody can realistically monitor this and recover from it manually. The usual pattern is that systems use webhooks to avoid polling, but every now and then a background job does a kind of "reconciliation" to recover from potentially missed webhooks.
It is important to note that, in general, losing a webhook call almost never happens. But having something automated in place to recover when it does happen is important. It could be a daily job, it could be a simple cancellation of the ZIM creation (not trying to guess what happened in Zimfarm), but at least WP1 should not be stuck forever waiting for the webhook.
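The "simple cancellation" variant could look roughly like the sketch below, which a daily job might call. It is only an illustration: the `PendingZim` class, the `REQUESTED` / `FAILED` status names, and the 24-hour default are assumptions, not values taken from WP1 or the Zimfarm.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class PendingZim:
    selection_id: str
    requested_at: datetime
    status: str = "REQUESTED"  # hypothetical status name


def expire_stale(
    pending: Iterable[PendingZim],
    now: datetime,
    max_wait: timedelta = timedelta(hours=24),  # assumed threshold
) -> List[str]:
    """Give up on requests that waited longer than max_wait.

    Does not try to guess what happened on the Zimfarm side: it only
    moves the request out of the waiting state so the user can retry.
    Returns the ids of the selections that were expired.
    """
    expired = []
    for item in pending:
        if item.status == "REQUESTED" and now - item.requested_at > max_wait:
            item.status = "FAILED"
            expired.append(item.selection_id)
    return expired
```

Passing `now` in explicitly keeps the function easy to test and leaves the choice of scheduler (cron, a worker queue, or something else) open.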
In zimit-frontend, for instance, when a webhook is lost we do not send the final email to the user (it is lost "forever"), but the status in the web UI is always accurate. It was deemed an acceptable compromise.
Totally agree. Back to the OP issue, I agree, we can check the Zimfarm status directly once when the user visits the ZIM page.