wp1

Acknowledge when ZIM is never going to arrive

Open benoit74 opened this issue 3 months ago • 7 comments

When, for some reason, the Zimfarm fails to notify WP1 that the ZIM is ready, it becomes impossible to request a ZIM for the selection again: the system seems to wait forever. The ZIM request can be neither cancelled nor submitted again, and the spinner is shown indefinitely.

I suspect this is linked to a recent change.

I feel like continuously polling the Zimfarm to update task status would be a bad idea, but checking task status on the Zimfarm once in a while to recover seems all but mandatory. We could, for instance, check the status every time the user refreshes their "My selection" screen: no requests when users don't care, but fresh data when they do. This is what we do on zimit-frontend (zimit.kiwix.org).

The Zimfarm informs WP1 that a ZIM is ready through an HTTP webhook, without any guarantee of delivery. Delivery will fail from time to time, due to the nature of networks and systems, also known as "chaos". We need a system capable of recovering from this.
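The on-demand check suggested above could look roughly like the following. This is only a minimal sketch, not WP1's actual code: `ZimRequest`, `refresh_on_view`, the status strings, and the injected `fetch_status` callable (standing in for whatever call asks the Zimfarm about a task) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical status names; the real WP1 and Zimfarm values may differ.
REQUESTED, FILE_READY, FAILED = "REQUESTED", "FILE_READY", "FAILED"


@dataclass
class ZimRequest:
    task_id: str
    status: str = REQUESTED


def refresh_on_view(
    req: ZimRequest, fetch_status: Callable[[str], Optional[str]]
) -> ZimRequest:
    """Re-check a pending request when the user opens the selections page."""
    if req.status != REQUESTED:
        return req  # Already settled: no call to the Zimfarm is made.
    remote = fetch_status(req.task_id)  # None means the Zimfarm was unreachable.
    if remote == "succeeded":
        req.status = FILE_READY
    elif remote in ("failed", "canceled"):
        req.status = FAILED
    return req  # Any other answer leaves the request pending.
```

Because the check only runs for requests that are still pending, and only when a user actually loads the page, it adds no background load on the Zimfarm.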

benoit74 avatar Sep 12 '25 08:09 benoit74

Yes, and reading this issue makes me think the hook management in Zimfarm is not robust enough either!?

kelson42 avatar Sep 12 '25 08:09 kelson42

reading this issue makes me think the hook management in Zimfarm is not robust enough either!?

I think Zimfarm hook management follows the "state of the art" (what most players are doing nowadays); if you disagree, feel free to open an issue in the Zimfarm repo explaining why you consider this is not the case and why it is worth investing effort in it.

benoit74 avatar Sep 12 '25 09:09 benoit74

@benoit74 Not having the hook successfully acknowledged by WP1 or any other hooked third-party system (at least an HTTP 200) should lead to a Zimfarm failure (or a strong warning/trace that could be tracked and investigated later). Do we have this?

kelson42 avatar Sep 12 '25 09:09 kelson42

I think there is an intrinsic "not a reliable messaging system" problem with HTTP. Zimfarm can send me a POST, I can send an ACK, but how do I know that Zimfarm got the ACK? So Zimfarm sends an ACK-ACK, but how does it know I got that, etc.

audiodude avatar Sep 24 '25 17:09 audiodude

To be clear, WP1 does send a 204 when it has successfully processed the webhook: https://github.com/openzim/wp1/blob/main/wp1/web/builders.py#L268

audiodude avatar Sep 24 '25 17:09 audiodude

I feel like it is in the nature of webhooks to be "fire and forget".

At least, I don't know of any SaaS offering webhooks that has an acknowledgement system.

We do have logs when a notification fails on the Zimfarm side (any failure, such as a bad HTTP response code, no response at all, ...):

Image

We even have a tile in the dashboard we use in weekly infra routine:

Image

But nobody can realistically monitor this and recover from it manually. The usual pattern is that systems use webhooks to avoid polling, but every now and then a background job performs a kind of "reconciliation" to recover from potentially missed webhooks.

It is important to note that losing a webhook call almost never happens. But having something automated in place to recover when it does happen is important. It could be a daily job, or it could be a simple cancellation of the ZIM creation (without trying to guess what happened in the Zimfarm), but at the very least WP1 should not be stuck forever waiting for the webhook.
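As a rough illustration of such a reconciliation pass, here is a sketch combining both options: ask the Zimfarm about each pending request, and give up on requests that have waited too long without an answer. Everything in it is an assumption made for the example: `PendingZim`, `reconcile`, the status strings, the injected `fetch_status` callable, and the two-day default for `max_wait`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class PendingZim:
    task_id: str
    requested_at: datetime
    status: str = "REQUESTED"  # Hypothetical status name.


def reconcile(
    pending: Iterable[PendingZim],
    fetch_status: Callable[[str], Optional[str]],
    now: datetime,
    max_wait: timedelta = timedelta(days=2),  # Assumed threshold, not from the thread.
) -> List[str]:
    """Settle requests whose webhook may have been lost; return changed task ids."""
    changed = []
    for req in pending:
        if req.status != "REQUESTED":
            continue  # Only requests still waiting for a webhook are examined.
        remote = fetch_status(req.task_id)  # None means no usable answer.
        if remote == "succeeded":
            req.status = "FILE_READY"
        elif remote in ("failed", "canceled") or now - req.requested_at > max_wait:
            # Known failure, or waited too long: stop showing the spinner.
            req.status = "FAILED"
        else:
            continue  # Still plausibly in progress; check again next run.
        changed.append(req.task_id)
    return changed
```

Run from a daily job, a pass like this bounds how long a request can stay stuck, while the webhook remains the normal, fast path.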

In zimit-frontend, for instance, when a webhook is lost we do not send the final email to the user (it is lost "forever"), but the status in the web UI is always accurate. This was deemed an acceptable compromise.

benoit74 avatar Sep 25 '25 08:09 benoit74

Totally agree. Back to the original issue: we can check the Zimfarm status directly, once, when the user visits the ZIM page.

audiodude avatar Sep 25 '25 15:09 audiodude