HQ seencheck panics from query timeout

Open willmhowes opened this issue 1 month ago • 2 comments

Problem

If the seencheck in the preprocessor returns an error of any kind, Zeno panics:

https://github.com/internetarchive/Zeno/blob/2abbcd32abdf62cab1eb0031b27f9b13c48c41c7/internal/pkg/preprocessor/preprocessor.go#L140-L142

This behavior makes sense as a guarantee that fundamental problems are not occurring during seencheck. However, Zeno will also panic and crash if the HQ seencheck request times out, which can happen when HQ is operating with degraded performance. Because HQ running slower does not threaten the validity of the data generated by Zeno, there should be a way to avoid panicking on this specific error so that crawling can continue.

Solution

A couple of ideas:

  1. changing the default behavior to retry if there is a timeout
  2. increasing the timeout value (or allowing it to be set as a runtime flag)
  3. logging the timeout but not returning an error, to prevent the panic (this could be enabled/disabled via a runtime flag)

Alongside any of the above solutions, there should be improved Prometheus reporting on the number of seencheck attempts that fail, are retried, or exceed a certain duration threshold.

willmhowes avatar Nov 03 '25 20:11 willmhowes

I think that 2 is the safest (as seencheck is a very critical function of Zeno), but all of the above require some additional investigation on our end to see why seencheck is slow.

NGTmeaty avatar Nov 03 '25 20:11 NGTmeaty

We would probably benefit internally from implementing 2 or 3 as an optional runtime behavior (maybe just in a branch) alongside Prometheus reporting, so we can see how prevalent the issue is.

willmhowes avatar Nov 03 '25 21:11 willmhowes