HQ seencheck panics from query timeout
Problem
If the seencheck in the preprocessor returns an error of any kind, Zeno panics:
https://github.com/internetarchive/Zeno/blob/2abbcd32abdf62cab1eb0031b27f9b13c48c41c7/internal/pkg/preprocessor/preprocessor.go#L140-L142
This behavior makes sense as a guarantee that fundamental problems are not occurring during seencheck. However, Zeno will also panic and crash if the HQ seencheck request times out, which can happen when HQ is operating with degraded performance. Because HQ running slower does not threaten the validity of the data generated by Zeno, there should be a way to avoid panicking on this specific error so that crawling can continue.
Solution
A couple of ideas:
1. changing the default behavior to retry if there is a timeout
2. increasing the timeout value (or allowing it to be set as a runtime flag)
3. logging the timeout but not returning an error, to prevent the panic (this could be enabled/disabled via a runtime flag)
Alongside the above solutions, there should be improved Prometheus reporting on the number of seencheck attempts that fail, are retried, or exceed a certain duration threshold.
I think option 2 (increasing the timeout) is the safest, as seencheck is a critical function of Zeno, but all of the above require additional investigation on our end to see why seencheck is slow.
We would probably benefit internally from implementing option 2 or 3 (a longer timeout, or logging the timeout without panicking) as an optional runtime behavior (maybe just in a branch) alongside Prometheus reporting, so we can see how prevalent the issue is.