ci: models: avoid polling jobs waiting more than a week on the backend
SQUAD has no way to tell whether a TestJob has been worked on by its backend. It might be that the device is out, or that the backend is undergoing an unusually long maintenance window. Over time, jobs in this scenario start clogging up the fetch queue, delaying the fetching of other jobs.
I hardcoded the cutoff at one week, since that matches the behavior I've usually observed, but this could be exposed as a backend setting if requested.
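Roughly, the check I have in mind looks like the sketch below; the helper name and the `created_at` field are illustrative assumptions, not the actual SQUAD model.

```python
from datetime import timedelta
from django.utils import timezone

# Hardcoded one-week cutoff; could become a backend setting if requested.
MAX_BACKEND_WAIT = timedelta(days=7)

def should_keep_polling(test_job):
    """Stop polling jobs the backend has been sitting on for over a week.

    `created_at` is an illustrative field name, not SQUAD's real model field.
    """
    return timezone.now() - test_job.created_at <= MAX_BACKEND_WAIT
```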
@mrchapp There are a few jobs in NXP that have been waiting over a week in their LAVA instance (like this one: https://lavalab.nxp.com/scheduler/job/744848), and this PR acts exactly on this kind of job. Especially on NXP, there are old hanging jobs that take ~10 seconds to get a response from the LAVA instance.
I'm thinking that we'll eventually want those lagged results, even if only for data-mining purposes.
Can we ping the LAVA server first and decide, based on that, whether a round of fetching should be initiated? I guess what we want to avoid is the continual timeouts from an unresponsive server.
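Something like a cheap reachability check, for example; the endpoint, timeout, and function name below are just an illustration, not a specific LAVA API proposal.

```python
import requests

def backend_is_responsive(base_url, timeout=5):
    """Lightweight reachability check before kicking off a fetch round.

    Any inexpensive request with a short timeout would do; the goal is to
    skip the round entirely when the server isn't answering.
    """
    try:
        response = requests.head(base_url, timeout=timeout)
        return response.status_code < 500
    except requests.RequestException:
        return False
```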
You have a good point. I think I will revisit the LAVA/SQUAD signals and have LAVA tell SQUAD when a job is ready for fetching. Sometimes jobs have the Submitted status, but sometimes that's not the case.
I don't fully understand why.
By default, SQUAD attempts to fetch jobs regardless of whatever signal LAVA sent about the job.
One solution is to have SQUAD avoid polling jobs with status="Submitted" (like this NXP job). Then, whenever LAVA signals SQUAD that the job is ready, it will be queued and fetched. The downside is that if the LAVA lab fails to notify SQUAD, the job will never be fetched.
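A rough sketch of that approach is below; the `job_status` field and the queue are illustrative, not SQUAD's actual code.

```python
from queue import Queue

# Illustrative stand-in for SQUAD's real fetch queue.
fetch_queue = Queue()

def should_poll(test_job):
    # Skip polling while LAVA still reports the job as Submitted;
    # wait for LAVA's notification to queue it instead.
    return test_job.job_status != 'Submitted'

def on_lava_ready_notification(test_job):
    # When LAVA signals the job is ready, queue it for fetching.
    # If the lab never sends this notification, the job is never fetched,
    # which is the downside mentioned above.
    fetch_queue.put(test_job)
```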