spidermon icon indicating copy to clipboard operation
spidermon copied to clipboard

issue: Wrong previous jobs count in ZyteJobsComparisonMonitor

Open curita opened this issue 1 year ago • 0 comments

Background

The method to get the previous jobs used in ZyteJobsComparisonMonitor can be seen here:

https://github.com/scrapinghub/spidermon/blob/a7e195f951a4b9a4837943dff0b2ae727b76f237/spidermon/contrib/scrapy/monitors/monitors.py#L557-L579

where number_of_jobs's value is defined by the SPIDERMON_JOBS_COMPARISON setting.

Issue

Supposedly, we should get the SPIDERMON_JOBS_COMPARISON number of previous jobs to compute the average, as stated in the docs, but we might get more because of how those jobs are iterated.

Debugging

  • The first _jobs = client.spider.jobs.list(..., count=number_of_jobs) call will get number_of_jobs at most, as it is expected.
  • The function then keeps iterating, skipping 1000 jobs ahead (start+= 1000) to fetch #number_of_jobs more.
    • That 1000 number might come from the fact that client.spider.jobs.list() doesn't allow the return of more than 1000 jobs, so the intention might have been to paginate the results to get everything. But because we only want #number_of_jobs (which should be less than 1000, otherwise the function breaks), then this causes us to fetch more jobs than requested instead.
  • It does so until there are no more jobs available to list

For example, if some spider has 3500 jobs that meet the requested criteria and SPIDERMON_JOBS_COMPARISON is set to 10, then this will happen:

  • Round #0: len(_jobs) == 10 (client.spider.jobs.list(start=0, count=10))
  • Round #1: len(_jobs) == 20 (client.spider.jobs.list(start=1000, count=10))
  • Round #2: len(_jobs) == 30 (client.spider.jobs.list(start=2000, count=10))
  • Round #3: len(_jobs) == 40 (client.spider.jobs.list(start=3000, count=10))
  • Round #5: stop (client.spider.jobs.list(start=4000, count=10))

Resulting in 40 previous jobs, when only ten were requested.

Expected Behaviour

Only SPIDERMON_JOBS_COMPARISON number of jobs should be fetched as previous jobs, and those should be the latest ones that meet the requested criteria.

curita avatar Apr 17 '24 20:04 curita