limit scanning reqs/second

Open jimpriest opened this issue 9 years ago • 4 comments

We run Linkchecker daily, and 99% of the time it behaves, but on occasion it seems to run amok and scan a lot of links in a short amount of time. Not sure why this occurs; none of my settings change (it runs via Jenkins).

I was thinking of adding something like wget's '--wait' flag to limit the rate of requests. Any thoughts on where the best place to do this would be?

I will take a stab at it and submit a pull-request when complete.

Thanks! jim

jimpriest avatar Sep 15 '16 13:09 jimpriest

Hi Jim,

What kind of workers are you using (process / thread / green threads), and how many? The only time I observed pylinkvalidator scan many links quickly was when the links were quickly returning a bad response (e.g., 404).

A wait flag would definitely make sense. I'll check tonight where it would work best and post it here.

bartdag avatar Sep 15 '16 14:09 bartdag

--workers=2 --timeout=20 --format=csv --mode=process --parser=lxml

We did have someone publish a bad link, which resulted in an unusually large number of 404s.

I appreciate the insight!! I'll poke around the code this afternoon as well.

jimpriest avatar Sep 15 '16 15:09 jimpriest

Hi Jim, here are my notes about the wait flag

  1. I think the wait flag should represent the minimum time each worker should wait before making a request: the number of workers will control the concurrency.
  2. The flag should be first added to the command line options
  3. The flag should then be added to WorkerConfig which is sent to every worker so it can configure itself.
  4. Each worker eventually initializes a PageCrawler (one instance per worker). The page crawler should have a timestamp, e.g., last_fetch_timestamp
  5. Before opening a URL, the page crawler should check whether it should sleep (if now - last_fetch_timestamp < wait_time)
  6. Finally, I would add a test with a wait time, say 250 ms or 500 ms, and check that the test execution took at least X ms (250 ms * number of pages crawled). Example of a test that could be copied and modified.
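
A rough sketch of steps 4 and 5, for illustration only. The class name FetchThrottle, the method wait_if_needed, and the injectable clock/sleep parameters are hypothetical and are not part of pylinkvalidator; only last_fetch_timestamp and wait_time come from the notes above. It assumes the wait time is expressed in seconds and that one instance lives inside each worker's page crawler.

```python
import time


class FetchThrottle(object):
    """Hypothetical helper: enforces a minimum delay between two fetches
    made by the same worker. Not part of pylinkvalidator."""

    def __init__(self, wait_time, clock=time.monotonic, sleep=time.sleep):
        self.wait_time = wait_time  # minimum seconds between fetches, e.g. 0.25
        self.clock = clock          # injectable so tests do not really sleep
        self.sleep = sleep
        self.last_fetch_timestamp = None  # no fetch has happened yet

    def wait_if_needed(self):
        """Call right before opening a URL.

        Sleeps for whatever remains of wait_time since the previous fetch,
        records the new fetch time, and returns the seconds slept.
        """
        slept = 0.0
        if self.last_fetch_timestamp is not None:
            elapsed = self.clock() - self.last_fetch_timestamp
            if elapsed < self.wait_time:
                slept = self.wait_time - elapsed
                self.sleep(slept)
        self.last_fetch_timestamp = self.clock()
        return slept
```

With this sketch the first fetch of a worker never waits, so a timing test along the lines of step 6 with a single worker would expect roughly wait_time * (pages crawled - 1) as the lower bound rather than wait_time * pages crawled. Sleeping before the first fetch as well would restore the simpler bound.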

bartdag avatar Sep 16 '16 00:09 bartdag

Thanks so much for the detailed response! I came up with similar steps. Will see if I can find some time this weekend to hack on some code :)

jimpriest avatar Sep 16 '16 12:09 jimpriest