pylinkvalidator
limit scanning reqs/second
We run Linkchecker daily and 99% of the time it behaves, but on occasion it seems to run amok and scan a lot of links in a short amount of time. Not sure why this occurs; none of my settings change (run via Jenkins).
I was thinking of adding something like wget's '--wait' flag to limit the requests made. Any thoughts on where the best place to do this would be?
I will take a stab at it and submit a pull-request when complete.
Thanks! jim
Hi Jim,
what kind of workers are you using (process / thread / green threads) and how many? The only time I observed pylinkvalidator scan many links quickly was when the links were quickly returning a bad response (e.g., 404).
A wait flag would definitely make sense. I'll check tonight where it would work best and post it here.
--workers=2 --timeout=20 --format=csv --mode=process --parser=lxml
We did have someone publish a bad link, which resulted in an unusually large # of 404s.
I appreciate the insight!! I'll poke around the code this afternoon as well.
Hi Jim, here are my notes about the wait flag
- I think the wait flag should represent the minimum time each worker should wait before making a request: the number of workers will control the concurrency.
- The flag should be first added to the command line options
- The flag should then be added to WorkerConfig which is sent to every worker so it can configure itself.
- Each worker eventually initializes a PageCrawler (one instance per worker). The page crawler should have a timestamp, e.g., last_fetch_timestamp
- Before opening a URL, the page crawler should check whether it should sleep (if now - last_fetch_timestamp < wait_time)
- Finally, I would add a test with a wait time, say 250 ms or 500 ms, and check that the test execution took at least X ms (250 ms * number of pages crawled). Example of a test that could be copied and modified.
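
A minimal sketch of the per-worker throttle described in these notes. This is not pylinkvalidator's actual code: the class name `ThrottledFetcher`, the `opener` callable, and the injectable `clock`/`sleep` parameters are all hypothetical and only illustrate the idea. Only `last_fetch_timestamp` and the wait time come from the notes above.

```python
import time


class ThrottledFetcher:
    """Hypothetical stand-in for a per-worker page crawler.

    Enforces a minimum delay (wait_time, in seconds) between two
    consecutive fetches made by the same worker.
    """

    def __init__(self, wait_time, clock=time.monotonic, sleep=time.sleep):
        self.wait_time = wait_time
        # None means this worker has not fetched anything yet.
        self.last_fetch_timestamp = None
        # clock and sleep are injectable so a test can fake the time.
        self._clock = clock
        self._sleep = sleep

    def wait_if_needed(self):
        """Sleep only if the previous fetch was less than wait_time ago."""
        if self.last_fetch_timestamp is not None:
            elapsed = self._clock() - self.last_fetch_timestamp
            remaining = self.wait_time - elapsed
            if remaining > 0:
                self._sleep(remaining)
        # Record the moment the next request is about to start.
        self.last_fetch_timestamp = self._clock()

    def fetch(self, url, opener):
        """Throttle, then delegate to opener (any callable taking a URL)."""
        self.wait_if_needed()
        return opener(url)
```

With this shape, a single worker crawling N pages waits at most N - 1 times, since the first fetch never sleeps, so a timing test would probably need to account for that and for the number of workers when choosing its lower bound.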
Thanks so much for the detailed response! I came up with similar steps. Will see if I can find some time this weekend to hack on some code :)