Changed from a per-level queue to a persistent queue, added a MaxPages configuration option, and tweaked the final output
This change moves the concept of depth onto the currently active URL, to enable continuous crawling. Previously, crawl speed was limited by the slowest URL within the current depth; for example, a URL that timed out after 30 seconds would block the rest of that depth.
This may have been a deliberate design choice, as there are some clear downsides to the new approach: potential queue bloat and higher RAM usage, plus a possible race condition in the accuracy of depth, depending on how quickly URLs are processed. On the flip side, it dramatically improves performance on websites with slow pages.
Overview of changes:
- The queue uses a struct that tracks the depth of each queued URL
- Concept of nextqueue removed
- The crawler is no longer aware of the current depth; instead, each URL tracks its own depth, and when new links are merged in, this depth is incremented for the newly queued URLs (see the sketch below)
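
To make the new queue shape concrete, here is a minimal sketch of what the depth-aware queue item and merge step could look like. This assumes Go, and the names (`queueItem`, `Crawler`, `merge`, `maxDepth`) are illustrative, not the project's actual identifiers.

```go
package crawler

import "net/url"

// queueItem pairs a URL with the depth at which it was discovered,
// so the crawler no longer needs a global "current depth".
type queueItem struct {
	URL   *url.URL
	Depth int
}

// Crawler holds a single persistent queue instead of one queue per level.
type Crawler struct {
	queue    []queueItem
	seen     map[string]bool
	maxDepth int
}

// merge appends newly discovered links to the single queue, one level
// deeper than the page they were found on, skipping URLs already seen
// or beyond the depth limit.
func (c *Crawler) merge(parent queueItem, links []*url.URL) {
	for _, link := range links {
		key := link.String()
		if c.seen[key] {
			continue
		}
		if parent.Depth+1 > c.maxDepth {
			continue
		}
		c.seen[key] = true
		c.queue = append(c.queue, queueItem{URL: link, Depth: parent.Depth + 1})
	}
}
```

Because the queue is persistent, a slow or timing-out page only delays its own fetch; links discovered elsewhere keep flowing into the same queue.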
Edit: I accidentally made these changes on master rather than a branch; I can revert and post separate PRs if you'd prefer. I've also added a MaxPages configuration option and updated the README to cover the new functionality. It works by checking the number of seen URLs within the merge method and skipping the merge once the seen count exceeds the MaxPages setting. This is handy for very large websites where you want to limit the scope of the crawl beyond the depth/include/exclude rules.
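
Building on the sketch above, the MaxPages guard could sit at the top of the merge step, roughly like this; the names and the treatment of zero as "no limit" are assumptions, not the project's actual implementation.

```go
// mergeLimited wraps the merge sketch above with an assumed MaxPages guard.
func (c *Crawler) mergeLimited(parent queueItem, links []*url.URL, maxPages int) {
	// Skip merging entirely once the crawler has already seen maxPages
	// URLs; assume maxPages <= 0 means "no limit".
	if maxPages > 0 && len(c.seen) >= maxPages {
		return
	}
	c.merge(parent, links)
}
```

Checking the seen count before merging means already-queued URLs still get crawled, but no new links are added once the cap is reached.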