
Use timestamps to crawl more intelligently

riptl opened this issue 6 years ago · 1 comment

One nice property of file systems is the timestamp on files; it helps to decide which sites are more important to recrawl. The crawler could return additional metadata, such as the date of the newest file(s), to be reused in subsequent recrawls.

That data can then be reused when a recrawl is under consideration, to decide whether recrawling the site is really worth it; other sites could be more important after all.

Here are some examples of rules that I made up:

Newest file date rule

The site was last crawled 2016-3-20. The latest file modification was 2005-3-20. It's highly unlikely that anything changed, so the next recrawl is 2026-3-20 (+XX years). It's 2019-2-13, don't recrawl yet.
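
A minimal sketch of how such a rule could look in Python; `last_crawled` and `newest_file` are hypothetical inputs, and the exact mapping from idle time to wait time is just one possible choice, not something decided in this issue:

```python
from datetime import datetime, timedelta

def next_recrawl_newest_file(last_crawled: datetime, newest_file: datetime) -> datetime:
    """Push the next recrawl further out the older the newest file is.

    If nothing was modified for years before the last crawl, it is unlikely
    anything will change soon, so the wait grows with the idle period.
    """
    idle = last_crawled - newest_file          # how stale the site already was at crawl time
    wait = max(timedelta(days=30), idle)       # never wait less than 30 days
    return last_crawled + wait

# Example from above: crawled 2016-03-20, newest file from 2005-03-20
# -> about 11 years idle, so the next recrawl lands a decade or so later.
print(next_recrawl_newest_file(datetime(2016, 3, 20), datetime(2005, 3, 20)))
```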

Hot paths rule: Directory contents near the newest files

The site was last crawled 2018-1-9. 50 hot paths were found, so their directory contents are requested. None of the hashes have changed, so the next recrawl is 2018-2-9 (+30 days). It's 2019-2-13, recrawl anyway.
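
A sketch of the hot-path check, assuming a directory listing can be fetched and hashed and compared against the hash stored at the previous crawl; `fetch_listing` and `stored_hashes` are hypothetical, not part of od-database:

```python
import hashlib
from datetime import datetime, timedelta

def hot_paths_unchanged(hot_paths, fetch_listing, stored_hashes) -> bool:
    """Re-request only the directories near the newest files and compare their hashes."""
    for path in hot_paths:
        listing = fetch_listing(path)                         # e.g. raw index page of the directory
        digest = hashlib.sha1(listing.encode()).hexdigest()
        if digest != stored_hashes.get(path):
            return False                                      # something changed -> full recrawl
    return True

def next_recrawl_hot_paths(last_crawled: datetime, unchanged: bool) -> datetime:
    # Unchanged hot paths only buy a short delay (+30 days in the example above);
    # any change means the site is due for a recrawl right away.
    return last_crawled + timedelta(days=30 if unchanged else 0)
```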

Modification count scoring: How often does a site get changed

The site was last crawled 2016-3-20. In the last 50 crawls (going back to 2014) the site didn't change (same root Merkle hash). It's highly unlikely that anything changed, so the next recrawl is 2026-3-20. It's 2019-2-13, don't recrawl yet.
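
A sketch of the scoring idea, assuming the root Merkle hash of each past crawl were kept (that history is an assumption; od-database does not necessarily store it):

```python
from datetime import datetime, timedelta

def next_recrawl_by_change_rate(last_crawled: datetime, root_hashes: list) -> datetime:
    """Scale the recrawl interval by how rarely the root Merkle hash changes.

    root_hashes: root hashes of the last N crawls, newest first (assumed history).
    """
    unchanged_streak = 0
    for newer, older in zip(root_hashes, root_hashes[1:]):
        if newer != older:
            break
        unchanged_streak += 1
    # Many identical crawls in a row -> wait years; frequent changes -> wait weeks.
    wait_days = min(30 * (1 + unchanged_streak), 3650)
    return last_crawled + timedelta(days=wait_days)
```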

Here are some nice links on the topic. Bing occasionally blogs about their research.

riptl · Feb 13 '19 01:02

A priority number (bigger = will be crawled sooner) is already in place for task handling. If you have time to figure out an algorithm for generating that number based on the data we currently have, you can post it in pseudocode below.

What I'd like to avoid:

  • storing any historical metadata
  • relying on the crawler to generate that number

What I'll probably end up doing is a periodic query over a subset of websites on ES that would look something like this:

SELECT {max date? or average date? or some other kind of aggregation based on timestamps?}
FROM Website WHERE id BETWEEN 0 AND 200

Then generate a priority number for each website in that batch of 200 websites, and start again with the next batch.
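
A rough sketch of that periodic pass against Elasticsearch, using the plain HTTP search API; the index name (`files`) and field names (`website_id`, `mtime`, assumed to be epoch seconds) are guesses for illustration, not the actual od-database schema:

```python
import time
import requests

ES_URL = "http://localhost:9200/files/_search"   # index name is an assumption

def batch_priorities(id_from: int, id_to: int) -> dict:
    """One pass over a batch of websites: aggregate the newest file date
    per website, then map it to a priority (bigger = crawled sooner)."""
    query = {
        "size": 0,
        "query": {"range": {"website_id": {"gte": id_from, "lte": id_to}}},
        "aggs": {
            "per_site": {
                "terms": {"field": "website_id", "size": id_to - id_from + 1},
                "aggs": {"newest": {"max": {"field": "mtime"}}},
            }
        },
    }
    resp = requests.post(ES_URL, json=query).json()

    now = time.time()
    priorities = {}
    for bucket in resp["aggregations"]["per_site"]["buckets"]:
        newest = bucket["newest"]["value"] or 0          # newest file date for this website
        age_days = (now - newest) / 86400
        # Recently modified sites get a high priority, stale ones a low one.
        priorities[bucket["key"]] = max(1, int(1000 - age_days))
    return priorities

# Process websites 0..200, then move on to the next batch of 200.
print(batch_priorities(0, 200))
```

Only documents already in ES are consulted, so no historical metadata has to be stored and the number is generated outside the crawler, which fits the constraints above.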

Another factor is that files get removed very often, and websites with old files are just as likely to close as new ones, so it is important to clean up the database regularly (although that might be a separate issue).

simon987 · Feb 13 '19 01:02