incubator-stormcrawler

Detect changes / Update timestamp in the meta data

Open cruftex opened this issue 5 years ago • 8 comments

If the crawl runs frequently to detect changes on a website, most fetches will find that the content has not actually changed. This leads to a lot of redundant operations. Idea:

Detect whether there is a change in the extracted content (not the fetched content), e.g. by storing a hash in the meta data.

Only update the index if a change is detected. Index updates are quite expensive.

Store a content hash and an update timestamp in the status metadata.

Further enhancements: Store the last n timestamps of a content update and maybe adjust the fetch frequency accordingly.
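A minimal sketch of the idea above, in plain Java and independent of any StormCrawler API. It hashes the extracted content, compares the hash with the one stored in the status metadata, and records a new hash and update timestamp only when they differ. The class name `ContentChangeDetector` and the metadata keys `content.hash` and `content.updated` are assumptions made for illustration; a plain `Map<String, String>` stands in for the status metadata.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.Map;

// Hypothetical helper for illustration; not part of StormCrawler.
class ContentChangeDetector {
    // Hypothetical metadata keys chosen for this sketch.
    static final String HASH_KEY = "content.hash";
    static final String UPDATED_KEY = "content.updated";

    // Returns the MD5 digest of the text as a lowercase hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Compares the hash of the extracted content with the stored one.
    // On a change, stores the new hash and the update timestamp and
    // returns true, meaning the document should go to the indexer.
    static boolean updateIfChanged(Map<String, String> statusMetadata,
                                   String extractedContent,
                                   Instant now) {
        String newHash = md5Hex(extractedContent);
        if (newHash.equals(statusMetadata.get(HASH_KEY))) {
            return false; // unchanged: skip the index update
        }
        statusMetadata.put(HASH_KEY, newHash);
        statusMetadata.put(UPDATED_KEY, now.toString());
        return true;
    }
}
```

A caller would invoke `updateIfChanged` after parsing and forward the document to the indexer only when it returns `true`. The first fetch always counts as a change because no hash is stored yet.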

cruftex avatar Sep 26 '18 10:09 cruftex

I started to hack this kind of functionality into the ES status updater and indexer. But that is not the 'right' place for it. It probably needs to go into the parser, or there needs to be another step after the parsing which decides whether the content is sent to the indexer or not.

cruftex avatar Sep 26 '18 10:09 cruftex

Have you looked at https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/persistence/AdaptiveScheduler.java ?

jnioche avatar Sep 26 '18 10:09 jnioche

Probably yes, but it was some time ago.

The scheduler is changing the fetch interval. That is one aspect of it.

It uses MD5SignatureParseFilter. But the signature makes more sense on the parsed data, not the fetched data, especially if the parsing is expensive and only extracts a few bits from the page.

Also I ran into issues with updating the status after the first fetch.

cruftex avatar Sep 26 '18 10:09 cruftex

Isn't that a case of having a different ParseFilter implementation, e.g. MetadataSignatureParseFilter, which would generate the same keys in the md as MD5SignatureParseFilter but not based on the text? The logic of the AdaptiveScheduler would remain the same.
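To make the suggestion concrete, here is a rough sketch of how a signature could be computed from selected metadata values instead of the page text. It is plain Java and does not implement the StormCrawler ParseFilter interface; the class name `MetadataSignatureSketch`, the method `signature`, and the idea of passing an explicit list of keys are assumptions for illustration only. Metadata is modelled as `Map<String, String[]>`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the StormCrawler ParseFilter API.
class MetadataSignatureSketch {

    // Builds an MD5 hex signature from the values of the given keys.
    // Keys are sorted first so the result does not depend on the
    // order in which they are listed. Keys that are not listed do
    // not influence the signature.
    static String signature(Map<String, String[]> metadata, List<String> keys) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        for (String key : sorted) {
            md.update(key.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between key and values
            String[] values = metadata.get(key);
            if (values != null) {
                for (String value : values) {
                    md.update(value.getBytes(StandardCharsets.UTF_8));
                    md.update((byte) 0); // separator between values
                }
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

A real filter would write this value under the same metadata key that the text-based signature uses, so that the scheduler logic can stay unchanged, as suggested above.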

jnioche avatar Sep 26 '18 10:09 jnioche

Hi Julien, I think the actual feature request is not satisfied:

  • Have an update timestamp, when a change in the parsed data is detected
  • Only send parsed data to the index updater, if a change was detected

Reopen, or should I split the different aspects into separate issues?

cruftex avatar Oct 02 '18 09:10 cruftex

It's totally okay if you don't want to address this right now. But maybe it's good to keep the issue open for somebody else to chime in.

cruftex avatar Oct 02 '18 09:10 cruftex

We could split it into 2 different things: the metadata-based signature generation on one hand and the partial updates on the other. Let's focus on the first in this issue.

jnioche avatar Oct 02 '18 09:10 jnioche

Have an update timestamp, when a change in the parsed data is detected

AdaptiveScheduler writes the change date detected by signature comparison to the metadata field last-modified. If protocol.md.prefix is not empty, no other component is writing into this field, so the field holds the update timestamp (or the time of the first fetch).

sebastian-nagel avatar Oct 06 '20 18:10 sebastian-nagel

No activity, closing down

jnioche avatar Dec 05 '23 16:12 jnioche