openwayback
Retrieving results after search
We installed the latest OpenWayback version, reindexed all crawled content using CDX, and started searching. Reviewing the results table after querying for a URL, some of the results have more than one entry for a date, even though only one crawl was done using Heritrix. Why? Also, some dates are marked with an *. I looked for the meaning of the * but couldn't find any information.
One possible reason is that multiple URLs with slight variants (e.g. www vs. no-www, http vs. https, or uppercase vs. lowercase) are grouped together due to URL canonicalization. It is also not impossible that Heritrix really did collect the same URL multiple times (check the crawl log).
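To illustrate how variants can collapse into a single entry, here is a rough sketch in Python. It is not OpenWayback's actual canonicalizer; `simple_key` and `group_by_key` are hypothetical helpers that only drop the scheme, strip a leading `www.`, and lowercase the rest, which is a simplification of what a real canonicalizer does.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def simple_key(url: str) -> str:
    """Hypothetical, simplified lookup key (NOT OpenWayback's real rules):
    ignore the scheme, strip a leading 'www.', and lowercase everything."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.lower() or "/"
    query = "?" + parts.query.lower() if parts.query else ""
    return host + path + query


def group_by_key(urls):
    """Group the original URLs under their simplified key."""
    groups = defaultdict(list)
    for url in urls:
        groups[simple_key(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    captured = [
        "http://www.example.com/Page",
        "https://example.com/page",
        "http://example.com/other",
    ]
    for key, variants in group_by_key(captured).items():
        print(key, variants)
```

If two variants were each captured once on the same day, they would show up as two entries under one queried URL, even though Heritrix ran only one crawl.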
The * means the content of the page changed on this date, as determined by comparing its SHA-1 digest with that of the previous snapshot.
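The digest comparison can be sketched roughly as follows. This is only an illustration of the idea, not OpenWayback's code: `mark_changes` is a hypothetical helper, the timestamps and digests are invented, and treating the very first capture as "changed" is an assumption.

```python
def mark_changes(captures):
    """Flag each capture whose digest differs from the previous one.

    `captures` is a list of (timestamp, digest) tuples sorted by timestamp.
    Returns a list of (timestamp, changed) tuples. The first capture has no
    predecessor and is flagged as changed here (an assumption).
    """
    marked = []
    previous_digest = None
    for timestamp, digest in captures:
        marked.append((timestamp, digest != previous_digest))
        previous_digest = digest
    return marked


if __name__ == "__main__":
    # Made-up timestamps and digests for illustration only.
    history = [
        ("20150101120000", "AAAA"),
        ("20150201120000", "AAAA"),  # same content as before: no *
        ("20150301120000", "BBBB"),  # content changed: *
    ]
    for timestamp, changed in mark_changes(history):
        print(timestamp, "*" if changed else "")
```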