warc-specifications

Please document possible hopsFromSeed values

Open RvanVeenendaal opened this issue 5 years ago • 2 comments

While validating WARCs at the National Archives of the Netherlands, we encountered the hopsFromSeed field. We could not find an explanation of its values other than on Twitter or in the source code of WARC tools. Please add the possible values (or those known to you) to the documentation. For example (from a Twitter thread from 2015):

  • L link, E embed, X speculative embed (probably from JavaScript), P prerequisite (robots.txt, DNS)
  • I inferred and S submit

And/or see the Heritrix source code: https://github.com/internetarchive/heritrix3/blob/d0ebd405782b0c33131ad72e3a76406a475bbf3f/modules/src/main/java/org/archive/modules/extractor/Hop.java
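To illustrate how a validator might interpret these letters, here is a minimal Python sketch. It assumes a hopsFromSeed value is a string with one letter per hop (for example "LLE"), and it only knows the letters listed in the thread above; the names `HOP_TYPES` and `decode_hops` are hypothetical, and the linked Hop.java remains the authoritative list.

```python
# Hop letters as listed in the Twitter thread quoted in this issue.
# Assumption: this list may be incomplete; Heritrix's Hop.java is authoritative.
HOP_TYPES = {
    "L": "link",
    "E": "embed",
    "X": "speculative embed",
    "P": "prerequisite",
    "I": "inferred",
    "S": "submit",
}


def decode_hops(path):
    """Return a human-readable label for each hop letter in `path`.

    Assumes `path` holds one letter per hop; letters that are not in
    HOP_TYPES are reported as unknown instead of raising an error.
    """
    return [HOP_TYPES.get(ch, f"unknown ({ch})") for ch in path]


if __name__ == "__main__":
    print(decode_hops("LLE"))
```

Reporting unknown letters rather than failing seems the safer choice for validation, since crawlers may emit hop types that are not covered here.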

RvanVeenendaal, Mar 02 '20 13:03

The documentation of Heritrix's discovery path is in Heritrix's Glossary, but indeed it wasn't very discoverable. I have slightly expanded the explanation and added a mention of hopsFromSeed, so hopefully it will start turning up in search results. I agree that since the WARC standard mentions hopsFromSeed, it should include an explanation of the values.

ato, Mar 02 '20 14:03

Can this be closed now that hopsFromSeed is documented in the Community Annotations? Or is the motivation here that it should be in a new version of the specification?

edsu, Mar 05 '24 16:03