
Crawl artifacts

Open ato opened this issue 5 years ago • 2 comments

Having a subdirectory for crawl/capture artifacts (configuration files, logs, reports, etc.) would be useful for storing or transporting an entire crawl job in a form that's also immediately compatible with WACZ-based replay tools. Obviously the details are tool-specific, but at least specifying a standard top-level structure would be great. It would also be good to include some specific examples for widely used tools.

While a one-to-one correspondence between WACZ files and crawl jobs would be a common use case, I think it's important to support combining multiple crawl jobs into one larger WACZ collection, so perhaps a structure like this would make sense:

/artifacts/heritrix/job1/crawler-beans.cxml
/artifacts/heritrix/job1/20200213092700/logs/crawl.log
/artifacts/heritrix/job2/crawler-beans.cxml
/artifacts/brozzler/myjob/config.yml
/artifacts/httrack/thirdjob/hts-log.txt
/artifacts/httrack/thirdjob/hts-cache/doit.log

ato avatar Jun 13 '20 00:06 ato

On further thought, it seems reasonable that one could perform a logical crawl or capture job involving multiple tools, so perhaps it'd be better to have the opposite order: /artifacts/{job}/{tool}/{tool-specific path}. As a real-life example, IA's contract crawls currently often involve both Heritrix and Brozzler.

/artifacts/two-tool-job/heritrix/crawler-beans.cxml
/artifacts/two-tool-job/brozzler/config.yml

That'd also mean a tool could target another tool's output format. For example, if Foocrawler produced a Heritrix-compatible crawl.log, it could reside in the same parent job directory under a heritrix path. Readers that understand Heritrix logs would know to look there even if they don't know anything about Foocrawler.

/artifacts/foojob1/foocrawler/config.ini
/artifacts/foojob1/heritrix/20200213092700/logs/crawl.log
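
To illustrate how a reader might use the {job}/{tool} ordering, here is a rough Python sketch that groups artifact paths by job and tool, then looks up every file stored under a given tool directory regardless of which crawler wrote it. The layout is only the proposal from this thread, not part of any spec, and the function names `group_artifacts` and `find_tool_files` are made up for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_artifacts(paths):
    """Group artifact paths by job, then by tool.

    Assumes the proposed (not standardized) layout
    /artifacts/{job}/{tool}/{tool-specific path}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for path in paths:
        parts = PurePosixPath(path).parts
        # Drop the leading "/" that absolute paths carry.
        if parts and parts[0] == "/":
            parts = parts[1:]
        # Need at least artifacts/{job}/{tool}/{file}; skip anything else.
        if len(parts) < 4 or parts[0] != "artifacts":
            continue
        job, tool = parts[1], parts[2]
        grouped[job][tool].append("/".join(parts[3:]))
    return {job: dict(tools) for job, tools in grouped.items()}


def find_tool_files(paths, tool):
    """Return (job, tool-specific path) pairs for one tool across all jobs."""
    return [
        (job, rel)
        for job, tools in group_artifacts(paths).items()
        for rel in tools.get(tool, [])
    ]


if __name__ == "__main__":
    example = [
        "/artifacts/two-tool-job/heritrix/crawler-beans.cxml",
        "/artifacts/two-tool-job/brozzler/config.yml",
        "/artifacts/foojob1/foocrawler/config.ini",
        "/artifacts/foojob1/heritrix/20200213092700/logs/crawl.log",
    ]
    for job, rel in find_tool_files(example, "heritrix"):
        print(job, rel)
```

A reader that only understands Heritrix output would call `find_tool_files(paths, "heritrix")` and never need to know that Foocrawler exists.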

ato avatar Jun 13 '20 03:06 ato

As mentioned in #1, does it make sense to just include them as is, if they're not being used in a specific way?

Thinking about this more, it seems that it would conflict with having a smaller spec, as suggested in #4.

For example, what should a WACZ compatible tool do with an arbitrary set of logs and config files to be WACZ compatible? Maybe the answer is nothing, other than list/extract them, for now?
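
As a rough idea of what "list/extract" could amount to, here is a minimal Python sketch using only the standard `zipfile` module, since a WACZ file is a ZIP package. The `artifacts/` prefix is just the directory name proposed in this issue, not something the spec defines, and `list_artifacts` / `extract_artifacts` are hypothetical helper names.

```python
import zipfile


def list_artifacts(wacz_path, prefix="artifacts/"):
    """Return the names of regular files stored under `prefix`.

    `prefix` defaults to the directory proposed in this thread (an assumption).
    """
    with zipfile.ZipFile(wacz_path) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.filename.startswith(prefix) and not info.is_dir()
        ]


def extract_artifacts(wacz_path, dest_dir, prefix="artifacts/"):
    """Extract only the files under `prefix` into dest_dir; return their names."""
    names = list_artifacts(wacz_path, prefix)
    with zipfile.ZipFile(wacz_path) as zf:
        # Other members (for example the archive/ directory) are left untouched.
        zf.extractall(dest_dir, members=names)
    return names
```

Nothing here interprets the files; a tool doing only this would treat logs and configs as opaque payload.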

Perhaps the spec should only define components whose use has semantic significance, while still allowing other data in a misc or other directory?

ikreymer avatar Aug 15 '20 07:08 ikreymer