Crawl job stats and reports misleading when excluding PDF files (follow-up to issue #453)

Open · oschihin opened this issue 3 years ago · 3 comments

Using the advice in Issue #453, I successfully excluded unwanted PDF documents from being fetched and written to the WARC. But this method seems to generate misleading reports and stats.

mimetype-report

The report shows PDF and ZIP files with counts and bytes, even though both were excluded:

[#urls] [#bytes] [mime-types]
6556 234271851 text/html
4193 8659344 application/pdf
42 1829002 image/jpeg
26 508206 text/css
23 239633 image/png
15 811627 application/javascript
14 1462995 application/vnd.openxmlformats-officedocument.wordprocessingml.document
9 18531 application/zip
7 1149664 image/svg+xml
4 49430 image/gif
2 97178 application/font-woff2
2 241457 application/vnd.ms-fontobject
2 240859 application/x-font-ttf
2 124253 application/x-font-woff
2 20934 text/xml
2 4400 unknown
1 212071 application/vnd.ms-excel
1 56 text/dns
1 2419 text/plain

Content-Type counts from the WARC file

If I grep and count the Content-Type fields in the WARC file, this is what I get. No PDF and no ZIP (a sketch of the tally follows the listing):

6702 Content-Type: application/warc-fields
6701 Content-Type: application/http; msgtype=response
6701 Content-Type: application/http; msgtype=request
6190 Content-Type: text/html;charset=UTF-8
 356 Content-Type: text/html; charset=iso-8859-1
  42 Content-Type: image/jpeg;charset=UTF-8
  26 Content-Type: text/css;charset=UTF-8
  23 Content-Type: image/png;charset=UTF-8
  15 Content-Type: application/javascript;charset=UTF-8
  14 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
  10 Content-Type: text/html
   7 Content-Type: image/svg+xml;charset=UTF-8
   4 Content-Type: image/gif;charset=UTF-8
   2 Content-Type: text/xml;charset=UTF-8
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: text/dns
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
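
For reference, a minimal Java sketch of the same tally (an illustration, not the exact command used; crawl.warc is a placeholder name for the gunzipped WARC file):

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;

    public class ContentTypeTally {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new TreeMap<>();
            // ISO-8859-1 accepts every byte value, so lines from binary
            // payloads do not break the reader
            try (BufferedReader in = Files.newBufferedReader(
                    Path.of("crawl.warc"), StandardCharsets.ISO_8859_1)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // same match as the grep: lines starting with "Content-Type:"
                    if (line.startsWith("Content-Type:")) {
                        counts.merge(line.trim(), 1, Integer::sum);
                    }
                }
            }
            counts.forEach((type, n) -> System.out.printf("%6d %s%n", n, type));
        }
    }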

Crawled Bytes

  • The total of crawled bytes according to the crawl summary is 249943910 (238 MiB)
  • The gzipped WARC file is 70 MB; unzipped, it is 333 MB

Problem

We use the reports and logs in our archive to get an overview of the content; in this case they are misleading, which is dangerous. Is there an explanation, and perhaps a fix, for this problem?

oschihin · Dec 20 '21 14:12

Since neither FetchHTTP choosing not to download the response body nor the WarcWriter choosing not to write the record changes the fetch status code of the CrawlURI, it is still considered a success for statistics purposes.
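
Concretely, the success test is keyed on the fetch status alone; a simplified sketch (a paraphrase, not the actual Heritrix source):

    // A mid-fetch abort leaves the HTTP 200 status on the CrawlURI,
    // so a status-based test like this still reports success.
    boolean isSuccess(CrawlURI curi) {
        return curi.getFetchStatus() > 0;
    }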

As for fixing it, WorkQueueFrontier.processFinish() is where the decision gets made: a URI is treated as a success, disregarded, or a failure. I suppose either the definitions of CrawlURI.isSuccess() and WorkQueueFrontier.isDisregarded() could be changed so that URIs with the midFetchAbort annotation are considered disregarded, or the abort itself could be changed to call setFetchStatus(S_OUT_OF_SCOPE).
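
For illustration, a sketch of the second option (hypothetical code, not an actual patch):

    // Hypothetical patch sketch, inside FetchHTTP where the body fetch is
    // aborted (the same spot that adds the "midFetchAbort" annotation):
    curi.setFetchStatus(FetchStatusCodes.S_OUT_OF_SCOPE);
    // WorkQueueFrontier.processFinish() would then route the URI through
    // the disregarded path instead of counting it as a crawled success.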

This would have some side-effects, though: extractors wouldn't run, the record wouldn't be written to the WARC file, and the request wouldn't be charged to the queue's budget. In your case those are desirable, as the goal is for the PDF to be treated as out of scope. I guess the question is whether there are other use cases for FetchHTTP's shouldFetchBodyRule where those side-effects would be undesirable.

ato · Dec 21 '21 11:12

Another idea: perhaps the full scope should be re-evaluated after the response headers are received. That would mean putting a content-type decide rule in the normal scope would "just work", which might be less surprising to the operator.
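
Sketched very roughly (hypothetical code; it assumes the fetcher could consult the crawl's scope DecideRule once the headers are in):

    // Hypothetical: after the response headers arrive, ask the scope again.
    // With a content-type decide rule in the scope, a PDF would be
    // rejected here and the body fetch aborted.
    if (scope.decisionFor(curi) == DecideResult.REJECT) {
        curi.setFetchStatus(FetchStatusCodes.S_OUT_OF_SCOPE);
        // stop reading the response body at this point
    }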

ato · Dec 21 '21 11:12

Thanks for the information. This makes sense, even if it is not a perfect situation for our use case; thinking about it, we can live with it. We do produce scope.log etc., and even though these logs tend to be pretty large, they show the effects of our scoping or appraisal decisions. We would need to explain that, but it makes for transparency.

I am rather sceptical about your second idea, if only for performance and runtime reasons.

oschihin · Dec 21 '21 15:12