WARCs of PDF include browser's wrapper

Open machawk1 opened this issue 7 years ago • 0 comments

When viewing a PDF in Chrome, the browser creates a faux DOM as a wrapper of the PDF content. For example:

<html><body style="height: 100%; width: 100%; overflow: hidden; margin: 0px; background-color: rgb(82, 86, 89);"><embed width="100%" height="100%" name="plugin" id="plugin" src="https://uri/of.pdf" type="application/pdf" internalinstanceid="10"></body></html>

In this case, the Content-Type: application/pdf in the HTTP response does not match what was captured. Whether the wrapper should be preserved may be a philosophical question, but there also seems to be a problem with capturing the PDF's contents. This may be because WARCreate checks the faux DOM for embedded resources and the Content-Type conflicts: the response is really application/pdf, but the presentation contains text/html content.

It is also unclear whether WARCreate is currently configured to scrape the URI in the Chrome-generated embed element.
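
To illustrate how the wrapper could be told apart from an ordinary page, here is a minimal sketch in Python using only the standard library. It is not WARCreate's code; the names `_EmbedFinder` and `pdf_wrapper_src` are hypothetical, and it assumes the wrapper always has the shape shown above (a body containing a single embed of type application/pdf and nothing else), which may not hold across Chrome versions.

```python
from html.parser import HTMLParser


class _EmbedFinder(HTMLParser):
    """Collects <embed> tags and counts any other element inside <body>."""

    def __init__(self):
        super().__init__()
        self.embeds = []          # attribute dicts of every <embed> seen
        self.other_body_tags = 0  # non-<embed> elements found inside <body>
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "embed":
            self.embeds.append(dict(attrs))
        elif self._in_body:
            self.other_body_tags += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False


def pdf_wrapper_src(html):
    """Return the embedded PDF URI if `html` looks like the browser's
    PDF wrapper (assumed shape: one PDF <embed>, no other body content),
    otherwise None."""
    finder = _EmbedFinder()
    finder.feed(html)
    finder.close()
    if len(finder.embeds) != 1 or finder.other_body_tags:
        return None
    embed = finder.embeds[0]
    if embed.get("type") != "application/pdf":
        return None
    return embed.get("src")
```

If a capture tool found that the response header says application/pdf while the DOM matches this shape, it could choose to fetch and store the URI returned here rather than (or in addition to) the serialized wrapper. Which of the two belongs in the WARC is the open question raised above.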

machawk1, Nov 28 '18 20:11