Content length in WARC response records for responses that contain binary image data is incorrect

Open machawk1 opened this issue 10 years ago • 7 comments

Fixing this would probably fix a few other issues down the line.

machawk1 avatar Feb 11 '14 14:02 machawk1

The image content is corrupted compared to an Archive-It WARC. Something's not right in the JS code that is storing the image data. Encoding, maybe?

machawk1 avatar Feb 11 '14 15:02 machawk1

Hex 89 is becoming hex EFBFBD. This sounds waaay too familiar, like a BOM issue.
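
For reference, EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character a decoder emits for a byte it cannot interpret. The sketch below reproduces the symptom by decoding the first bytes of a PNG signature as UTF-8 text and encoding the result again. It assumes an environment where `TextDecoder` and `TextEncoder` are available and is only an illustration, not the WARCreate code path.

```javascript
// First four bytes of a PNG file: 0x89 followed by "PNG".
var original = new Uint8Array([0x89, 0x50, 0x4e, 0x47]);

// Treat the bytes as UTF-8 text. 0x89 is not valid on its own,
// so the decoder substitutes U+FFFD for it.
var asText = new TextDecoder("utf-8").decode(original);

// Encode the string back to bytes, as happens when it is written out.
var roundTripped = new TextEncoder().encode(asText);

// Render the bytes as hex for comparison.
var hex = Array.from(roundTripped, function (b) {
  return b.toString(16).padStart(2, "0");
}).join("");

console.log(hex); // efbfbd504e47
```

The single byte 0x89 comes back as three bytes, which would also explain a content length that no longer matches the original.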

machawk1 avatar Feb 11 '14 15:02 machawk1

Part of the problem is that the Ajax call that fetches the image data has to be synchronous for the string building to work. Otherwise an ArrayBuffer or a Blob (see https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data ) could be used, but the W3C spec says these response types can only be requested via Ajax asynchronously.
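
A minimal sketch of what the asynchronous variant might look like, assuming the string building can be restructured to wait for a callback. `fetchBinary` is a hypothetical helper name, not something that exists in WARCreate.

```javascript
// Hypothetical helper: fetch a URL as raw bytes with an asynchronous XHR.
function fetchBinary(url) {
  return new Promise(function (resolve, reject) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", url, true); // third argument true = asynchronous
    // In a page context, responseType may only be set on async requests.
    xhr.responseType = "arraybuffer";
    xhr.onload = function () {
      if (xhr.status >= 200 && xhr.status < 300) {
        // Wrap the ArrayBuffer so individual bytes can be read.
        resolve(new Uint8Array(xhr.response));
      } else {
        reject(new Error("HTTP " + xhr.status));
      }
    };
    xhr.onerror = function () {
      reject(new Error("Network error"));
    };
    xhr.send();
  });
}
```

The resolved `Uint8Array` holds the response body byte for byte, so its `length` is the exact size of the image data with no text decoding involved.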

machawk1 avatar Feb 12 '14 18:02 machawk1

See http://stackoverflow.com/questions/21708000/acquring-raw-image-data-when-fetching-image-using-ajax

machawk1 avatar Feb 12 '14 18:02 machawk1

An alternative might be to capture the image data using the Chrome facilities when it first comes in, but the response handlers don't seem to have access to this data.

machawk1 avatar Feb 12 '14 18:02 machawk1

Woo, created a basic solution! Now, to scale it.

var hexValue = 0x89; // first byte of the PNG signature
var png = "PNG";

var hexValueArrayBuffer = new ArrayBuffer(1);
var hexValueInt8Ary = new Int8Array(hexValueArrayBuffer); // one-byte view over the buffer
hexValueInt8Ary[0] = hexValue;

var blob = new Blob([hexValueInt8Ary, png]); // binary byte followed by text
saveAs(blob, "out.txt");
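
One way this might scale, sketched under the assumption that each record is assembled as bytes rather than as one long string. `concatBytes` is a hypothetical helper that joins strings and byte arrays much like `new Blob([...])` does, and the resulting length can be used as the content length.

```javascript
// Hypothetical helper: join strings and byte arrays into one Uint8Array.
function concatBytes(parts) {
  var encoder = new TextEncoder(); // strings are encoded as UTF-8
  var chunks = parts.map(function (part) {
    return typeof part === "string" ? encoder.encode(part) : part;
  });
  // Sum the byte lengths so the output can be allocated once.
  var total = chunks.reduce(function (sum, chunk) {
    return sum + chunk.length;
  }, 0);
  var out = new Uint8Array(total);
  var offset = 0;
  chunks.forEach(function (chunk) {
    out.set(chunk, offset); // copy each chunk at its running offset
    offset += chunk.length;
  });
  return out;
}

// Same pieces as the one-byte experiment, plus a trailing CRLF.
var record = concatBytes([new Uint8Array([0x89]), "PNG\r\n"]);
var contentLength = record.length; // counts bytes, not characters
console.log(contentLength); // 6
```

Because the binary byte never passes through a string, 0x89 stays a single byte and the length reflects what is actually written.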

machawk1 avatar Feb 14 '14 15:02 machawk1

Content length is now correct for the simple case (mkdc) but not for large cases (e.g., CNN.com, FB).

machawk1 avatar Feb 18 '14 14:02 machawk1