warcreate

Special characters in a web page are mangled when saved to WARC

Open machawk1 opened this issue 11 years ago • 11 comments

For example, in MediaWiki, the → character is saved as a single character with hex value 92.

machawk1, Apr 01 '14 13:04

U+2192 → e2 86 92 RIGHTWARDS ARROW
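A minimal sketch of why hex 92 shows up, assuming the character's UTF-16 code unit is stored in an 8-bit slot somewhere along the way (the variable names are illustrative, not taken from WARCreate):

```javascript
// U+2192 RIGHTWARDS ARROW as a UTF-16 code unit and as UTF-8 bytes.
const arrow = "\u2192";
const codeUnit = arrow.charCodeAt(0); // 0x2192 (8594)

// Storing a 16-bit value in a Uint8Array keeps only the low 8 bits.
const truncated = new Uint8Array([codeUnit])[0]; // 0x92

// TextEncoder always produces UTF-8.
const utf8 = Array.from(new TextEncoder().encode(arrow), (b) =>
  b.toString(16).padStart(2, "0")
); // ["e2", "86", "92"]

console.log(truncated.toString(16), "vs", utf8.join(" "));
```

The low byte of the code unit happens to equal the last UTF-8 byte here, so the 92 in the WARC is consistent with truncation rather than with a partially written UTF-8 sequence.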

machawk1, Apr 01 '14 13:04

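Might be due to the characters being written into an 8-bit typed array (an Int8Array or Uint8Array view on an ArrayBuffer), where → requires more than 8 bits. For example:

"3".charCodeAt(0) --> 51
"3".charCodeAt(1) --> NaN
"→\u200E".charCodeAt(0) --> 8594 (U+2192)
"→\u200E".charCodeAt(1) --> 8206 (U+200E LEFT-TO-RIGHT MARK, an invisible character after the arrow)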

I'm pretty sure the image data needs to be routed through the Int8 function but the HTML (where this problem resides) and probably all text-based content might need to be sent through a different, but similar, Int8 function.
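The split described above could look roughly like this. Both function names are hypothetical, and the sketch assumes image data arrives as a "binary string" whose char codes are all in the range 0 to 255, while HTML arrives as an ordinary JavaScript string:

```javascript
// Hypothetical helper for binary strings: one char code (0..255) per byte.
function binaryStringToBytes(str) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i) & 0xff;
  }
  return bytes;
}

// Hypothetical helper for text: encode the string as UTF-8,
// so one character may become several bytes.
function textToUtf8Bytes(str) {
  return new TextEncoder().encode(str);
}

const pngMagic = binaryStringToBytes("\x89PNG"); // bytes 89 50 4e 47
const html = textToUtf8Bytes("<p>\u2192</p>");   // 3 + 3 + 4 = 10 bytes

console.log(pngMagic.length, html.length);
```

Routing UTF-8 text through the first helper is what would drop the high bits; routing binary strings through the second would inflate every byte above 0x7f into two bytes.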

warcGenerator.js, line 9.

machawk1, Apr 01 '14 13:04

No dice on simply changing var buf = new ArrayBuffer(str.length) to var buf = new ArrayBuffer(lengthInUtf8Bytes(str)) in str2ab(), around line 8 of warcgenerator.js. A single 8-bit byte is still produced for each out-of-range character in the WARC.
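A sketch of why resizing the buffer alone would not help, assuming str2ab() copies one charCodeAt() value per character into a Uint8Array. The function bodies below are reconstructions for illustration, not the code in warcgenerator.js, and lengthInUtf8Bytes is a stand-in implementation:

```javascript
// Stand-in for the lengthInUtf8Bytes() helper mentioned above.
function lengthInUtf8Bytes(str) {
  return new TextEncoder().encode(str).length;
}

// Assumed shape of the original helper: one array slot per character.
function str2abTruncating(str, byteLength) {
  const buf = new ArrayBuffer(byteLength);
  const view = new Uint8Array(buf);
  for (let i = 0; i < str.length; i++) {
    view[i] = str.charCodeAt(i); // only the low 8 bits are kept
  }
  return buf;
}

const sample = "a\u2192b";

// Larger buffer, same per-character write: 61 92 62 00 00
const resized = new Uint8Array(str2abTruncating(sample, lengthInUtf8Bytes(sample)));

// Encoding to UTF-8 before writing: 61 e2 86 92 62
const encoded = new TextEncoder().encode(sample);

console.log(Array.from(resized), Array.from(encoded));
```

Under these assumptions the larger buffer only adds trailing zero bytes. The write loop would have to emit the UTF-8 byte sequence for each character, which is what TextEncoder does.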

machawk1, Apr 01 '14 13:04

What might be the case is that the content sent to warcgenerator.js as o_request.docHtml is already mangled due to encoding issues of the string...

machawk1, Apr 01 '14 13:04

Alternate approach: convert the characters to an encoded form, e.g., → to an HTML entity such as &#8594;.

This is probably the wrong way to go about it, as it modifies the content and will likely lead to a world of hurt regarding Content-Length values.
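To make the length concern concrete, here is a small sketch comparing three lengths for the same text. It assumes the numeric entity &#8594; as the replacement, which is only one possible encoding, and the sample string is illustrative:

```javascript
const encoder = new TextEncoder();

const original = "Bug 50 Test \u2192 \u2192 \u2192";
const withEntities = original.replace(/\u2192/g, "&#8594;");

const charLength = original.length;                       // 17 UTF-16 code units
const utf8Length = encoder.encode(original).length;       // 23 bytes
const entityLength = encoder.encode(withEntities).length; // 35 bytes

console.log({ charLength, utf8Length, entityLength });
```

Any recorded length that was computed from the original UTF-8 payload would no longer match a payload rewritten with entities, and str.length matches neither byte count once non-ASCII characters are present.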

machawk1, Apr 01 '14 14:04

The console output before the send shows the correct → character. After the send, the character is still preserved as well, so this might come down to the Uint8 issue after all.

machawk1, Apr 01 '14 14:04

The same applies post-concatenation with HTTP headers, so it's not a string concat issue.

machawk1, Apr 01 '14 14:04

Test: http://warcreate.com/tests/bug50.html

Main contents (3 arrows):

Bug 50 Test → → →

In WARCreate WARC:

Bug 50 Test →→→

machawk1, Aug 11 '14 15:08

There might be hope in the chrome.devtools extension API.

machawk1, Apr 09 '15 19:04

"the APIs are available only through the lifetime of the DevTools window."

Thus, the info cannot be extracted unless the devtools window is open. Back to the drawing board.

machawk1, Apr 09 '15 19:04

outerHTML is used per https://github.com/machawk1/warcreate/blob/master/js/content.js#L250-L259

Asked for suggestions to resolve this behavior at https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/chromium-extensions/YA5xg6PaIVw .

machawk1, Apr 10 '15 17:04