amppackager Make a CLI tool for diagnosing payload errors

The tool would take AMPHTML on stdin, run parse-n-print (Process with no transformers) twice in series, and return a nonzero status if PnP(doc) != PnP(PnP(doc)). Optionally, it would output some sort of human-readable diff.

Apr 29 '19 23:04 twifkak

Can you add some context to this issue about what the use case would be for this tool?

Apr 30 '19 15:04 Gregable

cmd/transform/transform.go already exists to do PnP(doc):

transform -config NONE <path_to_file> OR | transform -config NONE

Should be simple enough to write a script to do the idempotency check.
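
A minimal sketch of such a script, in Python. It assumes the `transform` binary is on PATH, reads the document from stdin when no path is given, and writes the result to stdout; none of that is verified here. The names `run_transform`, `pnp_twice`, `readable_diff`, and `main` are made up for illustration.

```python
import difflib
import subprocess
import sys
from typing import Callable, List, Tuple


def run_transform(doc: bytes) -> bytes:
    """Pipe doc through `transform -config NONE`.

    Assumption: the binary is on PATH and prints the result to stdout.
    check=True raises CalledProcessError if it exits with nonzero status.
    """
    result = subprocess.run(
        ["transform", "-config", "NONE"],
        input=doc,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


def pnp_twice(doc: bytes, pnp: Callable[[bytes], bytes]) -> Tuple[bytes, bytes]:
    """Return (PnP(doc), PnP(PnP(doc)))."""
    once = pnp(doc)
    return once, pnp(once)


def readable_diff(once: bytes, twice: bytes) -> List[str]:
    """Unified diff of the two outputs, line by line; empty if they match."""
    a = once.decode("utf-8", errors="replace").splitlines()
    b = twice.decode("utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(a, b, "PnP(doc)", "PnP(PnP(doc))", lineterm="")
    )


def main() -> int:
    """Read a document from stdin; return 1 if PnP is not idempotent on it."""
    once, twice = pnp_twice(sys.stdin.buffer.read(), run_transform)
    if once == twice:  # compare raw bytes, not the decoded diff
        return 0
    print("\n".join(readable_diff(once, twice)))
    return 1
```

Ending the file with `sys.exit(main())` would turn it into a CLI that takes AMPHTML on stdin. Passing the PnP step in as a callable keeps the comparison logic testable without the binary.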

Apr 30 '19 17:04 alin04

@alin04 Good call.

@Gregable Sure. When Google starts improving its error reporting around SXG, one of the error messages will be "malformed HTML" caused by non-idempotency of the PnP function. The reason for this validation step is to ensure that the server-side AMPHTML validator is looking at the same interpretation of the HTML that browsers would. (The parse-n-print function does a few other sanitizations to the DOM, like removing comments, that affect how browsers interpret the content.)

We've done as much work as possible on our PnP function to make it idempotent for as many HTML documents as possible. However, not all HTML documents can be represented by a canonical, valid HTML serialization.

broken-h2 demonstrates a serialization that produces an invalid document (h2 inside h2), using the adoption agency algorithm. But you can see from the zeroth example that naive reserialization of this invalid DOM, when parsed again, would not produce the same result.
broken-form demonstrates a serialization that produces another invalid document (form inside form), using the logic in form end tag handling.

In order to produce canonical HTML serializations that are equivalent to the original document, we would need to produce invalid HTML, by reversing the various tree-munging algorithms (such as the above) in the spec. This would be harrrd (likely including having to upstream some additional instrumentation into the golang html parser). It may also go against our goal of ensuring the internal validator is looking at the same version of the HTML as browsers would—there are likely latent bugs in the internal validator's HTML parser w.r.t. parse error handling (maybe even some browsers?).

So, the "malformed HTML" error condition must remain. However, the client-side AMPHTML validator does not catch these errors. We should provide a tool to enable publishers to detect them before publication, and to diagnose/root-cause them after detection (by this tool or by Google).

Apr 30 '19 18:04 twifkak

This tool would be most useful if it could distinguish between the various types of payload errors:

invalid UTF-8
UTF-8 containing U+0000 NULL or a codepoint that causes a parse error
non-idempotent parse-and-print, via transform -config NONE

Bonus points for highlighting the byte position(s) of the error(s).

(Trying to heuristically identify the "root cause" that causes parse-and-print error is out of scope, but hey, triple bonus points.)
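
A rough sketch of how the three checks could be ordered, in Python. The function name `diagnose` and the finding labels are hypothetical, and the PnP step is passed in as a callable (for example, a wrapper around `transform -config NONE`). The check for "a codepoint that causes a parse error" is left out because the exact set of codepoints is not listed in this thread; only U+0000 is covered, and only its first occurrence is reported.

```python
from typing import Callable, List, Optional, Tuple

Finding = Tuple[str, Optional[int]]  # (error type, byte offset if known)


def diagnose(
    doc: bytes, pnp: Optional[Callable[[bytes], bytes]] = None
) -> List[Finding]:
    """Classify payload errors in doc, with byte offsets where available."""
    findings: List[Finding] = []

    # 1. Invalid UTF-8: err.start is the byte offset of the first bad byte.
    try:
        doc.decode("utf-8")
    except UnicodeDecodeError as err:
        findings.append(("invalid UTF-8", err.start))
        return findings  # the remaining checks assume well-formed UTF-8

    # 2. U+0000 NULL: encoded as a single zero byte in UTF-8.
    null_at = doc.find(b"\x00")
    if null_at != -1:
        findings.append(("U+0000 NULL", null_at))

    # 3. Non-idempotent parse-and-print: no byte offset is known here.
    if pnp is not None:
        once = pnp(doc)
        if pnp(once) != once:
            findings.append(("non-idempotent parse-and-print", None))

    return findings
```

For example, `diagnose(b"ab\xffcd")` reports invalid UTF-8 at byte offset 2. Stopping after a UTF-8 failure is a design choice in this sketch: offsets for later checks would be hard to interpret on a document that does not decode.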

Oct 17 '19 17:10 twifkak
