amppackager
                                
Make a CLI tool for diagnosing payload errors
Tool would take AMPHTML on stdin, run parse-n-print (Process with no transformers) twice in series, and then return nonzero status if PnP(doc) != PnP(PnP(doc)). Optionally, output some sort of human-readable diff.
Can you add some context to this issue about what the use case for this tool would be?
cmd/transform/transform.go exists already to do PnP(doc)
transform -config NONE <path_to_file>
OR
Should be simple enough to write a script to do the idempotency check.
@alin04 Good call.
@Gregable Sure. When Google starts improving its error reporting around SXG, one of the error messages will be "malformed HTML" caused by non-idempotency of the PnP function. The reason for this validation step is to ensure that the server-side AMPHTML validator is looking at the same interpretation of the HTML that browsers would. (The parse-n-print function does a few other sanitizations to the DOM, like removing comments, that affect how browsers interpret the content.)
We've done as much work as possible on our PnP function to make it idempotent for as many HTML documents as possible. However, not all HTML documents can be represented by a canonical, valid HTML serialization.
- broken-h2 demonstrates a serialization that produces an invalid document (h2 inside h2), using the adoption agency algorithm. But you can see from the zeroth example that naive reserialization of this invalid DOM, when parsed again, would not produce the same result.
- broken-form demonstrates a serialization that produces another invalid document (form inside form), using the logic in form end tag handling.
In order to produce canonical HTML serializations that are equivalent to the original document, we would need to produce invalid HTML, by reversing the various tree-munging algorithms (such as the above) in the spec. This would be harrrd (likely including having to upstream some additional instrumentation into the golang html parser). It may also go against our goal of ensuring the internal validator is looking at the same version of the HTML as browsers would—there are likely latent bugs in the internal validator's HTML parser w.r.t. parse error handling (maybe even some browsers?).
So, the "malformed HTML" error condition must remain. However, the client-side AMPHTML validator does not catch these errors. We should provide a tool to enable publishers to detect them before publication, and to diagnose/root-cause them after detection (by this tool or by Google).
This tool would be most useful if it could distinguish between the various types of payload errors:
- invalid UTF-8
- UTF-8 containing U+0000 NULL or a codepoint that causes a parse error
- non-idempotent parse-and-print, via transform -config NONE
Bonus points for highlighting the byte position(s) of the error(s).
(Trying to heuristically identify the "root cause" that causes parse-and-print error is out of scope, but hey, triple bonus points.)
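As a rough sketch of the first two categories, with byte positions, something like the following Go could work. It uses only the standard `unicode/utf8` package; `Finding` and `ScanPayload` are hypothetical names, and the scan covers only invalid UTF-8 and U+0000. Other codepoints that cause a parse error are left out here, since which ones to flag is an assumption this sketch does not make.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Finding records one payload problem and the byte offset where it starts.
// (Hypothetical type, for illustration only.)
type Finding struct {
	Offset int
	Kind   string
}

// ScanPayload walks the payload rune by rune and reports invalid UTF-8
// sequences and U+0000 NULL, each with its byte offset.
func ScanPayload(b []byte) []Finding {
	var out []Finding
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		switch {
		case r == utf8.RuneError && size == 1:
			// DecodeRune returns (RuneError, 1) for an invalid encoding;
			// a validly encoded U+FFFD has size 3 and is not flagged.
			out = append(out, Finding{Offset: i, Kind: "invalid UTF-8"})
		case r == 0:
			out = append(out, Finding{Offset: i, Kind: "U+0000 NULL"})
		}
		i += size
	}
	return out
}

func main() {
	payload := []byte("ab\x00c\xffd")
	for _, f := range ScanPayload(payload) {
		fmt.Printf("byte %d: %s\n", f.Offset, f.Kind)
	}
}
```

Running these checks before the parse-and-print comparison would let the tool report the most specific category first, since invalid bytes would likely also show up as a non-idempotent result.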