Make wpt own the HTML parser test data and remove dependency on html5lib-python, html5lib-tests
This week I've done the exercise of updating HTML parser tests again, though this time I was a bit more successful in figuring out how to get those changes through to wpt (see #2887). But boy is it painful and also mostly undocumented!
- Make the test data change to `html5lib-tests` (in the custom test data format). https://github.com/html5lib/html5lib-tests/pull/133
- Update `html5lib-python`'s submodule of `html5lib-tests` AND update `.pytest.expect` (manually?) so that html5lib itself doesn't fail the changed tests without having them marked as expected failures. https://github.com/html5lib/html5lib-python/pull/531
- Update the commit hash for `html5lib-python` in wpt's `html/tools/build.sh` and generate tests in `wpt` by running `html/tools/build.sh`. https://github.com/web-platform-tests/wpt/pull/27799
Juggling 3 repos for one change like this doesn't seem ideal for contributors. From wpt's perspective, what I would like instead is:
- Make the test data change in `wpt` and run a script to generate tests. No dependency on html5lib.
Then html5lib-python can get the tree-builder test data from wpt instead of from html5lib-tests.
Thoughts? @gsnedders @jgraham @annevk @stephenmcgruer
this is effectively a dupe of https://github.com/html5lib/html5lib-tests/issues/127 fwiw
@gsnedders oh, right, I had forgotten about that! It seems like there isn't objection. Are you still planning to work on this?
It is a long way down my list.
A tweak we can make is to depend on html5lib-tests instead of html5lib-python from wpt, which would remove the second step. (I think this was @jgraham's idea, but I don't see it mentioned on GitHub.)
One obvious (easy) tweak, given it's using git submodules, is to explicitly store a commit hash somewhere in WPT and then during update run `cd html5lib-python/html5lib/tests/testdata && git fetch origin && git checkout $REV`.
My main concern is that I want to preserve the file format as the preferred form for making modifications to the tests, since there are non-WPT consumers of those formats.
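A rough sketch of that pinned-commit update, written in Python rather than shell so the command construction can be checked in isolation. The function names and the idea of reading the hash from a dedicated file are hypothetical; only the `git fetch origin` / `git checkout $REV` steps and the `testdata` path come from the suggestion above.

```python
import subprocess
from pathlib import Path
from typing import List


def checkout_commands(rev: str) -> List[List[str]]:
    """Return the git commands that move the testdata checkout to `rev`."""
    return [
        ["git", "fetch", "origin"],
        ["git", "checkout", rev],
    ]


def update_testdata(testdata_dir: Path, rev_file: Path, dry_run: bool = False) -> List[List[str]]:
    """Read the pinned commit from `rev_file` (hypothetical) and check it out.

    `testdata_dir` would be html5lib-python/html5lib/tests/testdata.
    With dry_run=True the commands are returned without being executed.
    """
    rev = rev_file.read_text().strip()
    commands = checkout_commands(rev)
    if not dry_run:
        for cmd in commands:
            # check=True makes a failed fetch or checkout raise immediately.
            subprocess.run(cmd, cwd=testdata_dir, check=True)
    return commands
```

Keeping the hash in a file inside wpt means a test-data bump is a one-line diff there, with no change needed in html5lib-python.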
I'm not a fan of WPT having a build step that transforms the tree builder test format. FWIW, Gecko's mochitest harness stores the original .dat format in the repo and parses it when the tests are run.
Having the source files in the same format in wpt and parsing them with JS when running sounds ideal actually. Can that parser be migrated to wpt?
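To give a sense of how little is needed to read the `.dat` files directly, here is a minimal sketch of a tree-construction test parser. It is in Python for brevity; in wpt the equivalent would be JS in the harness. This is not Gecko's or html5lib's parser. The heading list is an assumption that should be checked against the html5lib-tests documentation, the sample input is made up for illustration, and a `#data` section that itself contains a line equal to a heading is not handled.

```python
from typing import Dict, Iterator, List

# Section headings assumed to appear in tree-construction .dat files.
HEADINGS = {
    "#data", "#errors", "#new-errors", "#document-fragment",
    "#script-on", "#script-off", "#document",
}


def _finish(sections: Dict[str, List[str]]) -> Dict[str, str]:
    """Join each section's lines, dropping trailing blank separator lines."""
    out = {}
    for name, lines in sections.items():
        while lines and lines[-1] == "":
            lines = lines[:-1]
        out[name] = "\n".join(lines)
    return out


def parse_dat(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per test, mapping section name (without '#') to its text."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.split("\n"):
        if line in HEADINGS:
            # A new "#data" heading starts the next test.
            if line == "#data" and sections:
                yield _finish(sections)
                sections = {}
            current = line[1:]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    if sections:
        yield _finish(sections)


# Illustrative input only, not copied from the real test suite.
SAMPLE = """#data
<p>One<p>Two
#errors
(1,3): expected-doctype-but-got-start-tag
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"

#data
<b>x
#errors
#document
| <html>
|   <head>
|   <body>
|     <b>
|       "x"
"""

tests = list(parse_dat(SAMPLE))
```

A harness would then feed each test's `data` section to the parser under test, serialize the resulting DOM in the same `| ` indented form, and compare it with the `document` section.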
Having worked on a parser bug in WebKit I now think this would be even more valuable than I previously thought. It looks like Chromium and WebKit both have two sets of parser tests in the tree:
- Some html5lib-tests fork of unspecified vintage
- web-platform-tests's import of html5lib-tests
And the former has tests the latter might not contain. I contributed further to this problem in https://github.com/WebKit/WebKit/pull/12019, but am willing to be part of the cleanup crew if we make web-platform-tests the true home of HTML parser tests.
I suspect @mfreed7 might be interested in this from the Chromium side. Copying here to gather interest.
I'm definitely supportive of the effort to clean this up, and make WPT the source of truth for parser tests.
Steps taken thus far:
- Upstreamed WebKit-specific tests: https://github.com/html5lib/html5lib-tests/commit/4f45c0211cf1d1f1af319470f77851f60f29914c
- Working on a new import in https://github.com/web-platform-tests/wpt/pull/39305
I wonder if @zcorpan is still interested in taking this even further as I think it would definitely be preferable if we didn't have to go via html5lib-tests.
https://github.com/html5lib/html5lib-tests does have a number of actionable issues and stale PRs worth triaging. Help appreciated.
Yes. See https://github.com/html5lib/html5lib-tests/issues/127#issuecomment-1490501826 and later comments.
@zcorpan any progress on this?
Not yet but it's on my list.
Friendly bump-up.
html5lib-python seems pretty much dead. The last commit was in Feb 2024. Even the removal of six and other PRs have been open for over a year now. It's time we look for alternatives.
I wonder if we can move to `html.parser` like pip did (https://github.com/pypa/pip/pull/10291).