html icon indicating copy to clipboard operation
html copied to clipboard

Static checker for the spec?

Open jmdyck opened this issue 1 year ago • 10 comments

What is the issue with the HTML Standard?

While I imagine that the build process detects certain kinds of errors in the source file, it clearly fails to detect a lot too.

I've been looking at the feasibility of developing a static checker for the HTML spec. It seems to me that chunks of the spec (in particular, the algorithms) are formulaic enough that they might be amenable to parsing, followed by static analysis. This could potentially detect:

  • typos
  • incorrect or missing markup
  • argument/parameter mismatches (e.g., missing args, extra args, out-of-order args)
  • variable errors (use-before-definition, defined with no use)
  • "type" mismatches

(The idea is roughly similar to what I've done for the ECMAScript spec over at my ecmaspeak-py repo.)

So, two questions:

  • Has this idea come up before? If so, I'd be interested to see the discussion. (I looked in issues and PRs (open and closed) for anything like this, but didn't find anything. I particularly combed through the labels "spec tooling" and "meta".)

  • Would the editors be amenable if I were to suggest tweaks to the spec to make this more feasible? I don't think it would be anything major, I would mostly just identify cases where the spec uses multiple different ways to express the same thing, and have the editors pick the approved way(s) to express it.

jmdyck avatar Jul 15 '24 21:07 jmdyck

Awesome to see you over here. Given your recent PRs, and how you've improved the JS spec, I was hoping you'd post something like this :)

There are some previous discussions on small aspects of this in https://github.com/whatwg/html-build/issues and https://github.com/whatwg/meta/issues . E.g. https://github.com/whatwg/meta/issues/190 or https://github.com/whatwg/html-build/issues/89 . But nothing too comprehensive.

We would definitely be amenable to making the spec more consistent and accepting PRs that do so.

If we were to make this a more official/required part of the CI process, I'd have a few small things to raise:

  • How does this integrate with our existing lint.sh? Can it replace or subsume it?
  • Right now Python is not a required dependency for html-build. Indeed, we've been trying to write new parts of the build tooling in Rust. Ideally any new tooling would be part of that Rust pipeline. But, having tooling is better than not having tooling, so if Python is your preference, go for it!
  • HTML is very special in various ways. But, some proposals are probably applicable across all WHATWG specs. If there were a way to factor those shared checks out and run them on all WHATWG specs, that would be a nice bonus. (Or at least all WHATWG specs where the editor wants to be consistent with HTML.)
  • Since you mentioned the algorithms: for non-HTML specs, where we use Bikeshed, we try to add <div algorithm>/</div> wrappers (which Bikeshed converts to <div class="algorithm">/</div>). This has two consequences: 1) Bikeshed checks every <var> is used at least once, to catch typos; 2) Bikeshed includes a script which lets you click on a variable to highlight other uses of it. It'd be awesome to do at least 2) for HTML, and maybe 1), but nobody has taken the time to wrap up the algorithms.

domenic avatar Jul 16 '24 04:07 domenic

This sounds great! The lint rules for non-HTML are in https://github.com/whatwg/whatwg.org/blob/main/resources.whatwg.org/build/deploy.sh FWIW.

annevk avatar Jul 16 '24 07:07 annevk

This has two consequences: 1) Bikeshed checks every var is used at least once, to catch typos; 2) Bikeshed includes a script which lets you click on a variable to highlight other uses of it. It'd be awesome to do at least 2) for HTML, and maybe 1), but nobody has taken the time to wrap up the algorithms.

2 is done by an isolated JS script, too, which should be very easy to drop in: https://github.com/speced/bikeshed/blob/main/bikeshed/stylescript/var-click-highlighting.js (and https://github.com/speced/bikeshed/blob/main/bikeshed/stylescript/var-click-highlighting.css for the supporting CSS)

1 is done in Python, but the algo is fairly simple and should be an easy port to JS or Rust: https://github.com/speced/bikeshed/blob/main/bikeshed/unsortedJunk.py#L134 (all the h.* methods are over in the dom module)

tabatkins avatar Jul 16 '24 15:07 tabatkins

If we were to make this a more official/required part of the CI process,

I don't know. I'm expecting (based on my experience with the ECMAScript spec) that this tool will produce a lot of 'false positives', which presumably is not something you want in a CI process.

  • How does this integrate with our existing lint.sh? Can it replace or subsume it?

Possibly.

At first I thought no, because lint.sh greps through the full text of the spec, whereas I'm thinking that my checker would focus only on those chunks where it has a hope of 'understanding' what the text is saying, and skip everything else. But I suppose instead of skipping those other chunks entirely, it could at least do a grep for certain patterns like lint.sh does.

  • Right now Python is not a required dependency for html-build. Indeed, we've been trying to write new parts of the build tooling in Rust. Ideally any new tooling would be part of that Rust pipeline. But, having tooling is better than not having tooling, so if Python is your preference, go for it!

I do like Python, but over the course of the ecmaspeak project, I've become somewhat frustrated with it, and the reasons would probably apply to this project too.

I'm interested in Rust, but the learning curve looks steep, and I'm doubtful that this is a good first Rust project.

I've used Crystal for some preliminary work, but I imagine you'd be even less keen on adding that to the build pipeline.

  • [...] If there were a way to factor those shared checks out and run them on all WHATWG specs, that would be a nice bonus. (Or at least all WHATWG specs where the editor wants to be consistent with HTML.)

I agree, but I think it'll to be hard enough just to target the HTML spec. Applicability to other specs would be farther down the road.

  • for non-HTML specs, [...], we try to add <div algorithm>/</div> wrappers [...], but nobody has taken the time to wrap up the algorithms [in the HTML spec].

I could certainly start by wrapping up the algorithms in the HTML spec, as that would make them easier to find for all my subsequent work. But it raises a bunch of questions, which I think I'll pose in a separate issue.

jmdyck avatar Jul 17 '24 01:07 jmdyck

I could certainly start by wrapping up the algorithms in the HTML spec, as that would make them easier to find for all my subsequent work. But it raises a bunch of questions, which I think I'll pose in a separate issue.

In order to figure out all the questions, I started to do the work, and just kept going. So it'll probably be a draft PR rather than an issue. But something else has come up, so it'll probably be a few weeks before I get back to it.

jmdyck avatar Jul 23 '24 13:07 jmdyck

Re wrapping each algorithms in a <div algorithm> (or something similar)...

When doing such a wrapping, we presumably want to include the algorithm's name (if any), parameters (if any), and instructions/steps.

In lots of cases this is straightforward. However, in some cases it isn't.


(1) dt/dd

Cases where the facets of an algorithm are spread over a paired <dt> and <dd> in a <dl>.

(You can't wrap a <div algo> around a <dt> + <dd> pair.)

E.g., consider 2.6.1 Reflecting content attributes in IDL attributes, under "For a reflected target that is an element element, ...", there's a <dl> containing the 4 associated algorithms. For each algorithm:

  • Its name (and parameter, for set the content attribute) appear in a <dt>.
  • Its steps appear in a <dd>.
  • The 'target variable' / pseudo-parameter (element) appears in a <p> before the <dl>.

Short of rearranging all of the above, the only way to wrap these is to put the whole <p> + <dl> in a <div algo>, but then that's putting 4 algorithms into one <div algo>, which is probably suboptimal.

Other similar cases:

  • 2.7.1 "Serializable objects", in the example
  • 7.2.2.6 "Script settings for Window objects", in step 3
  • 8.1.4.2 "Fetching scripts", for "set up the {classic,module} script request"
  • 8.9.1.1 "Client identification", for the behavior of NavigatorID attributes
  • 8.9.1.2 "Language preferences", for the behavior of NavigatorLanguage attributes
  • 10.2.6.2 "Script settings for workers", in step 4
  • 11.3.1.3 "Script settings for worklets", in step 6

I don't think these have to use a <dl>. E.g., the example in 2.7.1 defines [de]serialization steps in a <dl>, but other places in the spec that define them just use a <p> + <ol> for each algorithm. And 8.9.1.{1,2} give IDL attribute behavior in a <dl>, but lots of places define attribute behavior without using a <dl>.

So I'd be inclined to explode these <dl> elements and replace each <dt> + <dd> pair with a <div algo>. Mind you, that would change the rendering. So maybe <div alg> could have/allow substructure that would enable a nice rendering.


(2) hN

Cases where the algorithm's name + parameters appear in a heading element (e.g., <h4>).

To be clear, I think it would be valid to wrap the <hN> + <ol> in a <div algo>, but I'm not sure how well that would go over. (E.g., would it confuse Wattsi?)

Specifically:

Some of these have a <p> after the heading element that repeats the operation name and (less often) the parameters, in which case you could div-wrap the <p> + <ol> instead of including the <hN> in the div. So that could be done for all, creating <p> elements as necessary. But note that then the <div algo> doesn't contain the <dfn> for the algorithm.


(3) w-nodev

Cases where, between an algorithm's preamble <p> and steps <ol>, there's a <div w-nodev> tag, so you can't just wrap a div around the <p> + <ol>.

Two cases:

jmdyck avatar Jun 11 '25 22:06 jmdyck

(You can't wrap a <div algo> around a <dt> + <dd> pair.)

Yes you can - <div> is now allowed in the content model of <dl>, specifically to let you group dt/dd.

Cases where the algorithm's name + parameters appear in a heading element (e.g., <h4>).

Should be able to use a <section algo> wrapper around the heading+content, then.

tabatkins avatar Jun 11 '25 23:06 tabatkins

<div> is now allowed in the content model of <dl>, specifically to let you group dt/dd.

Huh, that's odd, I did specifically look at the dl content model. Just not very well, I guess. Anyhow, neat!

Cases where the algorithm's name + parameters appear in a heading element (e.g., <h4>).

Should be able to use a <section algo> wrapper around the heading+content, then.

Yeah, but using <section> elements in the source file is not really feasible (unless we overhaul/replace Wattsi): https://github.com/whatwg/html/issues/5649#issuecomment-2878521319

jmdyck avatar Jun 12 '25 00:06 jmdyck

(1) dt/dd

Sounds like we're OK in this case. I do think you've identified some cases where dl was a strange choice, so let me give my personal aesthetic take on them.

  • 2.6.1 Reflecting content attributes in IDL attributes

dl seems reasonable, there's a list of algorithms that appears twice consistently.

  • 2.7.1 "Serializable objects", in the example

Using dl seems strange here and doesn't match other serialization steps defined in the spec, e.g. for ImageData.

  • 7.2.2.6 "Script settings for Window objects", in step 3
  • 10.2.6.2 "Script settings for workers", in step 4
  • 11.3.1.3 "Script settings for worklets", in step 6

Definitely a good fit for dl. Also I'm not sure I'd classify these as <div algorithm> cases since they're closures inside a larger algorithm.

  • 8.1.4.2 "Fetching scripts", for "set up the {classic,module} script request"

dl is a bit strange here, a more normal algorithm declaration like https://infra.spec.whatwg.org/#example-algorithm-declaration-short would make more sense to me.

  • 8.9.1.1 "Client identification", for the behavior of NavigatorID attributes

These are very old style and should use the modern "getter steps" style.

  • 8.9.1.2 "Language preferences", for the behavior of NavigatorLanguage attributes

Ditto, although the definition of languages is even worse than usual.

(2) hN

I personally have leaned away from using headings as algorithm names recently. I think especially for abstract operation names it looks ugly, doesn't match our usual style, and crowds up the table of contents.

If I were to re-do these, I would just convert them to <dfn>s, and get rid of most of the headings. E.g. [[Call]] can go direclty under 2.6.3.1, and 2.7.3-2.7.8 can all get lumped into a new subsection under 2.7 of "Structured serialization and transfer algorithms" or something.

Similarly, I think the host hooks (8.1.6) probably could just become <dfn>s under the existing sections. Maybe group 8.1.6.1-8.1.6.5 under a new "General host hooks" subsection.

I'm not sure everyone would agree with the above suggestions though, so it'd be good to solicit more opinions before changing them.

For "Loading a document for inline content that doesn't have a DOM", we've already converted adjacent sections (e.g., "loading media documents") to separate the heading from the algorithm declaration. We should just do that there as well.

Anyway, if we don't go with the "fix everything to use <dfn>" approach, then the best workaround I can suggest is to omit the headings from those particular <div algorithm> cases, with a <!-- TODO --> comment in the source. I agree that Wattsi will probably break if we try to wrap them, like we found when trying to do the <section> work.

(3) w-nodev

  • We could handle this by splitting off the last sentence of the <p>.

Into a new <p>, you mean? Seems like a good idea.

  • We could handle this by extending the scope of the <div algo> to past the end of the <div w-nodev> element.

I think this case is best handled by splitting the w-nodev sections like so:

<div algorithm>
<p>To make use of...</p>

<ol w-nodev>
 ...
</ol>

</div>

<p class="note" w-nodev>Another, perhaps better, ...</p>

Yeah, but using <section> elements in the source file is not really feasible (unless we overhaul/replace Wattsi): #5649 (comment)

FYI, I think the best step here would be to replace the part of Wattsi that splits the existing singlepage HTML spec into a multipage spec.

I tried to vibe-code such a splitter in Rust but didn't succeed in the timebox I set for myself. I wrote up the task in https://github.com/whatwg/html-build/issues/298 if someone else wants to try.

domenic avatar Jun 12 '25 04:06 domenic

See PR #11379

(1) dt/dd

  • 2.7.1 "Serializable objects", in the example

Using dl seems strange here and doesn't match other serialization steps defined in the spec, e.g. for ImageData.

Done.

  • 7.2.2.6 "Script settings for Window objects", in step 3
  • 10.2.6.2 "Script settings for workers", in step 4
  • 11.3.1.3 "Script settings for worklets", in step 6

Definitely a good fit for dl. Also I'm not sure I'd classify these as <div algorithm> cases since they're closures inside a larger algorithm.

Yeah, that gets into the question of what exactly should be wrapped, which I'm leaving for later.

  • 8.1.4.2 "Fetching scripts", for "set up the {classic,module} script request"

dl is a bit strange here, a more normal algorithm declaration like https://infra.spec.whatwg.org/#example-algorithm-declaration-short would make more sense to me.

Done.

  • 8.9.1.1 "Client identification", for the behavior of NavigatorID attributes

These are very old style and should use the modern "getter steps" style.

Done.

  • 8.9.1.2 "Language preferences", for the behavior of NavigatorLanguage attributes

Ditto, although the definition of languages is even worse than usual.

Done.

(2) hN

I personally have leaned away from using headings as algorithm names recently. I think especially for abstract operation names it looks ugly, doesn't match our usual style, and crowds up the table of contents.

If I were to re-do these, I would just convert them to <dfn>s, and get rid of most of the headings.

I'm not positive what you mean by "convert them to <dfn>s", since most have a <dfn> already (and the ones that don't are internal methods, which get multiple definitions, so don't really rate a <dfn>?). I basically just changed the <hN> tags to <p> tags.

2.7.3-2.7.8 can all get lumped into a new subsection under 2.7 of "Structured serialization and transfer algorithms" or something.

Done, with slightly different wording.

Maybe group 8.1.6.1-8.1.6.5 under a new "General host hooks" subsection.

Done.

I'm not sure everyone would agree with the above suggestions though, so it'd be good to solicit more opinions before changing them.

I'm hoping a PR constitutes a solicitation for opinions.

For "Loading a document for inline content that doesn't have a DOM", we've already converted adjacent sections (e.g., "loading media documents") to separate the heading from the algorithm declaration. We should just do that there as well.

Done. (That's the only one that isn't just changing hN to p.)

(3) w-nodev

[4.13.4]

  • We could handle this by splitting off the last sentence of the <p>.

Into a new <p>, you mean? Seems like a good idea.

Done.

[11.2.2] I think this case is best handled by splitting the w-nodev sections like so: ...

Done.

jmdyck avatar Jun 14 '25 15:06 jmdyck