ecmarkup Linting the spec

I would like to start enforcing some basic sanity checks for the spec. Some of them will end up implemented in other projects, but I would like to use this issue to track possible lint rules everywhere.

The main goal of this is to reduce editorial churn and correctness issues, but I know that it would also be helpful to people who build tooling based on the spec, like https://github.com/jscert/jsexplain (so cc @brabalan in case you have any ideas).

Currently these are mostly enforced by editors noticing them or @jmdyck submitting after-the-fact PRs, which isn't sustainable.

Here's a few possibilities to get started. Please suggest more.

Output is valid HTML
~No trailing whitespace~
In algorithms:
- Consistent spacing, e.g. ~foo ( a, b ) in algorithm headers and~ foo(a, b) in algorithm steps.
- Parameter names are surrounded by _
- ~~If an If has substeps, the If line ends with , then~~
- If an If has an Else which has substeps, it is spelled Else, not Otherwise
- If an If has an else on the same line, it is spelled ; else rather than , else or ; else, or ; otherwise (we are inconsistent about else vs otherwise here, and I am OK not enforcing it, but we should at least enforce the semicolon and lack of following space).
- ~~Lines end in ., :, ,, do, or then~~
- ~~Any line not ending in . is followed by an indented sequences of steps.~~
- Consistent casing for variable names - currently we're very bad about this one
- If we say "A and B" or "A, B, and C", the commas are as they are in those examples (rather than "A, and B" or "A, B and C")
- Consistently "let x be" and "set x to" (e.g. not "let x to", "set x be", or "set x to be")
- Ban If _foo_ is present, let _bar_ be _foo_; else let _bar_ be *undefined*. (https://github.com/tc39/ecma262/pull/1411) [edit: actually that PR only applied to functions, not AOs, so we can only do this if we first sort out the treatment of missing parameters in AO e.g. by banning optional parameters in AOs)
- "... to a new empty List" rather than "to a new List", "to be a new List", "to be a new empty List"
  - Probably something similar for records
- The last step should not be Return., since that's implicit (as of https://github.com/tc39/ecma262/pull/2397).
Grammar lookahead restrictions and flags are omitted in early error definitions and syntax-directed operations
In the grammar:
- Lookahead restrictions and flags do not have spaces between brackets (so [lookahead != `let`] or [+Await] rather than [ lookahead != `let` ] or [ +Await ], etc)
- One space before the : on the LHS
- One space between each terminal or nonterminal on the LHS
- ~Grammar flags are consistent: every ?-prefixed flag on the RHS appears as a parameter flag on the LHS, and every LHS flag is passed down at least one nonterminal in one production on the RHS, or is used to gate a production (see more about what grammar flags mean here)~
A bunch of HTML consistency stuff:
- ~~tags are lowercase~~ https://github.com/tc39/ecmarkup/pull/367
- ~~attribute values are quoted (or unquoted; I don't care as long as we're consistent)~~
- tags don't have any unexpected attributes (e.g.)
- tags have all the expected attributes (e.g. <emu-xref="foo>" should be caught, since they meant <emu-xref href="foo">)
- for tags for which the closing tag is optional, they are always included (or not included; again, as long as we're consistent)
- ~~≥ is spelled ≥ (etc)~~ https://github.com/tc39/ecmarkup/pull/481
- ~~no unknown emu tags~~ https://github.com/tc39/ecmarkup/pull/279
- sec- is only a prefix for an ID when attached to a clause (cf https://github.com/tc39/ecma262/pull/2103)
- all rows in a table have the same number of cells (account for colspan)
- ~~consistent indentation~~
- emu-grammar and emu-alg tags are not adjacent with others of their kind
- ~no more than one blank line in a row~
 - ~and maybe consistent rules about where blank lines go, at least in some cases: e.g., never between <emu-clause> and <h1>.~
- Consistent spacing for records: exactly the spacing in { [[key]]: value, [[key2]]: value2 }
Consistent spelling
- ~the *this* value, not *this* value or the *this* object~
  - we have a lint for "this object" but not no-"the" "this value" because the latter form is still in use (generally as an argument to an AO call)
- ~British vs American spelling for words where it's an issue, like "behaviour"~
- ~"one's complement" and "two's complement", not "1's complement" or "2's complement"~
- "uppercase" and "lowercase", not "upper case" or "upper-case" (https://github.com/tc39/ecma262/pull/2598)
Annex A ("Grammar Summary") has emu-prodrefs to all productions (or, ideally, is automatically generated)
As of https://github.com/tc39/ecma262/pull/1914, every abstract operation has a preamble in the correct format (though, what is an abstract operation from the perspective of ecmarkup? - I guess an emu-clause with an AOID is a reasonable heuristic.)
When the steps or prose for a syntax-directed operation refer to the name of a nonterminal, it is surrounded with |, as in |UnicodeLeadSurrogate|.
In syntax-directed operations,
- All referenced nonterminals occur in the production for the SDO.
- In the algorithm steps or prose, opt subscripts and grammar parameters are not included.
Miscellaneous stuff:
- ~~Always *+0*𝔽 or *-0*𝔽, never *0*𝔽. ~~(https://github.com/tc39/ecmarkup/pull/257)
- "be the Record {", not "be a new Record {"
- Never *+1* for any string of digits except 0.
- ~~Exactly one space between sentences.~~
-  is not followed by a linebreak, and  is not preceded by a linebreak (even with intervening whitespace).
- ~~[Cc]lauses? \d should be forbidden.~~
- An inline if does not have a then: If foo, return false. not If foo, then return false.
- "ECMAScript language value", not "ECMAScript value"
No namespace collisions between constants and AOs (and maybe other namespaces)
All AOs have structured headers (once people have had time to get used to the new syntax)
No unnecessary explicit suppressions / annotations for can-call-user-code
Every algorithm returns.
consistently "a number or a bigint", in that order, in types: https://github.com/tc39/ecma262/pull/2622
~~no unused Let bindings or captured variables in AOs: https://github.com/tc39/ecma262/pull/2836~~ https://github.com/tc39/ecmarkup/pull/483
If _x_ is *null*, return *null* rather than If _x_ is *null*, return _x_, per https://github.com/tc39/ecma262/pull/3122 - anywhere there's a comparison against a thing in *s or ~s, or a literal number or +/-∞, and then the alias is returned, we should return the constant instead.

It would also be nice to have a few more static-analysis-y checks, like

The grammar should be unambiguous
- And LR(k)
Typechecking for abstract operations
- All used operations are defined
- ~~They are invoked with the right number of arguments~~
- ~~They don't reference any values which are not defined~~ https://github.com/tc39/ecmarkup/pull/483
  - And when updating an already defined value, this is done with set rather than let, increase, increment, add, etc.
- ~~Their return value is treated as a completion record or not as a completion record, as appropriate (pending https://github.com/tc39/ecma262/issues/1796)~~
- In an ideal world, actual typechecking for values
  - Given that, enforce the * vs ~ vs _ vs " rules for referring to different kinds of values
  - As a particular case, algorithms should say If _x_ is *true*, not If _x_ is true.
~~Grammar productions have all and only the appropriate flags~~
~~All syntax-directed operations correspond to actual productions~~

Edit: some of these are done in #199, #205, #207, #209, #210, and #239. I've struck them from the above list. I'm keeping this issue open to track remaining items.

Feb 24 '20 19:02 bakkot

All used operations are defined

This would need to be amended for AOs that are only called upstream, but at least it would force us to enumerate such AOs.

Feb 24 '20 19:02 syg

As an additional step here, I would like for us to define a bunch of "pattern matchers" of some sort that identify idiomatic spec phrasing, and use a special rendering of the spec that highlights/styles these idiomatic sections of text so that we can identify algorithm steps and whatnot that (unexpectedly) do not conform. We might even want to track/relate the idiomatic phrasing in the same way we do AO references.

Feb 24 '20 20:02 michaelficarra

I'll point out that ecmaspeak-py does almost all of what @bakkot listed, and more. Not that that helps ecmarkup a whole lot, but you're welcome to adapt my code.

Feb 25 '20 01:02 jmdyck

@jmdyck Can you expand on the "and more"? What other things does it do that aren't listed in @bakkot's comment?

edit: I see you posted a link. I'll check out the code.

Feb 25 '20 03:02 michaelficarra

In addition to the things @bakkot listed, ...

analyze_spec.py checks that...

Only expected elements are used, and that their content conforms to expected content models.
re 'id' attributes:
- There are no duplicates.
- They conform to certain expectations (e.g., the 'id' of an emu-table should start with "table-").
- A name isn't both an 'id' and an 'oldid'.
- There's a defining 'id' for every id-reference.
- Defined ids (with many exceptions) are referenced somewhere in the spec.
Some of the above similarly for 'aoid' attributes.
emu-tables with certain kinds of caption have expected column-heads.
The content of the well-known intrinsics table conforms to certain expectations.
Each %Foo% in the spec (outside the intrinsics table) resolves to one in the intrinsics table.

Section.py checks that...

clause-headers conform to certain expectations.
Within certain clauses (e.g., the properties of object Foo), sub-clauses are in 'alphabetical order'.

emu_grammars.py checks that...

Every emu-grammar element is syntactically well-formed.
Every production in a 'definition' emu-grammar "makes sense" in various ways, especially in its use of grammatical parameters.
The total set of defining productions makes sense as a whole.
Every production in a non-definition emu-grammar is a version of some defining production. (In general, it needn't be an exact copy of one, for various reasons.)
Outside of emu-grammar elements, references to nonterminals are well-defined.
(It also generates a good approximation to Annex A.)
(It also has in-progress code to generate a parser.)

Pseudocode.py ...

Attempts to parse every chunk of pseudocode (emu-alg, early error, emu-eqn, inline SDO definition), and complains about any failures. (See *.grammar for the grammars it uses.)
(Builds data structures for use in static analysis.)
Builds a static call graph and looks for anomalies.
Attempts to check SDO coverage (See tc39/ecma262#1301)

static_type_analysis.py...

Extracts parameter info + return type info from preambles (and complains about any anomalies it sees).
Attempts to perform an interprocedural static type analysis of all the pseudocode (and complains about any type errors it detects).
Generates a version of the spec with 'algorithm headers' (See tc39/ecma262#545)

Feb 25 '20 04:02 jmdyck

@bakkot

Grammar lookahead restrictions and flags are omitted in early error definitions and syntax-directed operations

I think you mean included, right?

Mar 30 '20 21:03 michaelficarra

If we decide they should be universally included, sure.

Mar 30 '20 21:03 bakkot

@bakkot I thought that's what we decided in the editor call. We even have it on our editor update slides.

Mar 30 '20 21:03 michaelficarra

I only remembered that we were vaguely in support of it, not that we necessarily intended to actually change it. But if we did, sure, sounds good.

Mar 30 '20 21:03 bakkot

I'm not good at remembering those sorts of things without notes. We can confirm at the next call.

edit: In a later editor call, we did in fact confirm that universal inclusion is desired.

Mar 30 '20 21:03 michaelficarra

emu_grammars.py checks that...

Every emu-grammar element is syntactically well-formed.

Every production in a 'definition' emu-grammar "makes sense" in various ways, especially in its use of grammatical parameters.

The total set of defining productions makes sense as a whole.

Every production in a non-definition emu-grammar is a version of some defining production. (In general, it needn't be an exact copy of one, for various reasons.)

Outside of emu-grammar elements, references to nonterminals are well-defined.

(It also generates a good approximation to Annex A.)

(It also has in-progress code to generate a parser.)

grammarkdown already does a lot of this, ecmarkup just doesn't employ those features.

Apr 03 '20 00:04 rbuckton

Feel free to file bugs against grammarkdown for anything you think needs to be added to the parser, checking, etc.: https://github.com/rbuckton/grammarkdown

Apr 03 '20 00:04 rbuckton

Nesting more than one AO invocation as parameters to another AO should be disallowed. See https://github.com/tc39/ecma262/pull/1573#issuecomment-553719055

Apr 06 '20 22:04 michaelficarra

To confirm, it seems like they should be disallowed only because the current macro doesn't handle them, not because they're inherently something we don't want?

Apr 06 '20 23:04 ljharb

The issue there isn't about the macro, it's the fact that there's no evaluation order for the arguments defined, and when they're abstract operations they can have side effects which must be specified to happen in a specific order.

Apr 07 '20 00:04 bakkot

In algorithm steps, all aliases should be introduced using Let or with a parameter list before they are used. (see tc39/ecma262#1954)
Forbid introduction of aliases that are never referenced.

Apr 20 '20 16:04 michaelficarra

In algorithm steps, each alias should only be introduced with Let at most once for any possible trace through the algorithm.

Apr 21 '20 18:04 michaelficarra

In algorithm steps, all aliases should be introduced using Let or with a parameter list before they are used.

There are other ways we introduce aliases. Other than those Let and parameter lists (including parameter lists of abstract closures), we also have various permutations on:

For each _x_ (For each element _k_ of _keys_, For each integer _k_ that satisfies ..., etc)
Evaluate |Foo| to obtain _x_, which is used only in regexes and is complicated by the fact that some of the regex operations return multiple results
If there exists _x_ (e.g.)
the smallest possible integer _k_ not smaller than (e.g.)
For yield we say something like when evaluation is resumed with a Completion _resumptionValue_ the following steps will be performed, which introduces _resumptionValue_ for use in those steps
For MakeDay we say Find a value _t_ such that

Also, we occasionally refer to aliases from other algorithms. For example, the Function constructor says Let _args_ be the _argumentsList_ that was passed to this function by [[Call]].

Also also, some places we have algorithms (especially but not uniquely in Annex B, e.g. in the Note at the end of Number.prototype.toExponential) we're actually defining replacements to existing steps in to other algorithms, in which case it is not possible to locally determine which aliases are in scope.

May 10 '20 04:05 bakkot

Also, the things which are currently console.log warnings should become linting errors, ideally.

Edit: done in https://github.com/tc39/ecmarkup/pull/229.

Jun 14 '20 17:06 bakkot

We should figure out if we want to use html entities for non-ASCII stuff, and enforce that decision. Currently we use both « and «.

Jul 14 '20 22:07 bakkot

Since I'm on a Mac, it's easy to type the actual characters, so I'd prefer that; but if others editing the spec aren't, it might not be as easy.

Jul 14 '20 22:07 ljharb

It's 2020, embrace the Unicode.

Jul 14 '20 23:07 michaelficarra

We should figure out if we want to use html entities for non-ASCII stuff, and enforce that decision.

Why not use ASCII quotes and convert automatically in build time? Here is how markdown-it does this.

If an If has an else on the same line, it is spelled ; else rather than , else or ; else, or ; otherwise

Instead of enforcing consistency, how about adding tags for this, e.g.

1. <emu-if><cond>x is 1 and y is 1</cond><then>return 1</then></emu-if>
1. <emu-else>return 0</emu-else>

and

1. <emu-if-multiline><cond>_exponent_ is *+&infin;*<sub>𝔽</sub></cond><then>
  1. <emu-if><cond>abs(ℝ(_base_)) &gt; 1</cond><then>return *+&infin;*<sub>𝔽</sub></then></emu-if>
  1. <emu-if><cond>abs(ℝ(_base_)) is 1</cond><then>return *NaN*</then></emu-if>
  1. <emu-if><cond>abs(ℝ(_base_)) &lt; 1</cond><then>return *+0*<sub>𝔽</sub></then></emu-if>
</then></emu-if-multiline>
1. <emu-else-multiline>...</emu-else-multiline>

This would allow enforcing a consistent style, would also allow authors to not have to remember where commas/semicolons go, and would also let you change the style across the entire document anytime (after all ifs have been converted of course).

May 06 '21 20:05 NotWearingPants

Instead of enforcing consistency, how about adding tags for this

The ecmarkdown page says:

Some of Ecmarkdown's biggest benefits are when using it to write algorithm steps, without the many formalities HTML requires for list items and inline formatting of common algorithmic constructs.

Adding tags as you suggest would make algorithms harder to write + edit (and read, in source).

May 06 '21 22:05 jmdyck

ecmarkup ecmarkup copied to clipboard

Linting the spec

ecmarkup
ecmarkup copied to clipboard