ecmarkup
Linting the spec
I would like to start enforcing some basic sanity checks for the spec. Some of them will end up implemented in other projects, but I would like to use this issue to track possible lint rules everywhere.
The main goal of this is to reduce editorial churn and correctness issues, but I know that it would also be helpful to people who build tooling based on the spec, like https://github.com/jscert/jsexplain (so cc @brabalan in case you have any ideas).
Currently these are mostly enforced by editors noticing them or @jmdyck submitting after-the-fact PRs, which isn't sustainable.
Here are a few possibilities to get started. Please suggest more.
- Output is valid HTML
- ~No trailing whitespace~
- In algorithms:
  - Consistent spacing, e.g. ~`foo ( a, b )` in algorithm headers and~ `foo(a, b)` in algorithm steps.
  - Parameter names are surrounded by `_`
  - ~~If an `If` has substeps, the `If` line ends with `, then`~~
  - If an `If` has an `Else` which has substeps, it is spelled `Else`, not `Otherwise`
  - If an `If` has an `else` on the same line, it is spelled `; else` rather than `, else` or `; else,` or `; otherwise` (we are inconsistent about `else` vs `otherwise` here, and I am OK not enforcing it, but we should at least enforce the semicolon and lack of following space).
  - ~~Lines end in `.`, `:`, `,`, `do`, or `then`~~
  - ~~Any line not ending in `.` is followed by an indented sequence of steps.~~
  - Consistent casing for variable names - currently we're very bad about this one
  - If we say "A and B" or "A, B, and C", the commas are as they are in those examples (rather than "A, and B" or "A, B and C")
  - Consistently "let x be" and "set x to" (e.g. not "let x to", "set x be", or "set x to be")
  - Ban `If _foo_ is present, let _bar_ be _foo_; else let _bar_ be *undefined*.` (https://github.com/tc39/ecma262/pull/1411) [edit: actually that PR only applied to functions, not AOs, so we can only do this if we first sort out the treatment of missing parameters in AOs, e.g. by banning optional parameters in AOs]
  - "... to a new empty List" rather than "to a new List", "to be a new List", "to be a new empty List"
    - Probably something similar for records
  - The last step should not be `Return.`, since that's implicit (as of https://github.com/tc39/ecma262/pull/2397).
- Grammar lookahead restrictions and flags are omitted in early error definitions and syntax-directed operations
- In the grammar:
  - Lookahead restrictions and flags do not have spaces between brackets (so ``[lookahead != `let`]`` or `[+Await]` rather than ``[ lookahead != `let` ]`` or `[ +Await ]`, etc)
  - One space before the `:` on the LHS
  - One space between each terminal or nonterminal on the LHS
  - ~Grammar flags are consistent: every `?`-prefixed flag on the RHS appears as a parameter flag on the LHS, and every LHS flag is passed down at least one nonterminal in one production on the RHS, or is used to gate a production (see more about what grammar flags mean here)~
- A bunch of HTML consistency stuff:
  - ~~tags are lowercase~~ https://github.com/tc39/ecmarkup/pull/367
  - ~~attribute values are quoted (or unquoted; I don't care as long as we're consistent)~~
  - tags don't have any unexpected attributes (e.g.)
  - tags have all the expected attributes (e.g. `<emu-xref="foo>"` should be caught, since they meant `<emu-xref href="foo">`)
  - for tags for which the closing tag is optional, they are always included (or not included; again, as long as we're consistent)
  - ~~`&ge;` is spelled `≥` (etc)~~ https://github.com/tc39/ecmarkup/pull/481
  - ~~no unknown `emu` tags~~ https://github.com/tc39/ecmarkup/pull/279
  - `sec-` is only a prefix for an ID when attached to a clause (cf https://github.com/tc39/ecma262/pull/2103)
  - all rows in a table have the same number of cells (account for colspan)
  - ~~consistent indentation~~
  - `emu-grammar` and `emu-alg` tags are not adjacent with others of their kind
  - ~no more than one blank line in a row~
    - ~and maybe consistent rules about where blank lines go, at least in some cases: e.g., never between `<emu-clause>` and `<h1>`.~
- Consistent spacing for records: exactly the spacing in `{ [[key]]: value, [[key2]]: value2 }`
- Consistent spelling
  - ~`the *this* value`, not `*this* value` or `the *this* object`~ - we have a lint for "this object" but not no-"the" "this value" because the latter form is still in use (generally as an argument to an AO call)
  - ~British vs American spelling for words where it's an issue, like "behaviour"~
  - ~"one's complement" and "two's complement", not "1's complement" or "2's complement"~
  - "uppercase" and "lowercase", not "upper case" or "upper-case" (https://github.com/tc39/ecma262/pull/2598)
- Annex A ("Grammar Summary") has `emu-prodref`s to all productions (or, ideally, is automatically generated)
- As of https://github.com/tc39/ecma262/pull/1914, every abstract operation has a preamble in the correct format (though, what is an abstract operation from the perspective of ecmarkup? - I guess an emu-clause with an AOID is a reasonable heuristic.)
- When the steps or prose for a syntax-directed operation refer to the name of a nonterminal, it is surrounded with `|`, as in `|UnicodeLeadSurrogate|`.
- In syntax-directed operations,
  - All referenced nonterminals occur in the production for the SDO.
  - In the algorithm steps or prose, `opt` subscripts and grammar parameters are not included.
- Miscellaneous stuff:
  - ~~Always `*+0*<sub>𝔽</sub>` or `*-0*<sub>𝔽</sub>`, never `*0*<sub>𝔽</sub>`.~~ (https://github.com/tc39/ecmarkup/pull/257)
  - "be the Record {", not "be a new Record {"
  - Never `*+1*` for any string of digits except `0`.
  - ~~Exactly one space between sentences.~~
  - `<p>` is not followed by a linebreak, and `</p>` is not preceded by a linebreak (even with intervening whitespace).
  - ~~`[Cc]lauses? \d` should be forbidden.~~
  - An inline `if` does not have a `then`: `If foo, return false.` not `If foo, then return false.`
  - "ECMAScript language value", not "ECMAScript value"
- No namespace collisions between constants and AOs (and maybe other namespaces)
- All AOs have structured headers (once people have had time to get used to the new syntax)
- No unnecessary explicit suppressions / annotations for can-call-user-code
- Every algorithm returns.
- consistently "a number or a bigint", in that order, in types: https://github.com/tc39/ecma262/pull/2622
- ~~no unused `Let` bindings or captured variables in AOs: https://github.com/tc39/ecma262/pull/2836~~ https://github.com/tc39/ecmarkup/pull/483
- `If _x_ is *null*, return *null*` rather than `If _x_ is *null*, return _x_`, per https://github.com/tc39/ecma262/pull/3122
- anywhere there's a comparison against a thing in `*`s or `~`s, or a literal number or +/-∞, and then the alias is returned, we should return the constant instead.
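
A few of the purely textual rules above could probably be prototyped as per-step regular expressions. The sketch below is only an illustration in Python, not how ecmarkup implements anything: the rule names and the `lint_step` helper are made up for this example, and it assumes each algorithm step is available as one line of text with the step number already stripped.

```python
import re

# Hypothetical rule table: (rule name, pattern that matches a violation).
RULES = [
    # "; else" is the accepted inline spelling; flag ", else" and "; else,".
    ("inline-else-comma", re.compile(r", else\b")),
    ("inline-else-trailing-comma", re.compile(r"; else,")),
    # "let x be" and "set x to" are accepted; flag the mixed forms.
    ("let-to", re.compile(r"\b[Ll]et _\w+_ to\b")),
    ("set-be", re.compile(r"\b[Ss]et _\w+_ (?:to )?be\b")),
    # An inline If should not say "then" before its single action.
    ("inline-if-then", re.compile(r"^If .*, then \S")),
]


def lint_step(step: str) -> list[str]:
    """Return the names of all rules that the given step text violates."""
    return [name for name, pattern in RULES if pattern.search(step)]
```

For example, `lint_step("If foo, then return false.")` reports `inline-if-then`, while a step that ends in `, then` and is followed by substeps is left alone.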
It would also be nice to have a few more static-analysis-y checks, like
- The grammar should be unambiguous
  - And LR(k)
- Typechecking for abstract operations
  - All used operations are defined
  - ~~They are invoked with the right number of arguments~~
  - ~~They don't reference any values which are not defined~~ https://github.com/tc39/ecmarkup/pull/483
    - And when updating an already defined value, this is done with `set` rather than `let`, `increase`, `increment`, `add`, etc.
  - ~~Their return value is treated as a completion record or not as a completion record, as appropriate (pending https://github.com/tc39/ecma262/issues/1796)~~
- In an ideal world, actual typechecking for values
  - Given that, enforce the `*` vs `~` vs `_` vs `"` rules for referring to different kinds of values
  - As a particular case, algorithms should say `If _x_ is *true*`, not `If _x_ is true`.
- ~~Grammar productions have all and only the appropriate flags~~
- ~~All syntax-directed operations correspond to actual productions~~
Edit: some of these are done in #199, #205, #207, #209, #210, and #239. I've struck them from the above list. I'm keeping this issue open to track remaining items.
> All used operations are defined

This would need to be amended for AOs that are only called upstream, but at least it would force us to enumerate such AOs.
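
One way to handle AOs that are only called upstream would be an explicit allowlist that the check consults. The following Python sketch is a guess at the shape of such a check, not existing tooling: `check_operations`, the `upstream_only` set, and the heuristic that an invocation is a capitalized name directly followed by `(` are all assumptions made for illustration.

```python
import re

# Assumed heuristic: an AO invocation looks like "Name(" with a capitalized name.
CALL = re.compile(r"\b([A-Z][A-Za-z0-9]*)\(")


def check_operations(algorithms: dict[str, list[str]], upstream_only: set[str]):
    """algorithms maps each defined AO name to the text of its steps.

    Returns (undefined, unused): names that are invoked but never defined,
    and names that are defined but never invoked and not allowlisted.
    """
    defined = set(algorithms)
    used = set()
    for steps in algorithms.values():
        for step in steps:
            used.update(CALL.findall(step))
    undefined = sorted(used - defined)
    unused = sorted(defined - used - upstream_only)
    return undefined, unused
```

With this shape, every AO that is never invoked inside the spec has to appear in `upstream_only`, which is the enumeration mentioned above.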
As an additional step here, I would like for us to define a bunch of "pattern matchers" of some sort that identify idiomatic spec phrasing, and use a special rendering of the spec that highlights/styles these idiomatic sections of text so that we can identify algorithm steps and whatnot that (unexpectedly) do not conform. We might even want to track/relate the idiomatic phrasing in the same way we do AO references.
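
To make the "pattern matchers" idea a little more concrete, here is a minimal Python sketch. The idiom names and regular expressions are invented for this example and cover only a handful of phrasings; a real set would be much larger and would probably be grammar-based rather than regex-based.

```python
import re

# Hypothetical catalogue of idiomatic step shapes, checked in order.
IDIOMS = {
    "let": re.compile(r"^Let _\w+_ be .+\.$"),
    "set": re.compile(r"^Set .+ to .+\.$"),
    "return": re.compile(r"^Return .+\.$"),
    "assert": re.compile(r"^Assert: .+\.$"),
    "if-inline": re.compile(r"^If .+, (?!then\b).+\.$"),
    "if-block": re.compile(r"^If .+, then$"),
    "else-block": re.compile(r"^Else,$"),
}


def classify(step: str):
    """Return the name of the first idiom the step matches, or None."""
    for name, pattern in IDIOMS.items():
        if pattern.match(step):
            return name
    return None


def non_idiomatic(steps: list[str]) -> list[str]:
    """Return the steps that match no known idiom, for highlighting."""
    return [step for step in steps if classify(step) is None]
```

A rendering pass could then style every step by its `classify` result and highlight whatever `non_idiomatic` returns.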
I'll point out that ecmaspeak-py does almost all of what @bakkot listed, and more. Not that that helps ecmarkup a whole lot, but you're welcome to adapt my code.
@jmdyck Can you expand on the "and more"? What other things does it do that aren't listed in @bakkot's comment?
edit: I see you posted a link. I'll check out the code.
In addition to the things @bakkot listed, ...
analyze_spec.py checks that...
- Only expected elements are used, and that their content conforms to expected content models.
- re 'id' attributes:
  - There are no duplicates.
  - They conform to certain expectations (e.g., the 'id' of an emu-table should start with "table-").
  - A name isn't both an 'id' and an 'oldid'.
  - There's a defining 'id' for every id-reference.
  - Defined ids (with many exceptions) are referenced somewhere in the spec.
- Some of the above similarly for 'aoid' attributes.
- emu-tables with certain kinds of caption have expected column-heads.
- The content of the well-known intrinsics table conforms to certain expectations.
- Each `%Foo%` in the spec (outside the intrinsics table) resolves to one in the intrinsics table.
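
For readers who have not looked at analyze_spec.py, the 'id' checks can be approximated in a few lines with Python's standard `html.parser`. This is an independent sketch, not code taken from ecmaspeak-py: the `IdCollector` and `check_ids` names are made up, and it assumes id-references are `href="#..."` attributes.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects defined ids, '#'-style references, and suspicious table ids."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.refs = []
        self.bad_table_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        element_id = attributes.get("id")
        if element_id:
            self.ids.append(element_id)
            # Expectation from the list above: emu-table ids start with "table-".
            if tag == "emu-table" and not element_id.startswith("table-"):
                self.bad_table_ids.append(element_id)
        href = attributes.get("href") or ""
        if href.startswith("#"):
            self.refs.append(href[1:])


def check_ids(source: str) -> dict[str, list[str]]:
    collector = IdCollector()
    collector.feed(source)
    collector.close()
    counts = Counter(collector.ids)
    return {
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
        "dangling": sorted(set(collector.refs) - set(collector.ids)),
        "unreferenced": sorted(set(collector.ids) - set(collector.refs)),
        "bad_table_ids": sorted(collector.bad_table_ids),
    }
```

The "unreferenced" result would need an exception list in practice, matching the "with many exceptions" caveat above.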
Section.py checks that...
- clause-headers conform to certain expectations.
- Within certain clauses (e.g., the properties of object Foo), sub-clauses are in 'alphabetical order'.
emu_grammars.py checks that...
- Every emu-grammar element is syntactically well-formed.
- Every production in a 'definition' emu-grammar "makes sense" in various ways, especially in its use of grammatical parameters.
- The total set of defining productions makes sense as a whole.
- Every production in a non-definition emu-grammar is a version of some defining production. (In general, it needn't be an exact copy of one, for various reasons.)
- Outside of emu-grammar elements, references to nonterminals are well-defined.
- (It also generates a good approximation to Annex A.)
- (It also has in-progress code to generate a parser.)
Pseudocode.py ...
- Attempts to parse every chunk of pseudocode (emu-alg, early error, emu-eqn, inline SDO definition), and complains about any failures. (See *.grammar for the grammars it uses.)
- (Builds data structures for use in static analysis.)
- Builds a static call graph and looks for anomalies.
- Attempts to check SDO coverage (See tc39/ecma262#1301)
static_type_analysis.py...
- Extracts parameter info + return type info from preambles (and complains about any anomalies it sees).
- Attempts to perform an interprocedural static type analysis of all the pseudocode (and complains about any type errors it detects).
- Generates a version of the spec with 'algorithm headers' (See tc39/ecma262#545)
@bakkot
> Grammar lookahead restrictions and flags are omitted in early error definitions and syntax-directed operations

I think you mean included, right?
If we decide they should be universally included, sure.
@bakkot I thought that's what we decided in the editor call. We even have it on our editor update slides.
I only remembered that we were vaguely in support of it, not that we necessarily intended to actually change it. But if we did, sure, sounds good.
I'm not good at remembering those sorts of things without notes. We can confirm at the next call.
edit: In a later editor call, we did in fact confirm that universal inclusion is desired.
> emu_grammars.py checks that...
> - Every emu-grammar element is syntactically well-formed.
> - Every production in a 'definition' emu-grammar "makes sense" in various ways, especially in its use of grammatical parameters.
> - The total set of defining productions makes sense as a whole.
> - Every production in a non-definition emu-grammar is a version of some defining production. (In general, it needn't be an exact copy of one, for various reasons.)
> - Outside of emu-grammar elements, references to nonterminals are well-defined.
> - (It also generates a good approximation to Annex A.)
> - (It also has in-progress code to generate a parser.)

`grammarkdown` already does a lot of this, `ecmarkup` just doesn't employ those features.
Feel free to file bugs against `grammarkdown` for anything you think needs to be added to the parser, checking, etc.: https://github.com/rbuckton/grammarkdown
Nesting more than one AO invocation as parameters to another AO should be disallowed. See https://github.com/tc39/ecma262/pull/1573#issuecomment-553719055
To confirm, it seems like they should be disallowed only because the current macro doesn't handle them, not because they're inherently something we don't want?
The issue there isn't about the macro, it's the fact that there's no evaluation order for the arguments defined, and when they're abstract operations they can have side effects which must be specified to happen in a specific order.
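
A lint for this could look for any invocation whose argument list contains more than one nested invocation, since those are the cases where argument evaluation order matters. The Python sketch below is an assumption about how one might detect that on step text; `ambiguous_calls` and `split_args` are hypothetical helpers, and invocations are again assumed to look like a capitalized name followed by `(`.

```python
import re

# Assumed heuristic for the start of an AO invocation.
NAME = re.compile(r"[A-Z][A-Za-z0-9]*\(")


def split_args(text: str, start: int):
    """Split the argument list that begins at `start` (just after an opening
    parenthesis) into top-level arguments. Returns None if it never closes."""
    depth, args, current = 0, [], []
    for ch in text[start:]:
        if ch == ")" and depth == 0:
            args.append("".join(current).strip())
            return args
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    return None


def ambiguous_calls(step: str) -> list[str]:
    """Names of invocations that take two or more invocations as arguments."""
    found = []
    for match in NAME.finditer(step):
        args = split_args(step, match.end())
        if args is None:
            continue
        nested = [arg for arg in args if NAME.search(arg)]
        if len(nested) > 1:
            found.append(match.group()[:-1])
    return found
```

A single nested invocation, as in `Foo(Bar(_a_), _b_)`, is not reported because there is only one call whose side effects could be observed.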
- In algorithm steps, all aliases should be introduced using `Let` or with a parameter list before they are used. (see tc39/ecma262#1954)
- Forbid introduction of aliases that are never referenced.

In algorithm steps, each alias should only be introduced with `Let` at most once for any possible trace through the algorithm.
> In algorithm steps, all aliases should be introduced using `Let` or with a parameter list before they are used.

There are other ways we introduce aliases. Other than those `Let` and parameter lists (including parameter lists of abstract closures), we also have various permutations on:
- `For each _x_` (`For each element _k_ of _keys_`, `For each integer _k_ that satisfies ...`, etc)
- `Evaluate |Foo| to obtain _x_`, which is used only in regexes and is complicated by the fact that some of the regex operations return multiple results
- `If there exists _x_` (e.g.)
- `the smallest possible integer _k_ not smaller than` (e.g.)
- For `yield` we say something like `when evaluation is resumed with a Completion _resumptionValue_ the following steps will be performed`, which introduces `_resumptionValue_` for use in those steps
- For `MakeDay` we say `Find a value _t_ such that`

Also, we occasionally refer to aliases from other algorithms. For example, the Function constructor says `Let _args_ be the _argumentsList_ that was passed to this function by [[Call]]`.
Also also, in some places we have algorithms (especially but not uniquely in Annex B, e.g. in the Note at the end of Number.prototype.toExponential) where we're actually defining replacements to existing steps in other algorithms, in which case it is not possible to locally determine which aliases are in scope.
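
To show what a check that knows about several of these introducing forms might look like, here is a rough Python sketch. It is deliberately naive: it walks steps in order and ignores branching, so it does not implement the "any possible trace" rule, and the `INTRODUCERS` patterns and the `check_aliases` helper are assumptions covering only four of the forms listed above.

```python
import re

ALIAS = re.compile(r"_([A-Za-z][A-Za-z0-9]*)_")

# Hypothetical patterns for phrasings that introduce a new alias.
INTRODUCERS = [
    re.compile(r"\b[Ll]et _([A-Za-z0-9]+)_ be\b"),
    re.compile(r"\bFor each (?:[A-Za-z-]+ )*?_([A-Za-z0-9]+)_"),
    re.compile(r"\b[Ii]f there exists (?:[A-Za-z-]+ )*?_([A-Za-z0-9]+)_"),
    re.compile(r"\bFind a value _([A-Za-z0-9]+)_ such that\b"),
]


def check_aliases(params: list[str], steps: list[str]):
    """Return (undeclared, unused) alias names for a straight-line algorithm."""
    declared = set(params)
    used = set()
    undeclared = []
    for step in steps:
        introduced = {n for p in INTRODUCERS for n in p.findall(step)}
        for name in ALIAS.findall(step):
            if name in introduced:
                continue  # this mention is the introduction itself
            used.add(name)
            if name not in declared and name not in undeclared:
                undeclared.append(name)
        declared |= introduced
    unused = sorted(declared - used - set(params))
    return undeclared, unused
```

Algorithms that patch steps of other algorithms, as in the Annex B case, would have to be skipped or given their enclosing scope explicitly.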
Also, the things which are currently `console.log` warnings should become linting errors, ideally.
Edit: done in https://github.com/tc39/ecmarkup/pull/229.
We should figure out if we want to use html entities for non-ASCII stuff, and enforce that decision. Currently we use both `&laquo;` and `«`.
Since I'm on a Mac, it's easy to type the actual characters, so I'd prefer that; but if others editing the spec aren't, it might not be as easy.
It's 2020, embrace the Unicode.
> We should figure out if we want to use html entities for non-ASCII stuff, and enforce that decision.

Why not use ASCII quotes and convert automatically at build time? Here is how `markdown-it` does this.
> If an `If` has an `else` on the same line, it is spelled `; else` rather than `, else` or `; else,` or `; otherwise`

Instead of enforcing consistency, how about adding tags for this, e.g.
1. <emu-if><cond>x is 1 and y is 1</cond><then>return 1</then></emu-if>
1. <emu-else>return 0</emu-else>
and
1. <emu-if-multiline><cond>_exponent_ is *+∞*<sub>𝔽</sub></cond><then>
  1. <emu-if><cond>abs(ℝ(_base_)) > 1</cond><then>return *+∞*<sub>𝔽</sub></then></emu-if>
  1. <emu-if><cond>abs(ℝ(_base_)) is 1</cond><then>return *NaN*</then></emu-if>
  1. <emu-if><cond>abs(ℝ(_base_)) < 1</cond><then>return *+0*<sub>𝔽</sub></then></emu-if>
</then></emu-if-multiline>
1. <emu-else-multiline>...</emu-else-multiline>
This would allow enforcing a consistent style, would also allow authors to not have to remember where commas/semicolons go, and would also let you change the style across the entire document anytime (after all `if`s have been converted of course).
> Instead of enforcing consistency, how about adding tags for this

The ecmarkdown page says:
> Some of Ecmarkdown's biggest benefits are when using it to write algorithm steps, without the many formalities HTML requires for list items and inline formatting of common algorithmic constructs.

Adding tags as you suggest would make algorithms harder to write + edit (and read, in source).