espree icon indicating copy to clipboard operation
espree copied to clipboard

Change Request: `Program` range cover whole program

Open overlookmotel opened this issue 8 months ago • 22 comments

Which packages would you like to change?

  • [x] espree
  • [ ] eslint-scope
  • [ ] eslint-visitor-keys

What problem do you want to solve?

The source range for Program is surprising when source code starts or ends with comments.

e.g.:

// foo

x

// bar

Espree (like Esprima) produces an AST where range of Program node excludes whitespace and comments from start and end of file. In example above, the range of Program node is [ 8, 9 ] i.e. the range of the sole statement.

AST explorer - Espree

This differs from Acorn, which produces a range of [0, 17] i.e. the whole of the source code.

AST explorer - Acorn

What is particularly surprising is that with option comment: true you get comments which are outside the range of the program that contains them:

{
  type: 'Program',
  start: 8,
  end: 9,
  body: [
    {
      type: 'ExpressionStatement',
      start: 8,
      end: 9,
      expression: { type: 'Identifier', start: 8, end: 9, name: 'x' }
    }
  ],
  sourceType: 'script',
  comments: [
    { type: 'Line', value: ' foo', start: 0, end: 6, range: [ 0, 6 ] },
    { type: 'Line', value: ' bar', start: 11, end: 17, range: [ 11, 17 ] }
  ]
}

As a further surprise, if the file contains only comments, then the behavior changes:


// foo

This produces a range of [ 0, 8 ]. This does not align with Esprima, which produces range of [ 8, 8 ].

What do you think is the correct solution?

Espree's docs state that:

The primary goal is to produce the exact same AST structure and tokens as Esprima, and that takes precedence over anything else.

Given that Esprima has not received updates for 7 years, and is essentially defunct, I'm not sure whether the maintainers of Espree consider alignment with Esprima still to be the primary goal. I'm also not sure whether that aim, if it still holds, extends to source ranges, or just the AST shape.

What's the correct behavior?

This is arguable. But, in my opinion, it's unintuitive that comments which are part of a program exist in ranges outside of the Program.

The example where the program consists of only comments I think is illustrative. What's the logical range for Program to have in this case, if comments are excluded? [0, 0] or [8, 8]?

The fact that there isn't a clear answer I feel reveals there's a problem with the mental model.

[0, 8] (as Espree currently does) makes most sense to me. But this is inconsistent with Esprima, and also inconsistent with Espree's behavior when there are statements in the program.

Others may feel differently.

Why does this matter?

In addition to the difference from Acorn, @typescript-eslint/parser is also different again - for the first example, it gives a range of [8, 18]! Though the maintainers seem to think Acorn's behavior would be preferable.

AST explorer - TS-ESLint

Personally, I think it's unfortunate that the 3 most depended-on ESTree-compatible parsers Espree, Acorn, and TS-ESLint differ in this respect.

What to do?

One way or another, in my opinion Espree's current behavior is incorrect.

  • In the case which program is only comments, output differs from Esprima, violating Espree's stated goal.
  • Or, if alignment with Esprima is not an aim as far as ranges are concerned, it would make more sense to have a less surprising behavior.

Changing behavior to match Acorn (which personally I'd advocate for) would be a breaking change. Though, given that the 3 most popular ESTree-compatible parsers differ from each in this regard and I'm not aware of anyone complaining about it before, probably not a big one!

Participation

  • [x] I am willing to submit a pull request for this change.

Additional comments

No response

overlookmotel avatar Apr 07 '25 20:04 overlookmotel

Duplicate of #488

fisker avatar Apr 09 '25 13:04 fisker

Thanks for the feedback. As I stated in https://github.com/eslint/js/issues/488#issuecomment-829716430, Espree is the canonical ESLint parser. Any parser wanting to work with ESLint should mimic what Espree does. As such, we need to be careful when making changes.

The decision of where to start the Program node has long since passed. It will remain as it was with Esprima.

I think producing comments that fall outside of the range of Program is probably a bug, as ESTree is meant to have each node wholly contain its children. @eslint/eslint-tsc what do you think?

For reference, here's the code that adjusts the Program location:

https://github.com/eslint/js/blob/2800ec840bfe811e3def5b1a6e15828031cf8fbf/packages/espree/lib/espree.js#L184-L210

I think the issue is that comments are not represented as tokens, so the end location currently only considers non-comment syntax.

nzakas avatar Apr 09 '25 15:04 nzakas

Thanks very much for coming back. Sorry that I didn't see this had been raised before in #488.

Thanks for confirming that the primary goal for Espree remains aligning with Esprima. In that context, of course the answer is clear. The current behavior cannot change.

Only 2 questions remain:

Incompatibility with Esprima

If the program contains only comments e.g.:

// foo

Espree produces a range of [ 0, 8 ]. This does not align with Esprima, which produces range of [ 8, 8 ].

I assume that divergence from Esprima is unintended? If so, should that be considered a bug (fixing it not a breaking change)? Or something to tackle in next major release?

Comments outside of Program

ESTree is meant to have each node wholly contain its children

Personally I don't think that needs to be considered to apply to comments, as comments are not "nodes" in the sense of the ESTree spec, which doesn't say anything about comments (same as tokens are not "nodes").

Often legal comments or "directive" / "pragma" comments appear at top of file, and it'd likely be problematic for users if those comments were omitted.

Espree's NPM docs specifically state that hashbangs are supported. Hashbangs are treated as comments, and by definition must be at top of file. So excluding all comments that lie outside the range of Program would break that support.

overlookmotel avatar Apr 10 '25 10:04 overlookmotel

I assume that divergence from Esprima is unintended? If so, should that be considered a bug (fixing it not a breaking change)? Or something to tackle in next major release?

I think this was unintended as I can't imagine we tested this case when we switched to Acorn. I think this is an indicator that we just need to come to a decision on one question: are comments considered part of Program or not? If yes, then the current behavior for a file with just a comment is correct and we need to adjust what happens when there are comments at the end of a program to make it align. If no, then we should fix the behavior of the one-comment case and leave the other case alone.

nzakas avatar Apr 10 '25 15:04 nzakas

are comments considered part of Program or not?

In my view they are. But you've clearly stated that the goal is to align with Esprima, so I thought that answered the question - Esprima does not consider comments to be part of the Program.

overlookmotel avatar Apr 10 '25 20:04 overlookmotel

@overlookmotel As I noted, there is an inconsistency in the implementation. Trailing comments are not treated as part of the program but leading comments are. I'm seeking to resolve that inconsistency.

nzakas avatar Apr 11 '25 14:04 nzakas

@nzakas I don't see an inconsistency between how leading comments are treated vs how trailing comments are treated.

I believe Program's range currently works as follows:

  1. If there are no tokens, i.e., the source code is only whitespace and/or comments, then the Program's range is the whole source code.
  2. If there's at least one token, then the Program's range starts from the first token's start and ends at the last token's end. Whitespace and comments before the first token and after the last token are not included in the Program's range.

Is that correct? If not, can you provide an example where there's an inconsistency between how leading and trailing comments are treated?

mdjermanovic avatar Apr 11 '25 18:04 mdjermanovic

Sorry, I don't think I explained the inconsistency I'm seeing very well. Let me try again.

For instance, here's an example with just a comment: https://astexplorer.net/#/gist/7896dbd46cd027a8414f45d6c5c5e763/055c0d167a560b52c8cc48a614aacfb50ef07906

In this case, the Program range is 0 to 181.

If I add some code underneath it, the range changes to 178 to 182: https://astexplorer.net/#/gist/4acf3a930a735d191dd9891280a0a7b9/d3d4bc7af2a20c476676b9f4fc502714eebf6e63

That's the inconsistency I'm concerned with. It seems like the start of the program should be the start of the program. (Similarly, the end of the program should be the end of the program.) But we are counting leading and trailing comments differently in these situations.

nzakas avatar Apr 15 '25 19:04 nzakas

Thanks for coming back. I believe the only change required to resolve this inconsistency is as follows:

  • If the file contains no tokens (no statements), set Program range start and end both to the length of the source text.

i.e. in your first example, Program's range would be [181, 181].

This would match Esprima's behavior.

There are other alternative solutions which could also resolve the inconsistency, but they would not align with Esprima, which I've understood from your previous comments is of paramount importance.

I hope I've made sense this time!

overlookmotel avatar Apr 16 '25 19:04 overlookmotel

For instance, here's an example with just a comment: https://astexplorer.net/#/gist/7896dbd46cd027a8414f45d6c5c5e763/055c0d167a560b52c8cc48a614aacfb50ef07906

In this case, the Program range is 0 to 181.

If I add some code underneath it, the range changes to 178 to 182: https://astexplorer.net/#/gist/4acf3a930a735d191dd9891280a0a7b9/d3d4bc7af2a20c476676b9f4fc502714eebf6e63

That's the inconsistency I'm concerned with. It seems like the start of the program should be the start of the program. (Similarly, the end of the program should be the end of the program.) But we are counting leading and trailing comments differently in these situations.

I still don't see an inconsistency between leading and trailing comments. If there are no tokens, the range spans from the start to the end of the source code, whether the source code content is whitespace-only or contains one or multiple comments. If there is at least one token, the range spans from the start of the first token to the end of the last token, and neither leading nor trailing comments are included.

For example, in the following case the Program range is [0, 56]. The whole source code has 56 code units, and all of the leading comment, trailing comment and whitespace are included in the range:

/**
 * Leading comment
 */

/**
 * Trailing comment
 */

https://astexplorer.net/#/gist/675eeef0df636f1a0297bd48acb29f74/f8e6977b4274852d115199907142b967b1a6e081

While in the following case the Program range is [28, 31]. It's the range of foo, and neither leading nor trailing comments are included:

/**
 * Leading comment
 */

foo

/**
 * Trailing comment
 */

https://astexplorer.net/#/gist/a0cc72f7d660b457ef68bbea44e6eae4/4f83a3bba3a633e3d03a56b5c2b8c4583e9bcf6e

mdjermanovic avatar Apr 17 '25 18:04 mdjermanovic

@mdjermanovic okay, you've lost me a bit. The example you show is exactly the inconsistency I'm talking about. It doesn't make sense that comments and whitespace are considered part of the program in one case and not in the other. I don't see why having one token should change how we're calculating the start and end ranges of the program.

nzakas avatar Apr 18 '25 21:04 nzakas

I think the root of this confusion goes back to my original point that the mental model doesn't entirely hang together. Consequently, different people can come at this question from different directions and reach different answers.

However, it's a stated goal of Espree to align with Esprima, so we need to find a solution which both:

  1. Is internally consistent, and
  2. Matches Esprima.

The rule that Esprima follows is that Program span is the start of first token to end of last token. Esprima's interpretation of this rule when the program has no tokens is that Program's range has to be zero length. They have chosen that zero-length range to be at end of source text ([56, 56]) rather than at start ([0, 0]). That's an arbitrary choice, but it's consistent.

So, in my view, Esprima's schema is internally consistent. Therefore the only option which satisfies both criteria is to align with Esprima's behavior and make the range for an empty program [<length of source text>, <length of source text>].

Other interpretations are also valid, but they fail on the 2nd criteria.


Side note: If we were starting from ground zero and didn't have to care about breaking changes or alignment with Esprima, I'd advocate ripping it all up and making Program's span always be [0, <length of source text>]. That'd also be internally consistent, and to my mind more intuitive. But that's not an option, unfortunately.

overlookmotel avatar Apr 22 '25 08:04 overlookmotel

@overlookmotel Thanks for the summary. At this point, I'm not sure we need to align this with Esprima, that's why I've opened this up for larger discussion. Espree is used far more than Esprima, so we have more license around the edges. We can't change things like the AST format or tokens format as that would break too much compatibility. Because we are already not following Esprima for Program range means that any change we make is technically breaking so I'd like to make one that is consistent and, ideally, what we would want if starting from zero.

nzakas avatar Apr 22 '25 14:04 nzakas

Ah, sorry I'd misunderstood. Given that this thread has become lengthy, let me summarize my personal view on what I think would be the ideal if we were starting from scratch.

The problems:

  • It's weird/surprising that comments which are part of a program have ranges which are outside the range of the Program itself.
  • It's ambiguous what is the "right" range for a Program containing no tokens, if we follow the current rule that Program starts on first token, and ends on last token.
  • It's unfortunate for interoperability that Espree and Acorn disagree on this.

I propose that in all cases:

  • Program range start is 0.
  • Program range end is length of the source text.
  • loc follows the same pattern. i.e.:
    • loc.start is always { line: 1, column: 0 }
    • loc.end.line is the number of lines in the source text.
    • loc.end.column is the number of characters in last line of source text.

This would be different from Esprima, but would align with Acorn (and also Babel).

For what it's worth, TS-ESLint parser maintainers also seem to think this would be preferable.

I will now shut up and let others give their views!

overlookmotel avatar Apr 23 '25 10:04 overlookmotel

I propose that in all cases:

* `Program` range start is 0.

* `Program` range end is length of the source text.

* `loc` follows the same pattern. i.e.:
  
  * `loc.start` is always `{ line: 1, column: 0 }`
  * `loc.end.line` is the number of lines in the source text.
  * `loc.end.column` is the number of characters in last line of source text.

Yes, that's what I think as well.

nzakas avatar Apr 23 '25 14:04 nzakas

If we agree to update the range of Program, can we reconsider updating the range of TemplateElement too? The location covers ${ and } doesn't make sense.

fisker avatar Apr 23 '25 16:04 fisker

@fisker please open a separate issue for that.

nzakas avatar Apr 24 '25 15:04 nzakas

No, I won't. It's none of my business, just a reminder.

fisker avatar Apr 24 '25 23:04 fisker

@eslint/eslint-tsc what do you think about https://github.com/eslint/js/issues/648#issuecomment-2824478224?

nzakas avatar May 13 '25 14:05 nzakas

I propose that in all cases:

* `Program` range start is 0.

* `Program` range end is length of the source text.

* `loc` follows the same pattern. i.e.:
  
  * `loc.start` is always `{ line: 1, column: 0 }`
  * `loc.end.line` is the number of lines in the source text.
  * `loc.end.column` is the number of characters in last line of source text.

Sounds good to me. And I think this should be considered a breaking change because it could disrupt code that expects leading and trailing comments to be excluded from the range of the Program node. I was tempted to write such code in the past (https://github.com/eslint/eslint/pull/16889#discussion_r1141987590), so I wouldn't be surprised if any consumers were relying on the current behavior.

fasttime avatar May 14 '25 04:05 fasttime

I'm in favor of this change as I think it makes the most sense that the root node always represents the whole of the source text, and this would introduce consistency between cases where there are tokens in the code and cases where there aren't any tokens but just whitespace and/or comments. This, however, also introduces one inconsistency: Program node becomes the only node whose range doesn't necessarily start with the range of its first token. But we could say that the Program node is a special case in that regard.

This would definitely be a breaking change for ESLint-compatible custom parsers, as they should align, and for rules that rely on the current Program.range behavior (either by using Program.range directly, or indirectly through SourceCode methods, e.g., sourceCode.getCommentsBefore(programNode) to get all leading comments).

We should also check if it would be necessary to change something in SourceCode and TokenStore implementations, because this breaks the assumption that nodes always start with tokens.

mdjermanovic avatar May 14 '25 11:05 mdjermanovic

Okay, marking as accepted. We should definitely prototype this change and try integration with ESLint to see what breaks in order to get an idea for the extent of breakage.

Anyone want to volunteer?

nzakas avatar May 14 '25 15:05 nzakas

TSC group has confirmed this for v10

sam3k avatar Jul 29 '25 01:07 sam3k

👋 Hi! This issue is being addressed in pull request https://github.com/eslint/js/pull/677. Thanks, @Pixel998!

eslint-github-bot[bot] avatar Sep 15 '25 07:09 eslint-github-bot[bot]