
Preprocessing API rework

Open dummdidumm opened this issue 2 years ago • 21 comments

Rendered

@kaisermann @arxpoetica @happycollision I'd like to hear your thoughts on this. Surely I missed some other pain points that need to be addressed.

dummdidumm avatar Jul 10 '21 10:07 dummdidumm

This looks very promising indeed and will also simplify the preprocessors quite a lot. I'm travelling right now, but will read it again once I'm back.

kaisermann avatar Jul 11 '21 09:07 kaisermann

looks like a very good idea. i think the current design is due to a hope that plugins would compose nicely with a predictable order, but i think in practice composing within one step like this would achieve the same goal + be far more flexible.

swyxio avatar Jul 14 '21 21:07 swyxio

Some things i remember from working with preprocessor arrays in svite and vite-plugin-svelte:

  1. a thing without a name is hard to find. It would be great if preprocessors had a unique name. That would also be helpful when reporting errors

  2. The ability to inspect the capabilities of preprocessors can be helpful in some situations (e.g. to find out if there is a preprocessor for the TypeScript transform)

So maybe, instead of a plain function, an object signature would be helpful?

{
  name: 'foo',
  preprocess: function,
  capabilities: tbd
}

dominikg avatar Jul 19 '21 14:07 dominikg

{
  name: 'foo',
  preprocess: function,
  capabilities: tbd
}

I like the idea of preprocessors creating some kind of registration object that contains information about the preprocessor itself. Perhaps at this stage, we don't worry so much about the capabilities part, since that could be added without a major version release. But paving the way for these identifiers is probably a good thing.

(To be clear, I only suggest skipping on capabilities because that seems quite a complex problem to solve and feels out of scope for this RFC. Not that it wouldn't give us some interesting capabilities.)

happycollision avatar Jul 27 '21 01:07 happycollision

Thanks for all your suggestions! I like the idea of making the new preprocessor API an object, too, with a name property - tooling and libraries can certainly benefit from this in some scenarios. I changed the API proposal accordingly. I also added the idea of isTopLevel for script/style, which a pending PR would make possible (https://github.com/sveltejs/svelte/pull/6611). I also enhanced the "how we teach this" section with the blog post idea.

dummdidumm avatar Aug 16 '21 13:08 dummdidumm

I really like this way of writing preprocessors, feels more streamlined and predictable. I have some questions below:

If we do have https://github.com/sveltejs/svelte/pull/6611 expose a public API, do we need extractScripts and extractStyles? Instead, we can do something like const { scripts, styles } = svelte.extract(code) so the parser only runs once.
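
For illustration, consuming such a helper might look like the sketch below - both `extract` and its return shape are hypothetical, not existing svelte/compiler exports:

import * as svelte from 'svelte/compiler';

const code = '<script lang="ts">export let answer: number;</script><style>p { color: red; }</style>';

// Hypothetical: one parse, both block lists returned. Neither `extract` nor
// this exact entry shape ({ start, end, content, attributes }) exists today.
const { scripts, styles } = svelte.extract(code);

for (const script of scripts) {
  console.log(script.attributes.lang ?? 'js', code.slice(script.start, script.end));
}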

Regarding replaceInCode, it looks very similar to magic-string's overwrite() function, but it also handles replacing with sourcemaps (magic-string doesn't). I wonder if it's possible or makes sense to have this functionality directly in magic-string, and have svelte/preprocess re-export magic-string similar to how it currently re-exports estree-walker? It would encourage authors to support sourcemaps as well.

Another concern with replaceInCode is that once it generates a map (e.g. for a script) and we also alter the markup, which generates another map, how do we combine these two maps? Should we split into two different preprocessors? I'd hope not to have to do that, to keep the idea of a "svelte preprocessor" instead of a "script preprocessor" or "style preprocessor".

Also, about the removal of the lang="..." attribute: it looks like the RFC would help solve this by giving start and end values for attributes, but I wonder how a preprocessor would be able to remove them, given the script preprocessor example in the readme (probably related to the paragraph above). Or would this be out of scope for the RFC?

With the above said, I'm not too much a sourcemap wizard so I might be babbling stuff. Anyways, really looking forward to this!

bluwy avatar Aug 19 '21 04:08 bluwy

Quoting @pngwn from Discord:

I think we could leave the raw source preprocessor step (for stuff like Pug) in place but add an 'AST' hook - some step that has to take in some source and spit out a Svelte-compliant AST; subsequent hooks could perform transformations on that AST.

I guess it would need to be more complex than that:

  • Template preprocessors would need the raw source and need to output a string -> first step i guess
  • CSS preprocessors need the raw CSS, so an AST, but before the CSS has been parsed - this is doable: visit(n => n.type === 'style')
  • Script preprocessors need access to the raw script content and the contents of all expressions - this is possible for the common case (TS) but impossible to do in a language-agnostic manner due to syntax ambiguities. Again it would be nice to be able to just do visit(n => n.type === 'script' || n.type === 'expression'). It also raises the issue that if you are parsing just to capture the string with a TS-aware scanner, then you may as well parse it fully at that point, so it is a bit pointless.

Those preprocessors could perform transformations on those nodes and pass an AST to the compiler at the end.

So essentially preprocessing would be a two-stage process:

  1. string->string transformations for everything where the html part cannot be parsed into a Svelte AST
  2. AST*->AST* transformations for everything else where the html part is already parsed into a Svelte AST, but can be transformed, minus the script/style which is still in raw string form

The problem with that is that, AFAIK, there are already preprocessors which rely on the script being transformed but not on the html part, because they transform the markup themselves. So maybe that two-stage process could be loosened so that we can order 1/2 like we want, and the preprocessing pipeline detects whether this is an AST->AST or string->string preprocessor and prepares the input accordingly (transforming the AST back into a string, for example).

dummdidumm avatar Sep 02 '21 12:09 dummdidumm

This might also be a good opportunity to include an API for preprocessors to contribute to the warnings and errors similar to those output by svelte.compile. This allows tooling (e.g. VS Code) to pick them up accordingly. Integration with the svelte-ignore comments might also feature somewhere.
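
For illustration, such a contribution could look roughly like the sketch below - the `warnings` property is hypothetical, while the individual warning shape loosely mirrors what svelte.compile emits today (code, message, positions):

// Hypothetical: `warnings` is not part of the current preprocess contract.
const preprocessor = {
  name: 'todo-checker',
  preprocess: ({ code, filename }) => ({
    code,
    warnings: [
      {
        code: 'todo-checker/missing-issue-link', // would hook into svelte-ignore comments
        message: 'TODO comments should link to an issue',
        filename,
        start: { line: 3, column: 4 },
        end: { line: 3, column: 30 }
      }
    ]
  })
};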

kwangure avatar Jul 07 '22 21:07 kwangure

So maybe that two-stage process could be loosened so that we can order 1/2 like we want, and the preprocessing pipeline detects whether this is an AST->AST or string->string preprocessor and prepares the input accordingly (transforming the AST back into a string, for example).

We probably don't want to be transforming back and forth more than necessary. Maybe having various hooks at various stages of the pipeline, like Rollup does, would help? E.g. transformPreAST, transformAST, transformPostAST.
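
For illustration, a preprocessor written against such hooks might look like this; the hook names are the ones floated above and purely speculative:

// Speculative hook names from the comment above - none of these exist today.
const preprocessor = {
  name: 'multi-phase-example',
  // string -> string, before the markup can be parsed (e.g. Pug-style templates)
  transformPreAST: ({ code, filename }) => ({ code }),
  // AST -> AST, with script/style still present as raw strings inside the tree
  transformAST: ({ ast }) => ({ ast }),
  // string -> string again, after the AST has been serialized back out
  transformPostAST: ({ code }) => ({ code })
};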

benmccann avatar Aug 23 '22 04:08 benmccann

I like this but I think we can go simpler — if we exposed a proper parser and required the input to that parser to adhere to certain rules, then I don't think we need extractStyles and extractScripts functions or an opinionated way to handle replacements — we could just expose the parser like so:

import MagicString from 'magic-string';
import { parse } from 'svelte/compiler';

const preprocessor = {
  name: 'foo-to-bar',
  preprocess: ({ code, filename }) => {
    const ast = parse(code);
    const result = new MagicString(code);

    // this is all we need to get top-level scripts. no faffing
    const top_level_scripts = ast.body.filter((element) => element.name === 'script');

    for (const script of top_level_scripts) {
      const { start, end, value } = script.children[0]; // <script> will only ever have one child, a text node
      for (const match of value.matchAll('foo')) {
        result.overwrite(
          start + match.index,
          start + match.index + match[0].length,
          'bar',
          { storeName: true }
        );
      }
    }

    return {
      code: result.toString(),
      map: result.generateMap()
    };
  }
}

Requirements for using parse:

  • the input is HTML-like. if you're the kind of self-flagellating maniac who wants to use Pug, then bring your own parsing logic
  • the contents of curlies inside templates adhere to some basic rules, i.e. {/} characters must be balanced unless they are in something universally recognised as a string or a comment. so {"{"} would be treated as a valid expression, but {{} would not
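
To make that balance rule concrete, here is a minimal sketch (not the proposed parser - regex literals and other syntax ambiguities are deliberately ignored) of finding the end of a mustache expression while skipping braces inside strings and comments:

// Minimal sketch of the "balanced braces, except inside strings and comments" rule.
// Returns the index just past the matching `}` for an expression whose `{` sits
// at index `open`, or -1 if the braces never balance.
function findExpressionEnd(source, open) {
  let depth = 0;
  let i = open;
  while (i < source.length) {
    const char = source[i];
    if (char === '"' || char === "'" || char === '`') {
      // skip string contents, honouring backslash escapes
      i += 1;
      while (i < source.length && source[i] !== char) {
        if (source[i] === '\\') i += 1;
        i += 1;
      }
    } else if (char === '/' && source[i + 1] === '/') {
      while (i < source.length && source[i] !== '\n') i += 1;
    } else if (char === '/' && source[i + 1] === '*') {
      i += 2;
      while (i < source.length && !(source[i] === '*' && source[i + 1] === '/')) i += 1;
      i += 1;
    } else if (char === '{') {
      depth += 1;
    } else if (char === '}') {
      depth -= 1;
      if (depth === 0) return i + 1;
    }
    i += 1;
  }
  return -1;
}

// findExpressionEnd('{"{"} rest', 0) === 5, whereas '{{}' never balances and returns -1.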

(I think there's an interesting conversation to be had separately about whether we want to support TypeScript syntax in curlies natively, but I don't know that we need to resolve that here.)

Rich-Harris avatar May 16 '23 16:05 Rich-Harris

https://github.com/sveltejs/svelte/pull/6611 is related to that as it parses the mustache tags accordingly.

Does your proposal mean that we just make the existing parse method be able to deal with anything inside mustache tags? What effect does this have on the AST being semi-public API? Would there be a negative impact on performance doing it this way (because maybe the parser is slower than the regex)? Though the answer to that question may just be "keep using the regex then, you're free to do whatever".

dummdidumm avatar May 16 '23 16:05 dummdidumm

Does your proposal mean that we just make the existing parse method be able to deal with anything inside mustache tags? What effect does this have on the AST being semi-public API?

It's honestly quite problematic that the AST isn't considered public API. I think Svelte 4 would be a good opportunity to correct that.

A noteworthy thing here of course is that parse currently parses JS expressions inside curlies and <script>. The approach I suggested above would basically just pass through the contents as a raw string; if anything (including Svelte itself) wanted to operate on those contents it would need to parse the expression in a separate step.

We would also have to be prescriptive about control flow — in order to support things like a preprocessor that converted {#switch ...} to the equivalent {#if ...} blocks (which I think is desirable) then we would presumably need to have some feature-agnostic way to express that, so for example this...

{#if error}
  <p>oh no!</p>
{:else if vibes === "cool"}
  <p>everything is cool</p>
{:else}
  <p>no error, but the vibes are off</p>
{/if}

{#each things as thing, i (thing.id)}
  <p>{thing.name}</p>
{:else}
  <p>no things</p>
{/each}

{#switch color}
  <p>default case</p>
{:case "red"}
  <p>red</p>
{:case "blue"}
  <p>blue</p>
{:case "green"}
  <p>green</p>
{/switch}

...might be represented thusly:

[
  {
    type: 'block',
    kind: 'if',
    branches: [
      {
        value: 'error',
        children: [...]
      },
      {
        value: 'else if vibes === "cool"',
        children: [...]
      },
      {
        value: 'else',
        children: [...]
      }
    ]
  },
  {
    type: 'block',
    kind: 'each',
    branches: [
      {
        value: 'things as thing, i (thing.id)',
        children: [...]
      },
      {
        value: 'else',
        children: [...]
      }
    ]
  },
  {
    type: 'block',
    kind: 'switch',
    branches: [
      {
        value: 'color',
        children: [...]
      },
      {
        value: 'case "red"',
        children: [...]
      },
      {
        value: 'case "blue"',
        children: [...]
      },
      {
        value: 'case "green"',
        children: [...]
      }
    ]
  }
]

In other words each {#<word> ...} opener would need to be matched by a {/<word>} closer, and each {:...} would open a new branch, but beyond that any judgement of 'correctness' would be the responsibility of the compiler.
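
Under that representation, the {#switch ...} to {#if ...} conversion mentioned above becomes a small transformation over block nodes - a rough sketch (the block shape is the hypothetical one above; serialize is an assumed helper that renders child nodes back to markup):

// Sketch only: assumes the hypothetical { type: 'block', kind, branches } shape above,
// plus a `serialize(children)` helper (not defined here) that renders child nodes
// back to markup. Assumes at least one {:case} branch.
function switchToIf(block, serialize) {
  const [subject, ...cases] = block.branches; // first branch holds 'color' and the default children
  const parts = [];
  for (const branch of cases) {
    const test = branch.value.replace(/^case\s+/, ''); // e.g. 'case "red"' -> '"red"'
    const keyword = parts.length === 0 ? '#if' : ':else if';
    parts.push(`{${keyword} ${subject.value} === ${test}}`, serialize(branch.children));
  }
  parts.push('{:else}', serialize(subject.children), '{/if}');
  return parts.join('\n');
}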

(Speaking of correctness: should the parser be fault-tolerant? Is it beneficial for things like editor integrations if we're able to generate an AST from partial content?)

Would there be a negative impact on performance doing it this way (because maybe the parser is slower than the regex)? Though the answer to that question may just be "keep using the regex then, you're free to do whatever".

Exactly — if the contract is as simple as ({ code, filename }) => Promise<{ code, map, dependencies }> then people are free to do whatever they like. parse would just be a convenience API.
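
Spelled out as a type, that minimal contract would look roughly like this (only name, code, filename, map and dependencies come from the thread; the rest, including allowing a synchronous return, is an assumption):

// Rough TypeScript rendering of the contract discussed above.
interface PreprocessInput {
  code: string;
  filename?: string;
}

interface PreprocessResult {
  code: string;
  map?: object | string; // any standard source map representation
  dependencies?: string[]; // files that should trigger re-preprocessing when changed
}

interface Preprocessor {
  name: string;
  preprocess: (input: PreprocessInput) => PreprocessResult | Promise<PreprocessResult>;
}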

Rich-Harris avatar May 16 '23 18:05 Rich-Harris

Love the idea of simplifying down to just a one-stage process as I've also run into some pain with the current three-stage process.

I don't think we need extractStyles and extractScripts functions or an opinionated way

This is fine, but could you include a small example in the future docs of how to target these blocks using regex (location, attributes, and content), since you've already been through the ins and outs of it? In large apps, the speed difference between the AST and regex approaches is significant, and I think you would save a lot of users a lot of time by giving some examples (including something like the AST example shown above as well).
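
For context, the kind of regex-based approach being asked about might look something like this rough sketch - fine for well-formed components, but it will break on edge cases such as a closing tag inside a string:

import MagicString from 'magic-string';

// Naive but fast: grab top-level <script> blocks with a regex instead of parsing.
// `[^>]*` captures the attributes, `[\s\S]*?` is the (non-greedy) content.
const SCRIPT_RE = /<script([^>]*)>([\s\S]*?)<\/script>/gi;

function replaceFooInScripts(code) {
  const result = new MagicString(code);
  for (const match of code.matchAll(SCRIPT_RE)) {
    const content = match[2];
    const contentStart = match.index + match[0].indexOf('>') + 1; // attributes cannot contain '>'
    // illustrative transform: rename `foo` to `bar` inside the script only
    for (const foo of content.matchAll(/\bfoo\b/g)) {
      result.overwrite(contentStart + foo.index, contentStart + foo.index + 3, 'bar');
    }
  }
  return { code: result.toString(), map: result.generateMap({ hires: true }) };
}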

jacob-8 avatar May 17 '23 03:05 jacob-8

(Speaking of correctness: should the parser be fault-tolerant? Is it beneficial for things like editor integrations if we're able to generate an AST from partial content?)

Yes, it would be very beneficial for editor integrations, since right now every syntax error makes the whole file completely different in the underlying TS representation. But I don't think we need to solve this as part of the preprocessor rework.

In general I feel like the comments are moving away from just the preprocess rework towards a parser rework, which is more involved and, if not put behind some flag or a different parser method, would be a breaking change. Right now parse returns the AST in a specific form as you said, including the JS AST, and Prettier and probably also ESLint depend on that.

From what you're getting at it sounds like you want something similar to what @pngwn did with the svast specification and its implementation.

dummdidumm avatar May 17 '23 09:05 dummdidumm

But I don't think we need to solve this as part of the preprocessor rework.

I was wondering if we should implement the parser using something like Lezer, since we'd then get error tolerance for free

Rich-Harris avatar May 17 '23 11:05 Rich-Harris

Lezer docs talk about its limitations/tradeoffs:

Lezer syntax trees are not abstract, they just tell you which nodes were parsed where, without providing additional information about their role or relation (beyond parent-child relations). This makes them rather unsuited for some purposes, but quick to construct and cheap to store.

I'm not familiar with the details, but my impression is that Lezer is not meant for much beyond syntax highlighting.

kwangure avatar May 17 '23 12:05 kwangure

I was wondering if we should implement the parser using something like Lezer, since we'd then get error tolerance for free

My fear would be that this results in slower parsing - it would be nice if it were the other way around. @pngwn's work AFAIK was much faster (sacrificing readability for it because it's all one big file).

dummdidumm avatar May 17 '23 12:05 dummdidumm

On the topic of error tolerance/Lezer, you could have detailed error tolerance (i.e. custom recovery heuristics specific to each error). This allows for faster recovery (recovering fewer tokens away), better user-facing errors, and enables things like autofix and VS Code code actions, but at the expense of memory, speed, bundle size, etc.

On the other hand, simple heuristics can get you pretty far. For example, I implemented a Svelte parser that has error recovery for some cases, usually resuming parsing at the next < which starts a new open tag. That one heuristic works well enough for common cases and could probably be shoehorned onto the current implementation of the parser. It comes down to finding a few good recovery heuristics.

About the Svelte parserer I started (but haven't finished): I've been prototyping a state machine library (https://github.com/kwangure/hine) and to test it, I built a Svelte parser (https://github.com/kwangure/parserer). It resumes parsing at the next <. ASTs with errors look like this:

{
    "html": {
        "start": 0,
        "end": 13,
        "type": "Fragment",
        "children": [
            {
                "error": {
                    "code": "invalid_attribute_name",
                    "message": "Expected a valid attribute character but instead found '+'",
                    "start": 6,
                    "end": 13,
                    "type": "Invalid"
                },
                "start": 0,
                "end": 13,
                "type": "Element",
                "name": "div",
                "attributes": [],
                "children": []
            }
        ]
    }
}

I just made the Parserer repo public for the purpose of this comment, but it was never meant for public consumption, so it has rough edges and YMMV if you choose to run it. Neither Hine nor Parserer is primetime-ready.

Also, state machines are good.

kwangure avatar May 17 '23 13:05 kwangure

We want to tackle the new preprocessor API for Svelte 4. This would probably only include the { name: string; preprocess: Function } part of the proposal; the convenience methods etc. are left out for now, as we might want to tackle those differently.

One major advantage of the existing API that I only realized now is that it handles merging of source maps etc. for you. It takes care of extracting the source map and transforming it so that you don't need to wrestle with all that stuff yourself. For example, if you run TypeScript on the script code, you get back something like

export const foo;
//# sourceMappingURL=..

and Svelte's preprocess function takes care of removing that comment, locating and loading the source map and then merging it with the map for the surrounding markup so that one final source map is created - all of it carefully crafted for optimal performance. I wouldn't want to put the burden of implementing something like that on every individual preprocessor author.
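
If that burden did move to preprocessor authors, the map-chaining part could be delegated to something like the @ampproject/remapping package - a sketch under the assumption that the script's map has already been offset into whole-component coordinates, which is exactly the fiddly part Svelte currently hides:

import remapping from '@ampproject/remapping';

// Sketch: `markupMap` maps the final component back to the spliced component,
// `scriptMap` maps the spliced script back to the original TypeScript.
// The most recent transformation comes first in the array.
function mergeMaps(markupMap, scriptMap) {
  const merged = remapping([markupMap, scriptMap], () => null);
  return JSON.parse(merged.toString());
}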

So maybe the existing API isn't that bad overall? Maybe just the order needs to be switched so preprocessors run in sequence, and each preprocessor entry can no longer have multiple of markup, script and style but instead has to choose one? That would also allow for other options like "remove the lang tag" on the preprocessor object.
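
For illustration, such a per-phase entry could look roughly like this (the overall shape and the removeLangAttribute flag are hypothetical; the script signature matches today's API, and transformTypescript is just a stand-in):

// Hypothetical "one phase per entry" shape - entries would run strictly in array order.
const typescriptEntry = {
  name: 'typescript',
  // exactly one of markup | script | style per entry
  script: async ({ content, attributes, filename }) => {
    if (attributes.lang !== 'ts') return;
    // transformTypescript is a stand-in for the actual transform; returns { code, map }
    return transformTypescript(content, filename);
  },
  // per-entry options become possible, e.g. "remove the lang tag"
  removeLangAttribute: true
};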

One thing I don't see how to do through that API yet: once we allow TypeScript inside mustache tags, you have the whole source mapping problem again - at which point some kind of convenience method is needed. So maybe the solution is to implement the new API, but with a function which takes care of at least that?

If we decide to deprecate the existing API and go with the new one, we will add more documentation around creating a preprocessor, with some code snippets including the regex approach, to ease onboarding. The old API will keep functioning, but a warning will be emitted that it will stop doing so in Svelte 5.

dummdidumm avatar May 19 '23 14:05 dummdidumm

Removing PreprocessorGroup makes it a bit harder for preprocessors like vitePreprocess to be easily added in one line.

To avoid users having to add multiple imported parts, it would help to allow returning arrays that get flattened, similar to how vite handles plugins.
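
For illustration, a factory in the style of vitePreprocess could then return an array while the user-facing config stays a one-liner (the flattening behaviour and the names below are the hypothetical part):

// A factory returning several single-purpose preprocessors, Vite-plugin style.
function myPreprocessorPack() {
  return [
    { name: 'pack:script', preprocess: ({ code }) => ({ code }) },
    { name: 'pack:style', preprocess: ({ code }) => ({ code }) }
  ];
}

// svelte.config.js: nested arrays would be flattened by the pipeline,
// similar to how Vite flattens `plugins: [somePlugin()]`.
export default {
  preprocess: [myPreprocessorPack(), { name: 'other', preprocess: ({ code }) => ({ code }) }]
};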

Handling TypeScript inside template strings is a tricky situation. It needs to be done in a single modification + sourcemap so that we can continue with other preprocessors and merge everything at the end. How do we avoid frequent string operations on the whole code? Going with a flat list of preprocessors and allowing markup after script/style means we would have to reparse the script/style content after every markup preprocessor to be able to provide it to the next one.

Having separate script and style preprocessors is also very convenient for preprocessor development and allows more performance optimizations like running script and style preprocessors in parallel.

I'd love it if there were a way to keep a "phases" approach - maybe limit it to markup and AST phases and try to come up with a way to pass along the AST and maybe MagicString instances to avoid having to redo a lot of work.

On that note, what about making returning sourcemaps and dependencies mandatory?

dominikg avatar May 20 '23 21:05 dominikg

Could you elaborate on that performance optimization thing? How would that help? Right now everything's run sequentially and things are re-parsed on every preprocessor step. Are you suggesting we change that to speed things up? What would that look like? Would a subprocess or thread really make things faster here, assuming you spawn one for every file?

I agree with you that script/style preprocessors are very convenient to write right now because the plumbing is done for you already - which is why I think going with Rich's proposal wouldn't suffice. The original RFC proposal would make it almost as convenient as it is now, albeit with a bit more code.

dummdidumm avatar May 22 '23 09:05 dummdidumm