nuxt-security icon indicating copy to clipboard operation
nuxt-security copied to clipboard

perf: avoid cheerio in favor of regex

Open GalacticHypernova opened this issue 1 year ago • 38 comments
trafficstars

Types of changes

  • [ ] Bug fix (a non-breaking change which fixes an issue)
  • [ ] New feature (a non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Description

This PR integrates native solution via regex to load and parse HTML, in order to improve runtime performance. Closes #341

Checklist:

  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [ ] I have added tests to cover my changes (if not applicable, please state why)

GalacticHypernova avatar Mar 23 '24 21:03 GalacticHypernova

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nuxt-security ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 10, 2024 4:55pm

vercel[bot] avatar Mar 23 '24 21:03 vercel[bot]

Hey @GalacticHypernova @Baroshem

Here is a description of how it works. The 3 main routines are 03-subresourceIntegrity, 04-cspSsgHashes and 99-cspSsrNonces. Each one is executed consecutively in this order

  • 03-subresourceIntegrity needs to find all external script tags and detect the name of the file. Integrity hashes were already pre-computed at build time so once the name is found, the integrity tag can be inserted immediately. The regex needs to make the difference between external scripts (<script ... src="filename.js" .../>) and inline scripts (<script ...>//some code</script>) because only external resources can carry integrity. Basically the endgame is to transform <script ... src="filename.js" .../> into <script ... src="filename.js" ... integrity="xxx"/>, so this modifies the HTML source. Same for external links, except that not all links are eligible so the regex should also parse the rel attribute of the link.

  • 04-cspSsgHashes is a bit of the contrary. Only inline scripts are hashed so the regex needs to do the opposite: only find inline elements and ignore external ones. The problem is that integrity hashes for external resources must also be added to the CSP header in the SSG case, so we still need to parse the external scripts for that purpose; However this is not extremely difficult because at the previous step all integrity tags were already inserted so it is easy to parse and find the value. Same for inline styles and links. In short, 04-cspSssgHashes does not modify the HTML, it only collects the hashes that need to be inserted in the CSP directives.

  • 99-cspSsrNonces does not need to make the difference between external resources and inline elements, because the nonce is added to all elements. The integrity hashes were already added in the first step, and they don't need to be repeated in the CSP header so we don't care about integrity tags at all here. 99-cspSsrNonces modifies the HTML source by inserting the nonce attribute to the relevant tags.

Now at a higher level, these three routines are called at the time the html is rendered, with the render:html hook. This hook has an html variable which is an object with keys such as head, body, etc. Normally the scripts should be in head but I found out that sometimes they can be in other sections, so for the sake of completeness we inspect all sections.

Cheerio is used to parse the content of each section and find out the relevant parts of the HTML that need to be modified. Cheerio takes an HTML string as input, and transforms it into an AST tree of nodes. It then provides very convenient utilities to find elements by tag, make the difference between external and inline, inspect the attributes, modify the nodes, etc. But it is too slow.

This is why I added 02a-preprocessHtml and 99b-recombineHtml : it avoids parsing the same sections each time. So 02a-preprocessHtml starts by loading each of the sections in the cheerio parser, via cheerio.load. Each parsed section is saved under context.cheerios, which is just a temporary object to hold these results. Then 03-subresourceIntegrity, 04-cspSsgHashes and 99-cspSsrNonce can access the context.cheerios[section] temporary results, instead of having to cheerio.load the section over and over again. At the end we just recombine everything back into HTML with 99b-recombineHTML.

As you can see 02a and 99b were added afterwards, for performance reasons only. Hope this helps

vejja avatar Mar 25 '24 13:03 vejja

@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all. P.S. I already have an idea in mind to handle caching these so that there won't be a need to iterate over everything again.

GalacticHypernova avatar Mar 25 '24 18:03 GalacticHypernova

@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all. P.S. I already have an idea in mind to handle caching these so that there won't be a need to iterate over everything again.

I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues

vejja avatar Mar 25 '24 18:03 vejja

I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues

👍 I mean, I do have a caching strategy in mind which might require 02a, I'm not sure yet, but 99b I do believe will be unnecessary.

GalacticHypernova avatar Mar 25 '24 18:03 GalacticHypernova

If I am correct (I haven't worked much with nitro's HTML hooks myself), the bodyAppend/prepend/etc return strings, right? They just return the stringified HTML? Because I see a $('script') inside a forEach so I want to make sure on what I actually need to do. In case it's an array within an array (like an array in each section) I would need to adjust the regex accordingly.

GalacticHypernova avatar Mar 25 '24 18:03 GalacticHypernova

I can’t remember but I think Typescript gives you the type

Edit: it’s actually an array of strings:

https://github.com/nuxt/nuxt/blob/71ef8bd3ff207fd51c2ca18d5a8c7140476780c7/packages/nuxt/src/core/runtime/nitro/renderer.ts#L15

referenced in https://nuxt.com/docs/api/advanced/hooks#nitro-app-hooks-runtime-server-side

vejja avatar Mar 25 '24 20:03 vejja

Thanks! I remembered that vaguely but I also didn't get the chance to check it myself.

GalacticHypernova avatar Mar 25 '24 20:03 GalacticHypernova

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

GalacticHypernova avatar Mar 26 '24 11:03 GalacticHypernova

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

I think in theory it should always be the first syntax as

Update: Upon reasearching the topic, I think we should always assume it is going to be <script ...></script>.

Let's keep our lives simple and not support <script .../>. It is not HTML spec-compliant, and Nuxt does not use this syntax. Trying to have a regex that catches a useless scenario is likely to introduce more bugs than solve problems.

Thanks for bringing this up @GalacticHypernova

vejja avatar Mar 26 '24 12:03 vejja

Also if you're wondering what regex we used to have before Cheerio, you can check out: https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/02-cspSsg.ts https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts

vejja avatar Mar 26 '24 13:03 vejja

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

I think in theory it should always be the first syntax as

Update: Upon reasearching the topic, I think we should always assume it is going to be <script ...></script>.

Let's keep our lives simple and not support <script .../>. It is not HTML spec-compliant, and Nuxt does not use this syntax. Trying to have a regex that catches a useless scenario is likely to introduce more bugs than solve problems.

Thanks for bringing this up @GalacticHypernova

Thanks! In theory I could still support that syntax relatively easily, it depends on whether nitro handles the html conversion to the spec compliant version or not, otherwise user-supplied scripts may not have the integrity if they do use the self-closing syntax.

Also if you're wondering what regex we used to have before Cheerio, you can check out: https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/02-cspSsg.ts https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts

Yea, I made the patterns a bit more specific so that it would be easier to handle, though we would likely want to test its performance.

GalacticHypernova avatar Mar 26 '24 13:03 GalacticHypernova

Thanks! In theory I could still support that syntax relatively easily, it depends on whether nitro handles the html conversion to the spec compliant version or not, otherwise user-supplied scripts may not have the integrity if they do use the self-closing syntax.

Up to you. You're absolutely right that otherwise they won't have integrity if they use the self-closing syntax

vejja avatar Mar 26 '24 14:03 vejja

Also what about links? Are they always self closing? Or should I treat them the same as script tags?

GalacticHypernova avatar Mar 26 '24 15:03 GalacticHypernova

Also what about links? Are they always self closing? Or should I treat them the same as script tags?

I think <link> is always empty. It is never self closing, and it has no closing </link> tag. So it should always be like <link rel="stylesheet" href="example.css">

vejja avatar Mar 26 '24 15:03 vejja

We might want to look into a hybrid approach, at least until a suitable solution is found for the edge cases that are harder to catch with regex. That is, for external scripts/links/etc, regex would be used, but for inline stuff cheerio. That way it would still increase performance (as it would be used less), but wouldn't compromise integrity entirely (again, as a temporay solution). Would love to hear your thoughts on this

GalacticHypernova avatar Mar 26 '24 18:03 GalacticHypernova

We might want to look into a hybrid approach, at least until a suitable solution is found for the edge cases that are harder to catch with regex. That is, for external scripts/links/etc, regex would be used, but for inline stuff cheerio. That way it would still increase performance (as it would be used less), but wouldn't compromise integrity entirely (again, as a temporay solution). Would love to hear your thoughts on this

Isn't an inline script the same as an external script except that it doesn't have an src attribute ? Let me dig into this, I need to read the HTML spec 🤮🤮🤮

vejja avatar Mar 26 '24 18:03 vejja

Isn't an inline script the same as an external script except that it doesn't have an src attribute ? Let me dig into this, I need to read the HTML spec 🤮🤮🤮

Well, not exactly. Assuming the worst case scenario, that everyone using this module is trying to cause harm, there are a lot more ways for inline scripts to cause trouble.

GalacticHypernova avatar Mar 26 '24 18:03 GalacticHypernova

Isn't an inline script the same as an external script except that it doesn't have an src attribute ? Let me dig into this, I need to read the HTML spec 🤮🤮🤮

Well, not exactly. Assuming the worst case scenario, that everyone using this module is trying to cause harm, there are a lot more ways for inline scripts to cause trouble.

I'm not sure what the issue could be ?

vejja avatar Mar 26 '24 18:03 vejja

I'm not sure what the issue could be ?

')

GalacticHypernova avatar Mar 26 '24 18:03 GalacticHypernova

I mean, this would be x10 easier if the html's sections are actually one element each. I'll try it out real quick

GalacticHypernova avatar Mar 26 '24 18:03 GalacticHypernova

@vejja I have discovered some things:

  1. It appears as though each section returns a single string in an array (it appears as though they are concatenated with \n and +). image

  2. Script elements cannot be nested, which is great, as that takes away most of the complexities with checking script within script.

  3. Unfortunately the script contents don't have their own dedicated spot.

  4. It appears as though my comment about the inside console.log and the likes is not even possible, which is also great as that takes away some more complexities (tested in the template), as it errors stating "invalid end tag", and also translates to dev app startup failure. This should get tested a bit more (like production env), but this probably means that we can safely assume means the actual end of the script. image image image

GalacticHypernova avatar Mar 26 '24 19:03 GalacticHypernova

@vejja I have discovered some things:

  1. It appears as though each section returns a single string in an array (it appears as though they are concatenated with \n and +).

I think at least in head there can be several items in the array. Probably also depends on Nuxt settings such as nuxt.config.ts and useHead parameters ?

  1. Script elements cannot be nested, which is great, as that takes away most of the complexities with checking script within script.
  2. Unfortunately the script contents don't have their own dedicated spot.
  3. It appears as though my comment about the inside console.log and the likes is not even possible, which is also great as that takes away some more complexities (tested in the template), as it errors stating "invalid end tag", and also translates to dev app startup failure. This should get tested a bit more (like production env), but this probably means that we can safely assume means the actual end of the script.

Very interesting, seems that your regex is good then ?

vejja avatar Mar 26 '24 20:03 vejja

Very interesting, seems that your regex is good then ?

I mean, I'm not entirely sure yet, becasue as I said, this still needs to get thoroughly tested to not have its own edge cases, but from what I had gathered so far, through layouts/app.vue/pages, both in templates and through useHead for programmatic script insertion, it errors on both build and dev, but I only tested it for like 20 minutes so I would try to test it a bit further before making the final verdict, but I guess for the most part this means the regex is indeed good

GalacticHypernova avatar Mar 26 '24 20:03 GalacticHypernova

@Baroshem you're probably way more experienced in nuxt than me, could I get a confirmation that doing anything with "" (like in a console.log) inside a script is disallowed and will fail to compile, meaning there is no special handling to add to such case?

GalacticHypernova avatar Mar 26 '24 21:03 GalacticHypernova

@Baroshem you're probably way more experienced in nuxt than me, could I get a confirmation that doing anything with "" (like in a console.log) inside a script is disallowed and will fail to compile, meaning there is no special handling to add to such case?

Hi @GalacticHypernova I just verified, and it is not possible to include </script> in a javascript string. Trying to do so triggers a unterminated string literal Javascript error.

vejja avatar Mar 27 '24 13:03 vejja

Hey guys, sorry but I was wuite busy for the last few days.

If you need my opinion on something please let me know and I will try to answer the best I can :)

Baroshem avatar Mar 29 '24 06:03 Baroshem

I just verified, and it is not possible to include </script> in a javascript string. Trying to do so triggers a unterminated string literal Javascript error.

Thank you! This does very much simplify it.

If you need my opinion on something please let me know and I will try to answer the best I can :)

Thanks! And it is perfectly fine!

GalacticHypernova avatar Mar 29 '24 16:03 GalacticHypernova

@vejja @Baroshem this is not yet finished, however I wanted to ask what your opinions on the current implementation is, and if there's anything I should change (the errors in the tests don't seem quite right considering I'm directly modifying them as strings and return the modifies ones so I don't quite know where the origin of the error is).

GalacticHypernova avatar Mar 31 '24 16:03 GalacticHypernova

@vejja @Baroshem this is not yet finished, however I wanted to ask what your opinions on the current implementation is, and if there's anything I should change (the errors in the tests don't seem quite right considering I'm directly modifying them as strings and return the modifies ones so I don't quite know where the origin of the error is).

My opinion is that it looks good, thanks !!

The first test that fails is it('injects integrity on Nuxt root scripts') In that test, the section head has only one array item, and it is a string which is roughly in this format :

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="modulepreload" as="script" crossorigin href="/_nuxt/entry.DYpycOZ-.js">
<link rel="modulepreload" as="script" crossorigin href="/_nuxt/index.-yXp0hiT.js">
<link rel="modulepreload" as="script" crossorigin href="/_nuxt/nuxt-link.sHJ0UhRl.js">
<link rel="prefetch" as="style" href="/_nuxt/error-404.c7Cl-q-2.css">
<link rel="prefetch" as="script" crossorigin href="/_nuxt/error-404.DFp4BJVl.js">
<link rel="prefetch" as="script" crossorigin href="/_nuxt/vue.f36acd1f.DbAQkOB5.js">
<link rel="prefetch" as="style" href="/_nuxt/error-500.DacwWOBj.css">
<link rel="prefetch" as="script" crossorigin href="/_nuxt/error-500.DKZOBwlI.js">
<script type="module" src="/_nuxt/entry.DYpycOZ-.js" crossorigin></script>

The SCRIPT_RE and LINK_RE regexes in 03-subresourceIntegrity don't find the correct strings to inject integrity attributes, which are in lines 3, 4, 5 and 11 (respectively: link to entry.js, link to index.js, link to nuxt-link.js, and script module entry.js).

Update: this is due to the presence of minus (-) hyphen in the src and href strings, which is excluded by the w regex check. I think you can delete the w check, as anything between the two " quotations mark would fit. This comment is also true for the integrity checks

vejja avatar Apr 02 '24 14:04 vejja