terser icon indicating copy to clipboard operation
terser copied to clipboard

utf-8 characters in comments remain with ascii_only: true

Open bendmorris opened this issue 3 years ago • 6 comments

The output.ascii_only option escapes multibyte characters in strings and regex according to the documentation - and in practice, also variable names, so one might reasonably expect that with this option, they'll get an ascii artifact.

However, comments preserved with the output.comments option retain multibyte characters despite this setting.

Version (complete output of terser -V or specific git commit)

5.15.1

Complete CLI command or minify() options used

{
  "output": {
    "ascii_only": true,
    "comments": "preserve"
  }
}

terser input

const ಠ_ಠ = 1;
const myString = 'ಠ_ಠ';
// preserve: ಠ_ಠ

terser output or error

const \u0ca0_\u0ca0=1,myString="\u0ca0_\u0ca0";
// preserve: ಠ_ಠ

Expected result

const \u0ca0_\u0ca0=1,myString="\u0ca0_\u0ca0";
// preserve: \u0ca0_\u0ca0

bendmorris avatar Oct 13 '22 02:10 bendmorris

You can't escape plain text, and even JSDoc/etc doesn't seem to support escape characters (so not even that makes sense). Hence it doesn't make any sense to escape comments, because the output would practically be "garbage" (i.e. a human can't meaningfully read escape codes). The logical option would be crush/transliterate the characters to ASCII and lose some information (because you're asking to represent something ASCII simply can't). Is the escaped "garbage" preferable for some esoteric reason? Maybe. But I would say that fundamentally there is no "correct" here, because there is no formal mechanism that works. You could even argue that the current behavior is the preferable one, it depends on the exact expectations/promises of ascii_only.

syranide avatar Oct 13 '22 12:10 syranide

Unicode in comments also won't have any effect on the running code, and transforming them to escapes would actually make the output larger. We support ascii_only despite this in strings/identifiers because there are observable bugs if the server isn't properly setup to server UTF-8 content types.

jridgewell avatar Oct 13 '22 14:10 jridgewell

You could even argue that the current behavior is the preferable one, it depends on the exact expectations/promises of ascii_only.

I was leaning towards this - but (1) the naming "output.ascii_only" implies that you get, well, output that's ascii only, and (2) the documentation only specifies strings but it also applies to variable names, so clearly the documentation isn't exhaustive.

Unicode in comments also won't have any effect on the running code

The reason I noticed this is that most JS engines (including JavaScriptCore, V8) will use more memory to represent the source if it contains any non-ascii characters - one byte per character if none, two if any. So that one em-dash in a comment in your 6 MB JS bundle will cause it to use 12 MB of memory, which is a nontrivial increase. This was actually causing real issues for us in a very memory constrained environment (smart TVs) and was hard to track down.

bendmorris avatar Oct 13 '22 14:10 bendmorris

The reason I noticed this is that most JS engines (including JavaScriptCore, V8) will use more memory to represent the source if it contains any non-ascii characters - one byte per character if none, two if any. So that one em-dash in a comment in your 6 MB JS bundle will cause it to use 12 MB of memory, which is a nontrivial increase. This was actually causing real issues for us in a very memory constrained environment (smart TVs) and was hard to track down.

Interesting! But also sounds kind of fishy? Unless you mean only during parsing then I'm not sure why this would have a lasting effect, logically it would parse it into its AST representation and then discard the source code and all comments that go along with it. Unless there's some development features that keep the source around for debugging/etc? Not my expertise.

syranide avatar Oct 13 '22 14:10 syranide

JS engines do hold onto the source string, usually as either a one byte per character string or UTF-16. Function.toString() more or less requires them to keep the source and from what I've seen, most engines just copy the entire string and keep it in memory.

bendmorris avatar Oct 13 '22 14:10 bendmorris

I've read this issue multiple times but haven't weighed in yet. Since the reason for ascii_only is servers and tooling, it wouldn't make sense to keep multibyte characters in comments. I propose adding an ascii_only_comments, which by default is on if ascii_only is on.

fabiosantoscode avatar Feb 05 '23 10:02 fabiosantoscode