es6-shim icon indicating copy to clipboard operation
es6-shim copied to clipboard

update String#trim

Open lewisje opened this issue 9 years ago • 17 comments

Check for two commonly erroneously trimmed characters instead of one, and check for erroneously failing to trim the trimmable whitespace characters; then use aa* instead of a+, and two regexes instead of one, both because they are more performant (it is also more performant to first trim from the beginning and then trim from the end).

I would also replace if (typeof this === 'undefined' || this === null) with if (this == null) but I don't know whether that's allowed by this library's style rules.

I would also update the the minified shim file, but I suspect that's something the maintainers instead do before every proper release.

lewisje avatar Jun 26 '15 06:06 lewisje

Thanks for the contribution!

You're correct about the style and the proper time to update minified files.

Can you also provide a test case that would fail without this change?

Also, I'd prefer to separate the bug fix PR from the performance PR, if possible.

Please also rebase out typo fix commits and force push :-)

ljharb avatar Jun 26 '15 06:06 ljharb

Can you also link to the ES6 spec and show where these characters should be included?

ljharb avatar Jun 26 '15 06:06 ljharb

Can you also link to the ES6 spec and show where these characters should be included?

The spec is here: http://ecma-international.org/ecma-262/6.0/#sec-white-space

Throughout the spec, WhiteSpace consists of U+0009, U+000B, U+000C, U+0020, U+00A0, and U+FEFF + all code points in the Zs category as per Unicode v5.1.0 + any additional code points in the Zs category in the Unicode version used by the engine. In regexpu, I assume the latest available Unicode version (i.e. Unicode v8.0.0 at the time of writing) for the latter.

https://github.com/mathiasbynens/regexpu/blob/ff17a00b63a017a69fd93da455a8944eb18918ce/scripts/character-class-escape-sets.js#L72-L80

The algorithm for String.prototype.trim specifies that “the definition of white space is the union of WhiteSpace and LineTerminator”. So, the regular expression is equivalent to /\s/u in a conforming ES6 engine: https://mothereff.in/regexpu#%2F%5Cs%2Fu This, in turn, is equivalent to /[\t-\r \xA0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]/. This regex is much simpler than the one that is currently used.

mathiasbynens avatar Jun 26 '15 07:06 mathiasbynens

So are the code points we're missing ones that were added in later Unicode versions? I'm still not clear on the problem here, but I do want to conform :-)

ljharb avatar Jun 26 '15 07:06 ljharb

Looks like this patch doesn’t add any code points to the whitespace set. It only checks '\u0085\u002B' instead of just '\u0085' for hasStringBug.

The patch description mentions these are commonly erroneously trimmed. I doubt the spec has info on that, @ljharb. Perhaps @lewisje can tell us which engines incorrectly trim \u002B?

mathiasbynens avatar Jun 26 '15 07:06 mathiasbynens

I don't have information on any engines that commonly erroneously trim either of those characters: I just know that this shim checks \x85 and es5-shim checks \u200B, so I thought both of them were erroneously trimmed widely enough to be checked.

Whitespace is defined in section 7.2 of the ES5 spec: http://es5.github.io/#x7.2

It includes \x09, \x0B, \x0C, \x20, \xA0, and \uFEFF and makes reference to the Unicode category "Zs": http://www.unicode.org/Public/8.0.0/ucd/PropList.txt

Both U+180E MONGOLIAN VOWEL SEPARATOR and U+200B ZERO WIDTH SPACE were re-classified from Zs (Space_Separator) to Cf (Format), while U+0085 NEXT LINE (NEL) is in Cc (Control); U+180E was moved in Unicode 6.3, U+200B was moved in Unicode 4.0.1, and U+0085 has been in Cc at least since Unicode 1.1.5.

Line terminators are defined in 7.3 as \x0A, \x0D, \u2028, and \u2029: http://es5.github.io/#x7.3

The spec says, in 15.5.4.20, that String#trim must remove "both leading and trailing white space" where "white space" means both whitespace and line terminators: http://es5.github.io/#x15.5.4.20

The ES6 spec has the same definitions for "whitespace" and "line terminator" as the ES5 spec, in Tables 32 and 33, respectively.

The ES5 spec also says that Unicode 3.0 or later is followed, which means that U+180E and U+200B may be considered non-trimmable, and also that U+0085, despite appearing in Unicode's "White_Space" list, must be considered non-trimmable, because it is not in the explicit list of whitespace or line terminators and is not in Zs: http://es5.github.io/#x2

However, the ES6 spec says that Unicode 5.1.0 or later is followed, which means that U+0085 and U+200B must be considered non-trimmable, while U+180E may be considered non-trimmable.

This means that the test for improperly trimmed characters in this shim must include both \x85 and \u200B, and in es-shims/es5-shim#316, it must include \x85 but not necessarily \u200B (it currently includes \u200B but not \x85); similarly, this shim may need to remove \u180E from the trim regex, depending on what version of Unicode it's targeting (and then add it to the improper-trim test, if a consistent API is desired).

lewisje avatar Jun 26 '15 07:06 lewisje

Sounds good - we still need an automated test to prevent your fix from being accidentally removed later.

ljharb avatar Jun 26 '15 07:06 ljharb

However, the ES6 spec says that Unicode 5.1.0 or later is followed, which means that U+0085 and U+200B must be considered non-trimmable, while U+180E may be considered non-trimmable.

U+180E is Zs as per Unicode 5.1.0 so it must be considered trimmable in order to be ES6-compliant.

mathiasbynens avatar Jun 26 '15 07:06 mathiasbynens

Then this means this shim is specifically targeting 5.1.0, thanks for letting me know.

I'll get a test in later today, and then feel the frustration that comes with not being able to rebase from GitHub's Web-based interface.

lewisje avatar Jun 26 '15 08:06 lewisje

Thanks! Totally agree that it's lame you can't rebase on the web

ljharb avatar Jun 26 '15 08:06 ljharb

Then this means this shim is specifically targeting 5.1.0, thanks for letting me know.

The ES6 spec is, and this shim aims to follow that.

Thanks for bringing this up! We might be able to simplify the regexp (see my earlier comment: https://github.com/paulmillr/es6-shim/pull/354#issuecomment-115555788).

mathiasbynens avatar Jun 26 '15 08:06 mathiasbynens

The downside to using ranges in the regexp string is that it couldn't be re-purposed to check for failure to trim all trimmable whitespace (although it would save characters to use \t and \r and the like instead of the \x or \u escapes); also, I'm not sure we should rely on the u flag being available for regexes.

lewisje avatar Jun 26 '15 13:06 lewisje

It's ok to make multiple regexes rather than repurposing - the cost of that is negligible.

ljharb avatar Jun 26 '15 15:06 ljharb

I'm not sure we should rely on the u flag being available for regexes.

I never said we should! That wouldn’t work in any existing browser. I was suggesting /[\t-\r \xA0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]/ (i.e. what /\s/u translates to).

mathiasbynens avatar Jun 26 '15 22:06 mathiasbynens

@lewisje Are you planning on completing this PR?

ljharb avatar Jul 20 '15 05:07 ljharb

I am, but I got distracted :disappointed:

lewisje avatar Jul 20 '15 17:07 lewisje

I made U+200B a separate test because if I combined them, and an implementation fails the "should trim on both sides" test, it could end up passing a test involving \u0085\u200B (say, if it normally trims \u200B but doesn't trim from the right to begin with).

Maybe this means, though, that in es6-shim.js I should have also tested those two characters separately instead of in the same string.

lewisje avatar Aug 23 '15 06:08 lewisje