es6-shim
es6-shim copied to clipboard
update String#trim
Check for two commonly erroneously trimmed characters instead of one, and check for erroneously failing to trim the trimmable whitespace characters; then use aa*
instead of a+
, and two regexes instead of one, both because they are more performant (it is also more performant to first trim from the beginning and then trim from the end).
I would also replace if (typeof this === 'undefined' || this === null)
with if (this == null)
but I don't know whether that's allowed by this library's style rules.
I would also update the the minified shim file, but I suspect that's something the maintainers instead do before every proper release.
Thanks for the contribution!
You're correct about the style and the proper time to update minified files.
Can you also provide a test case that would fail without this change?
Also, I'd prefer to separate the bug fix PR from the performance PR, if possible.
Please also rebase out typo fix commits and force push :-)
Can you also link to the ES6 spec and show where these characters should be included?
Can you also link to the ES6 spec and show where these characters should be included?
The spec is here: http://ecma-international.org/ecma-262/6.0/#sec-white-space
Throughout the spec, WhiteSpace
consists of U+0009, U+000B, U+000C, U+0020, U+00A0, and U+FEFF + all code points in the Zs
category as per Unicode v5.1.0 + any additional code points in the Zs
category in the Unicode version used by the engine. In regexpu, I assume the latest available Unicode version (i.e. Unicode v8.0.0 at the time of writing) for the latter.
https://github.com/mathiasbynens/regexpu/blob/ff17a00b63a017a69fd93da455a8944eb18918ce/scripts/character-class-escape-sets.js#L72-L80
The algorithm for String.prototype.trim
specifies that “the definition of white space is the union of WhiteSpace
and LineTerminator
”. So, the regular expression is equivalent to /\s/u
in a conforming ES6 engine: https://mothereff.in/regexpu#%2F%5Cs%2Fu This, in turn, is equivalent to /[\t-\r \xA0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]/
. This regex is much simpler than the one that is currently used.
So are the code points we're missing ones that were added in later Unicode versions? I'm still not clear on the problem here, but I do want to conform :-)
Looks like this patch doesn’t add any code points to the whitespace set. It only checks
'\u0085\u002B'
instead of just '\u0085'
for hasStringBug
.
The patch description mentions these are commonly erroneously trimmed. I doubt the spec has info on that, @ljharb. Perhaps @lewisje can tell us which engines incorrectly trim \u002B
?
I don't have information on any engines that commonly erroneously trim either of those characters: I just know that this shim checks \x85
and es5-shim checks \u200B
, so I thought both of them were erroneously trimmed widely enough to be checked.
Whitespace is defined in section 7.2 of the ES5 spec: http://es5.github.io/#x7.2
It includes \x09
, \x0B
, \x0C
, \x20
, \xA0
, and \uFEFF
and makes reference to the Unicode category "Zs": http://www.unicode.org/Public/8.0.0/ucd/PropList.txt
Both U+180E MONGOLIAN VOWEL SEPARATOR and U+200B ZERO WIDTH SPACE were re-classified from Zs (Space_Separator) to Cf (Format), while U+0085 NEXT LINE (NEL) is in Cc (Control); U+180E was moved in Unicode 6.3, U+200B was moved in Unicode 4.0.1, and U+0085 has been in Cc at least since Unicode 1.1.5.
Line terminators are defined in 7.3 as \x0A
, \x0D
, \u2028
, and \u2029
: http://es5.github.io/#x7.3
The spec says, in 15.5.4.20, that String#trim must remove "both leading and trailing white space" where "white space" means both whitespace and line terminators: http://es5.github.io/#x15.5.4.20
The ES6 spec has the same definitions for "whitespace" and "line terminator" as the ES5 spec, in Tables 32 and 33, respectively.
The ES5 spec also says that Unicode 3.0 or later is followed, which means that U+180E and U+200B may be considered non-trimmable, and also that U+0085, despite appearing in Unicode's "White_Space" list, must be considered non-trimmable, because it is not in the explicit list of whitespace or line terminators and is not in Zs: http://es5.github.io/#x2
However, the ES6 spec says that Unicode 5.1.0 or later is followed, which means that U+0085 and U+200B must be considered non-trimmable, while U+180E may be considered non-trimmable.
This means that the test for improperly trimmed characters in this shim must include both \x85
and \u200B
, and in es-shims/es5-shim#316, it must include \x85
but not necessarily \u200B
(it currently includes \u200B
but not \x85
); similarly, this shim may need to remove \u180E
from the trim regex, depending on what version of Unicode it's targeting (and then add it to the improper-trim test, if a consistent API is desired).
Sounds good - we still need an automated test to prevent your fix from being accidentally removed later.
However, the ES6 spec says that Unicode 5.1.0 or later is followed, which means that U+0085 and U+200B must be considered non-trimmable, while U+180E may be considered non-trimmable.
U+180E is Zs
as per Unicode 5.1.0 so it must be considered trimmable in order to be ES6-compliant.
Then this means this shim is specifically targeting 5.1.0, thanks for letting me know.
I'll get a test in later today, and then feel the frustration that comes with not being able to rebase from GitHub's Web-based interface.
Thanks! Totally agree that it's lame you can't rebase on the web
Then this means this shim is specifically targeting 5.1.0, thanks for letting me know.
The ES6 spec is, and this shim aims to follow that.
Thanks for bringing this up! We might be able to simplify the regexp (see my earlier comment: https://github.com/paulmillr/es6-shim/pull/354#issuecomment-115555788).
The downside to using ranges in the regexp string is that it couldn't be re-purposed to check for failure to trim all trimmable whitespace (although it would save characters to use \t
and \r
and the like instead of the \x or \u escapes); also, I'm not sure we should rely on the u
flag being available for regexes.
It's ok to make multiple regexes rather than repurposing - the cost of that is negligible.
I'm not sure we should rely on the
u
flag being available for regexes.
I never said we should! That wouldn’t work in any existing browser. I was suggesting /[\t-\r \xA0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]/
(i.e. what /\s/u
translates to).
@lewisje Are you planning on completing this PR?
I am, but I got distracted :disappointed:
I made U+200B
a separate test because if I combined them, and an implementation fails the "should trim on both sides" test, it could end up passing a test involving \u0085\u200B
(say, if it normally trims \u200B
but doesn't trim from the right to begin with).
Maybe this means, though, that in es6-shim.js I should have also tested those two characters separately instead of in the same string.