regjsparser
regjsparser copied to clipboard
Merge characterClassEscape and dot type?
This came to my mind when preparing the presentation for Amsterdam.JS:
Currently there is a special type="dot"
for things like /./
. Per see there is nothing wrong with this, but the type feels very similar to type="characterClassEscape "
. How do you feel to merge the type characterClassEscape
and dot
? Maybe into specialCharacterClass
?
Or, alternative idea: similar to how different types got merged into type=value
, merge dot
, characterClassEscape
and the existing characterClass
into characterClass
and add a new kind
entry? I like this, as it not only gets away with the type dot
, but also with the type characterClassEscape
, which sounds similar to characterClass
, but is still completly different although similar. Like:
{
type: "characterClass",
kind: "range",
body: [ { type: "characterClassRange", ...} ]
}
{
type: "characterClass",
kind: "singleChar",
char: "d"
// The body is the not needed here
// body: [ ]
}
This looks interesting to me, but I dislike the inconsistency by using body
in one case and char
in the other one to encode the "meaning" of the characterClass. In the case of value
, all the different kinds have a codePoint
entry. A possible way to achieve a similar feeling of consistency here could be to store on the body
of the type: "characterClass
in the case of the kind: "singleChar"
the actual ranges that are matched. E.g. in the case of /\d/
:
{
type: "characterClass",
kind: "singleChar",
body: [ {type: "characterClassRange", from: 48, to: 57} ],
raw: "\d"
}
Looks nice, but encoding /\s/
this way will result in a very large body :/ Here are the two functions used in RegExp.JS to test for a /\s/
string:
function isWhiteSpace(ch) {
return (ch === 32) || // space
(ch === 9) || // tab
(ch === 0xB) ||
(ch === 0xC) ||
(ch === 0xA0) ||
(ch >= 0x1680 && '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'.indexOf(String.fromCharCode(ch)) > 0);
}
// 7.3 Line Terminators
function isLineTerminator(ch) {
return (ch === 10) || (ch === 13) || (ch === 0x2028) || (ch === 0x2029);
}
Personally, I am not sure if the consistency is worth the larger AST output here.
So, maybe go with specialCharacterClass
and characterClass
? Any thoughts? Or do you think merging dot
into a different type
is not worth the efford and this issue should be closed right away ;)?