
Merge characterClassEscape and dot type?

Open jviereck opened this issue 10 years ago • 0 comments

This came to my mind when preparing the presentation for Amsterdam.JS:

Currently there is a special type="dot" for things like /./. Per se there is nothing wrong with this, but the type feels very similar to type="characterClassEscape". How do you feel about merging the types characterClassEscape and dot? Maybe into specialCharacterClass?

Or, as an alternative idea: similar to how different types got merged into type=value, merge dot, characterClassEscape and the existing characterClass into characterClass and add a new kind entry? I like this, as it not only does away with the type dot, but also with the type characterClassEscape, which sounds similar to characterClass but is still completely different. Like:

{
  type: "characterClass",
  kind: "range",
  body: [ { type: "characterClassRange", ...} ]
}

{
  type: "characterClass",
  kind: "singleChar",
  char: "d"
  // The body is not needed here
  // body: [ ]
}

This looks interesting to me, but I dislike the inconsistency of using body in one case and char in the other to encode the "meaning" of the characterClass. In the case of value, all the different kinds have a codePoint entry. A possible way to achieve a similar feeling of consistency here could be to store, on the body of the type: "characterClass" node in the case of kind: "singleChar", the actual ranges that are matched. E.g. in the case of /\d/:

{
  type: "characterClass",
  kind: "singleChar",
  body: [ {type: "characterClassRange", from: 48, to: 57} ],
  raw: "\\d"
}
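To make the consistency argument concrete, here is a minimal sketch of how a consumer could test a code point against such a node if every kind stored its ranges on body. The node shape follows the proposal above, not regjsparser's actual output, and matchesCharacterClass is a hypothetical helper name:

```javascript
// Hypothetical node for /\d/, shaped as proposed above.
const digitClass = {
  type: "characterClass",
  kind: "singleChar",
  body: [{ type: "characterClassRange", from: 48, to: 57 }],
  raw: "\\d"
};

// Hypothetical helper: a single code path works for every kind,
// because the matched ranges always live on body.
function matchesCharacterClass(node, codePoint) {
  return node.body.some(function (range) {
    return codePoint >= range.from && codePoint <= range.to;
  });
}

console.log(matchesCharacterClass(digitClass, "7".charCodeAt(0))); // true
console.log(matchesCharacterClass(digitClass, "a".charCodeAt(0))); // false
```

With the body/char variant, the same helper would need a separate branch per kind.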

Looks nice, but encoding /\s/ this way will result in a very large body :/ Here are the two functions used in RegExp.JS to test whether a character matches /\s/:

function isWhiteSpace(ch) {
    return (ch === 32) ||  // space
        (ch === 9) ||      // tab
        (ch === 0xB) ||    // vertical tab
        (ch === 0xC) ||    // form feed
        (ch === 0xA0) ||   // no-break space
        // indexOf returns 0 for \u1680, so the check must be >= 0, not > 0
        (ch >= 0x1680 && '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'.indexOf(String.fromCharCode(ch)) >= 0);
}

// 7.3 Line Terminators

function isLineTerminator(ch) {
    return (ch === 10) || (ch === 13) || (ch === 0x2028) || (ch === 0x2029);
}
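To get a feeling for how large the body would actually be, here is a rough sketch that scans the BMP with the two predicates and coalesces adjacent matches into ranges. The predicates are restated here with an inclusive `>= 0` check so that the first listed character, U+1680, is counted. collectRanges is a hypothetical helper, and the range shape is the one proposed above:

```javascript
function isWhiteSpace(ch) {
  return ch === 32 || ch === 9 || ch === 0xB || ch === 0xC || ch === 0xA0 ||
    (ch >= 0x1680 &&
      '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'
        .indexOf(String.fromCharCode(ch)) >= 0);
}

function isLineTerminator(ch) {
  return ch === 10 || ch === 13 || ch === 0x2028 || ch === 0x2029;
}

// Hypothetical helper: walk 0..0xFFFF and merge consecutive hits into ranges.
// The loop runs one step past 0xFFFF so that an open range is always closed.
function collectRanges(predicate) {
  const ranges = [];
  let start = -1;
  for (let ch = 0; ch <= 0x10000; ch++) {
    const hit = ch <= 0xFFFF && predicate(ch);
    if (hit && start < 0) {
      start = ch;
    } else if (!hit && start >= 0) {
      ranges.push({ type: "characterClassRange", from: start, to: ch - 1 });
      start = -1;
    }
  }
  return ranges;
}

const whitespaceBody = collectRanges(function (ch) {
  return isWhiteSpace(ch) || isLineTerminator(ch);
});

console.log(whitespaceBody.length); // 11
console.log(whitespaceBody[0]);     // the range from 9 to 13 (tab through CR)
```

So under these assumptions the body for /\s/ would hold 11 range entries, compared to a single one for /\d/.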

Personally, I am not sure if the consistency is worth the larger AST output here.

So, maybe go with specialCharacterClass and characterClass? Any thoughts? Or do you think merging dot into a different type is not worth the effort and this issue should be closed right away ;)?

jviereck, Sep 05 '14 09:09