regexp-make-js icon indicating copy to clipboard operation
regexp-make-js copied to clipboard

Disallow irregular patterns

Open bergus opened this issue 10 years ago • 8 comments

The current regex grammar is quite liberal about SyntaxCharacters that appear in places where they are not allowed. For example, /a{/ is just taken literally.

This complicates parsing, and deciding what context we are in to escape the interpolation value. As an example, what would RegExp.make a{${"1,}b"}do, is it equivalent to/a{1,}b/`?

We should ensure that only well-formed patterns (where interpolation "holes" can produce Atoms, DecimalDigits etc) are allowed as arguments to make, and everything else throw a SyntaxError.

bergus avatar Aug 12 '15 04:08 bergus

It'd be ideal if we didn't have to use try/catch to handle error cases :-/ one of the primary use cases is handling user input, and if it could throw (like JSON.parse) then our code has to be littered with try/catches, or wrapped in a Promise.

ljharb avatar Aug 12 '15 04:08 ljharb

@ljharb Nah, I meant that RegExp.make a{ ${input}`` always throws, regardless of the user input, because unmatched braces are a programmer mistake (and make the implementation of make harder).
Of course, RegExp.make would not throw on the interpolation values (as long as they have the expected Type, e.g. string/regex, and don't throw on coercion); all character sequences are valid and do not need to meet any pattern.

bergus avatar Aug 12 '15 05:08 bergus

ah, ok awesome - throws for programmer input are awesome :-)

ljharb avatar Aug 12 '15 05:08 ljharb

So throw on errors in literal sections per early errors but should never throw because of an interpolated value?

mikesamuel avatar Aug 12 '15 06:08 mikesamuel

Well, it wouldn't be exactly a compile-time early error (like they are possible for regex literals), as RegExp.make gets to see the raw strings not before its invocation, but yes, it would only throw these errors because of syntactic issues in the literal sections. Interpolation values are always escaped literally, and don't need to conform to any syntactic grammar.

should never throw because of an interpolated value?

OT, but a strict "never" is not possible of course. We will always need to throw in cases like

RegExp.make `x{${ -1 }}` // repeat -1 times
RegExp.male `^${ {toString(){ throw "buh!"; }} }` // better not coerce that

To precise: RegExp.make must never throw when interpolating a primitive string value in the context of an Atom (the original purpose of RegExp.escape).

bergus avatar Aug 12 '15 07:08 bergus

Fair enough.

So constructs that are problematic from regexp_match_web_reality:

  1. \c not followed by a control code
  2. \x not followed by 2 x hex and \u not followed by (4 x hex or { hex+ })
  3. { not followed by a proper range then a } or not interpolable to the same.
  4. [ not closed by a ]
  5. character groups like \w used in a charset range
  6. zero-width assertions like \B used in a charset.

Not on that list, but patterns that could indicate a disconnect between author and interpreter:

  1. (? followed by a character that has a special meaning in other regular expression systems like (?i:, (?i, (?<!
  2. unicode character classes \p, [:Lu:] in character classes.

mikesamuel avatar Aug 13 '15 13:08 mikesamuel

Oh, there's already a list of them? Yeah, it would be nice if we could deprecate them right from the beginning :-) I'm specifically concerned about everything that would change the context for interpolation holes, like 3) for quantifiers or 4) for classes, but might not if it's not properly closed. Also the incomplete escape sequences are a hazard if directly followed by an interpolated value.

I think we will need to allow those patterns like (?i:, (?i, (?<! though, as they could be used by an extend regular grammar (of the engine) or a subclass (userland). RegExp.make doesn't need to deal with them; if not supported the RegExp constructor will throw on them anyway (or are the browsers where it doesn't?). However, a pattern like …(?${input})… should be considered irregular indeed.

bergus avatar Aug 13 '15 22:08 bergus

Agreed on (?${input}).

To summarize:

We should fail on interpolations in some contexts.

  • \c${...}
  • \u${...}
  • \x${...}
  • (?${...})`

Re lookbehind, should we dissallow interplations in

  • (?<${...})

where the valence of the interpolation has not yet been specified.

What about in charsets on either side of a -?

  • [a-${...}]
  • [${...}-z]

mikesamuel avatar Sep 03 '15 14:09 mikesamuel