arktype icon indicating copy to clipboard operation
arktype copied to clipboard

Type-safe regex

Open ssalbdivad opened this issue 1 year ago • 13 comments

This would involve updating the static parser to give more detailed feedback and inference for regex literals embedded in strings.

The general goals would be:

  1. If a RegExp will throw an error when instantiated via new RegExp, we should try and have a corresponding type error describing the problem
  2. If a RegExp implies anything about the contents of a string that would match it that can be expressed using template literals, we should infer its type.

I think there may be limitations on (2) in particular (lookaheads etc could be complicated?) but we can go for the low hanging fruit, e.g. literal strings embedded in regex.

Here are some basic cases that should be covered:

// should be a parse error
type("/(unterminated_group/")

// should be inferred as `${string}foo${string}`
type("/foo/")


// should be inferred as `foo${string}`
type("/^foo/")

// should be inferred as `${string}foo`
type("/foo$/")

// should be inferred as `foo${string}bar`
type("/^foo.*bar$/")

Cases like this could also work. It will quickly create types that are not performant to use, so ideally we should internally have a way to bail out if the expressions gets too complex and return string if we end up supporting stuff like this:

// should be inferred as `${string}${0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9}${string}`
type("/\\d/")

// should be inferred as `a${0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ""}b`
type("/^a\\d?b$/")

It wouldn't work with standard raw regex literals like:

// can't narrow because TS just types it as RegExp
type(/^foo.*/)

ssalbdivad avatar Mar 27 '23 17:03 ssalbdivad

@didavid61202 is looking into this! 😍

ssalbdivad avatar Apr 18 '23 11:04 ssalbdivad

How did you solve the double escape problem?

Dimava avatar May 05 '23 18:05 Dimava

@Dimava You're right that looks wrong, I'll update the issue!

ssalbdivad avatar May 05 '23 18:05 ssalbdivad

@Hsiwe Is looking into this now🔥

ssalbdivad avatar Nov 07 '23 18:11 ssalbdivad

@ssalbdivad I guess I can give this a shot, per your request. I'm probably going to stick to my current parsing strategy while I finish feature-completeness on my proof-of-concept matcher (I think I've figured out how to force the matching to progress in a sensible left-to-right fashion without doing a character-wise lookahead), but I'm open to working with you to improve the implementation for robustness/performance once I finish doing so.

Oblarg avatar Sep 09 '24 15:09 Oblarg

@Oblarg Are you sure a top-down strategy will work for regex? There were cases I ran into like '1|2'|'3|4' that were essentially impossible to parse top-down as far as TS syntax which is what motivated me to switch to shift-reduce parsing in the first place.

Haven't fully considered if in every instance a regex control character would interfere with matching it's preceded by an escape or if it's contextual, but I can imagine top down could be problematic in those scenarios.

ssalbdivad avatar Sep 09 '24 15:09 ssalbdivad

I'm not sure of anything, but the results so far have been extremely promising. My overall strategy is to first split on "or" operators into a union, and then recursively parse each branch in the union by searching for the leftmost unescaped quantifier or wildcard, performing a prefix match, and recursing on the remaining portions of the match string and regex.

I think this sort of "two-pass" approach (first a structural split into a union, then a sequential parse) has a good chance of working.

Oblarg avatar Sep 09 '24 16:09 Oblarg

But what if you have an escaped foo|\||bar? It doesn't seem like you'd be able to recover from this top down.

It seems quite similar to the issues I eventually ran into with a similar approach parsing TS syntax at a type-level.

ssalbdivad avatar Sep 09 '24 16:09 ssalbdivad

Escape-pruning has proven way easier than anticipated, and I'm fairly sure I can easily force the union-split to ignore escaped operators (and prune the escapes).

Remember, though, that i'm only defining a generic type that represents the regex match with conditional expansion. This gives me a lot more tools in my toolbox than if I were writing a first-class type.

Oblarg avatar Sep 09 '24 16:09 Oblarg

@Oblarg Yes I'm aware, ArkType's validation works quite similarly!

I'd have to look into it more. It may be possible if in regex every control character would be directly preceded by an escape character to filter those out somehow, whereas if you have to be able to parse quoted strings top-down definitely can't work.

My intuition is still that you will hit cases where it falls apart though. It doesn't seem like you'd ever be able to handle grouping- matching sets of parens for an arbitrary expression like (foo(bar)*)|(baz)* definitely seems impossible?

It was a really fun journey to explore all of this at a type-level so I don't want to deprive you of that. Just consider that likely, a top-down approach will have inherent limits that a bottom up approach will not, so if you start to run into those, it may be worth considering using a shift-reduce strategy like the one I mentioned.

ssalbdivad avatar Sep 09 '24 16:09 ssalbdivad

I already have grouping working in my proof of concept, though only (so far) for the “or” operator. I suspect it will work for the others.

Oblarg avatar Sep 09 '24 16:09 Oblarg

I'd say just try handling this test case in particular: (foo(bar|quux)*)|(baz)*

If it is possible, I'd definitely be interested to see what that would look like.

ssalbdivad avatar Sep 09 '24 16:09 ssalbdivad

I’ll give it a shot when I can!

Oblarg avatar Sep 09 '24 16:09 Oblarg