arktype
arktype copied to clipboard
Type-safe regex
This would involve updating the static parser to give more detailed feedback and inference for regex literals embedded in strings.
The general goals would be:
- If a RegExp will throw an error when instantiated via new RegExp, we should try and have a corresponding type error describing the problem
- If a RegExp implies anything about the contents of a string that would match it that can be expressed using template literals, we should infer its type.
I think there may be limitations on (2) in particular (lookaheads etc could be complicated?) but we can go for the low hanging fruit, e.g. literal strings embedded in regex.
Here are some basic cases that should be covered:
// should be a parse error
type("/(unterminated_group/")
// should be inferred as `${string}foo${string}`
type("/foo/")
// should be inferred as `foo${string}`
type("/^foo/")
// should be inferred as `${string}foo`
type("/foo$/")
// should be inferred as `foo${string}bar`
type("/^foo.*bar$/")
Cases like this could also work. It will quickly create types that are not performant to use, so ideally we should internally have a way to bail out if the expressions gets too complex and return string if we end up supporting stuff like this:
// should be inferred as `${string}${0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9}${string}`
type("/\\d/")
// should be inferred as `a${0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ""}b`
type("/^a\\d?b$/")
It wouldn't work with standard raw regex literals like:
// can't narrow because TS just types it as RegExp
type(/^foo.*/)
@didavid61202 is looking into this! 😍
How did you solve the double escape problem?
@Dimava You're right that looks wrong, I'll update the issue!
@Hsiwe Is looking into this now🔥
@ssalbdivad I guess I can give this a shot, per your request. I'm probably going to stick to my current parsing strategy while I finish feature-completeness on my proof-of-concept matcher (I think I've figured out how to force the matching to progress in a sensible left-to-right fashion without doing a character-wise lookahead), but I'm open to working with you to improve the implementation for robustness/performance once I finish doing so.
@Oblarg Are you sure a top-down strategy will work for regex? There were cases I ran into like '1|2'|'3|4'
that were essentially impossible to parse top-down as far as TS syntax which is what motivated me to switch to shift-reduce parsing in the first place.
Haven't fully considered if in every instance a regex control character would interfere with matching it's preceded by an escape or if it's contextual, but I can imagine top down could be problematic in those scenarios.
I'm not sure of anything, but the results so far have been extremely promising. My overall strategy is to first split on "or" operators into a union, and then recursively parse each branch in the union by searching for the leftmost unescaped quantifier or wildcard, performing a prefix match, and recursing on the remaining portions of the match string and regex.
I think this sort of "two-pass" approach (first a structural split into a union, then a sequential parse) has a good chance of working.
But what if you have an escaped foo|\||bar
? It doesn't seem like you'd be able to recover from this top down.
It seems quite similar to the issues I eventually ran into with a similar approach parsing TS syntax at a type-level.
Escape-pruning has proven way easier than anticipated, and I'm fairly sure I can easily force the union-split to ignore escaped operators (and prune the escapes).
Remember, though, that i'm only defining a generic type that represents the regex match with conditional expansion. This gives me a lot more tools in my toolbox than if I were writing a first-class type.
@Oblarg Yes I'm aware, ArkType's validation works quite similarly!
I'd have to look into it more. It may be possible if in regex every control character would be directly preceded by an escape character to filter those out somehow, whereas if you have to be able to parse quoted strings top-down definitely can't work.
My intuition is still that you will hit cases where it falls apart though. It doesn't seem like you'd ever be able to handle grouping- matching sets of parens for an arbitrary expression like (foo(bar)*)|(baz)*
definitely seems impossible?
It was a really fun journey to explore all of this at a type-level so I don't want to deprive you of that. Just consider that likely, a top-down approach will have inherent limits that a bottom up approach will not, so if you start to run into those, it may be worth considering using a shift-reduce strategy like the one I mentioned.
I already have grouping working in my proof of concept, though only (so far) for the “or” operator. I suspect it will work for the others.
I'd say just try handling this test case in particular: (foo(bar|quux)*)|(baz)*
If it is possible, I'd definitely be interested to see what that would look like.
I’ll give it a shot when I can!