regex: support alternate regex engines
Description
I'm currently converting a JSON schema that uses Unicode codepoint properties like \p{L} in regex patterns.
This is one of the assorted JSON Schema regex features that aren't supported by the standard library's re module but are supported by the third-party regex module.
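To illustrate the gap, here's a quick stdlib-only sanity check: the standard library's re module treats `\p{L}` as an invalid escape and refuses to compile it.

```python
import re

# The stdlib re module does not implement Unicode property escapes,
# so \p{L} is rejected at compile time with re.error ("bad escape \p").
try:
    re.compile(r"\p{L}+")
    supported = True
except re.error:
    supported = False

print("stdlib re supports \\p{L}:", supported)  # → False
```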
https://github.com/jcrist/msgspec/blob/bc60e96772c5e8a3babff967d86a9e7dfcdbfb1b/msgspec/_core.c#L22222 is currently hardcoded to use the standard library's regex engine. I'm planning to work around that by fiddling with the sys.modules cache while loading msgspec for the first time, but that kind of workaround is always annoyingly fragile, so it would be nice to have a supported way to say "use regex, not re".
From a usage point of view, the simplest option would be if msgspec itself tried to import regex first, and treated re as a fallback if regex wasn't available.
Alternatively, a module-level API could accept a reference to regex.compile and use it to update the cached re_compile reference in the module state.
What if we supported passing a duck-typed re.Pattern instance to Meta instead of just a string? Then you could use any engine you want that matches the re interface (regex, re, re2, ...) provided you passed the pattern instead of the string:
```python
import regex
from typing import Annotated
from msgspec import Meta

StringMatchingPattern = Annotated[str, Meta(pattern=regex.compile("your-pattern-here"))]
```
This would be a simple fix, happy to push it up if that'd be sufficient for your needs.
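For illustration, the duck-typed surface such an engine would need to provide is small. A minimal sketch of the idea (the `PatternLike` protocol and `matches` helper are hypothetical names, not msgspec API, and stdlib re stands in for any compatible engine):

```python
import re
from typing import Optional, Protocol


class PatternLike(Protocol):
    # Hypothetical minimal interface a pattern engine (re, regex, re2, ...)
    # would need to expose for Meta(pattern=...) to work with it.
    pattern: str

    def search(self, string: str) -> Optional[object]: ...


def matches(pat: PatternLike, value: str) -> bool:
    # Mirrors JSON Schema "pattern" semantics: an unanchored search.
    return pat.search(value) is not None


compiled = re.compile(r"[A-Za-z]+")
print(matches(compiled, "hello"))  # → True
print(matches(compiled, "1234"))   # → False
```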
That would definitely be cleaner.
datamodel-code-generator would need some updates to make the regex engine used configurable there instead, but that feels like a more appropriate place to be specifying that than fiddling with module level state in msgspec.
Naïve question born out of past bad experiences (and lack of testing this particular change): what are the implications for memory and startup performance of importing more modules just to support types? We also always import decimal, datetime, etc. when the extension module loads. Is it feasible to make that dynamic?
They can be quite significant. I think making some of them dynamic is sensible, and it's something I wanted to try with the re module after / if #925 gets merged.
Making it dynamic has an associated overhead though, so if we use any of them inside a hot(ish) path, it might be worth it to pay the cost upfront. Especially for super common modules that trade-off might be worth it.
I've taken a cursory look at the imports, and it seems that the only module that can easily be imported lazily is re. The other ones are all used in performance critical paths.
I think it would still be possible to delay the imports of some of them by determining ahead of time whether they'll be needed (e.g. if you know a type has a uuid.UUID field, you know you need the uuid module, even before you encounter the type during encode/decode), but that would add quite a bit of complexity, which I'm not sure is worth the small reduction in import time?
But maybe there's also a super smart way of doing these lazy-imports with (near) zero cost, that I'm not aware of. I wouldn't know :)