TypeScript
Regex-validated string types (feedback reset)
This is a pickup of #6579. With the addition of #40336, a large number of those use cases have been addressed, but possibly some still remain.
Update 2023-04-11: Reviewed use cases and posted a write-up of our current evaluation
Search Terms
regex string types
Suggestion
Open question: For people who had upvoted #6579, what use cases still need addressing?
Note: Please keep discussion on-topic; moderation will be a bit heavier to avoid off-topic tangents
Examples
(please help)
Checklist
My suggestion meets these guidelines:
- [?] This wouldn't be a breaking change in existing TypeScript/JavaScript code
- [?] This wouldn't change the runtime behavior of existing JavaScript code
- [?] This could be implemented without emitting different JS based on the types of the expressions
- [?] This isn't a runtime feature (e.g. library functionality, non-ECMAScript syntax with JavaScript output, etc.)
- [?] This feature would agree with the rest of TypeScript's Design Goals.
Use case 1, URL path building libraries,
/*snip*/
createTestCard : f.route()
.append("/platform")
.appendParam(s.platform.platformId, /\d+/)
.append("/stripe")
.append("/test-card")
/*snip*/
These are the constraints for .append(),
- ✔️ Must start with leading forward slash (/)
- ❌ Must not end with trailing forward slash (/)
- ❌ Must not contain colon character (:); it is reserved for parameters
- ❌ Must not contain two, or more, forward slashes consecutively (//)
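For reference, here is a minimal sketch of how close template literal types plus a conditional type can get to these four constraints today. `ValidSegment`, `isValidSegment`, and this `append` signature are hypothetical names made up for illustration, not the library's API, and the sketch needs a generic parameter on every function that accepts a segment.

```typescript
// Hypothetical sketch: none of these names come from the library discussed above.

// Resolves to S when the segment satisfies all four constraints, otherwise never.
type ValidSegment<S extends string> =
  S extends `/${string}`                          // must start with "/"
    ? S extends `${string}/` ? never              // must not end with "/"
    : S extends `${string}:${string}` ? never     // must not contain ":"
    : S extends `${string}//${string}` ? never    // must not contain "//"
    : S
    : never;

// Runtime counterpart of the same four checks.
function isValidSegment(segment: string): boolean {
  return (
    segment.startsWith("/") &&
    !segment.endsWith("/") &&
    !segment.includes(":") &&
    !segment.includes("//")
  );
}

// The intersection makes an invalid literal collapse to never at the call site.
function append<S extends string>(base: string, segment: S & ValidSegment<S>): string {
  if (!isValidSegment(segment)) {
    throw new Error(`Invalid path segment: ${segment}`);
  }
  return base + segment;
}

const route = append(append("", "/platform"), "/stripe");
// append("", "/platform/") and append("", "/a:b") are rejected by the type checker.
```

The type only helps for string literals; anything typed as plain `string` still has to go through the runtime check.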
Use case 2,
- ❌ Hexadecimal/binary/decimal/etc. strings of non-trivial length (explosion of union types)
Use case 3, safer RegExp constructor (and similar functions?),
new(pattern: string, flags?: PatternOf</^[gimsuy]*$/>): RegExp
- ❌ flags should only contain the characters g, i, m, s, u, y
- ❌ Each character should only be used once (To be fair, this condition would be hard for regexes, too, requiring negative lookahead or many states)
- ❌ Characters can be specified in any order
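As a rough sketch of what these three conditions look like when written out, the recursive conditional type below walks the flags one character at a time, and the runtime function uses the negative lookahead mentioned above. `Flag`, `ValidFlags`, and `hasValidFlags` are hypothetical names, and the flag set is limited to the six characters listed in this comment.

```typescript
// Hypothetical sketch: ValidFlags and hasValidFlags are illustrative names only.

type Flag = "g" | "i" | "m" | "s" | "u" | "y";

// Walks the string one character at a time, tracking the flags already seen.
type ValidFlags<S extends string, Seen extends string = never> =
  S extends ""
    ? true
    : S extends `${infer Head}${infer Tail}`
      ? Head extends Flag
        ? Head extends Seen
          ? false                           // repeated flag
          : ValidFlags<Tail, Seen | Head>
        : false                             // character outside g, i, m, s, u, y
      : false;                              // not a string literal

// Compile-time checks: each assignment only type-checks if the result is as stated.
const anyOrder: ValidFlags<"yig"> = true;
const repeated: ValidFlags<"gg"> = false;
const unknownFlag: ValidFlags<"gx"> = false;

// Runtime equivalent; the negative lookahead rejects any repeated character.
function hasValidFlags(flags: string): boolean {
  return /^(?!.*(.).*\1)[gimsuy]*$/.test(flags);
}
```

The type-level version yields `true` or `false` rather than a usable string type, so it still has to be wired into a generic parameter to constrain an argument.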
Template string types can only be used for this purpose inside a conditional type, so they act as a "type validator" rather than a "type" in their own right. They also focus more on manipulating strings, which I think is a different design goal from regex-validated types.
It's doable to use conditional types to constrain parameters, for example taken from https://github.com/microsoft/TypeScript/issues/6579#issuecomment-710776922
declare function takesOnlyHex<StrT extends string> (
hexString : Accepts<HexStringLen6, StrT> extends true ? StrT : {__err : `${StrT} is not a hex-string of length 6`}
) : void;
However, I think this pattern has several issues:
- It's not a common pattern, and cumbersome to repeat every time.
- The type parameter should be inferred, but was used in a condition before it "can" be inferred, which is unintuitive.
- TypeScript still doesn't support partial generic inference (#26349), so it may be hard to use this pattern with more generic parameters.
Would this allow me to define type constraints for String to match the XML specification's Name constructs (short summary) and QNames by expressing them as regular expressions? If so, I am all for it :-)
@AnyhowStep It isn't the cleanest, but with conditional types now allowing recursion, it seems we can accomplish these cases with template literal types: playground link
We can have compile-time regular expressions now. But anything requiring conditional types and a generic type param to check is a non-feature to me.
(Well, non-feature when I'm trying to use TypeScript for work. All personal projects have --noEmit enabled because real TS programmers execute in compile-time)
Open question: For people who had upvoted #6579, what use cases still need addressing?
We have a strongly-typed filesystem library, where the user is expected to manipulate "clean types" like Filename or PortablePath versus literal strings (they currently obtain those types by using the as operator on literals, or calling a validator for user-provided strings):
export interface PathUtils {
cwd(): PortablePath;
normalize(p: PortablePath): PortablePath;
join(...paths: Array<PortablePath | Filename>): PortablePath;
resolve(...pathSegments: Array<PortablePath | Filename>): PortablePath;
isAbsolute(path: PortablePath): boolean;
relative(from: PortablePath, to: PortablePath): PortablePath;
dirname(p: PortablePath): PortablePath;
basename(p: PortablePath, ext?: string): Filename;
extname(p: PortablePath): string;
readonly sep: PortablePath;
readonly delimiter: string;
parse(pathString: PortablePath): ParsedPath<PortablePath>;
format(pathObject: FormatInputPathObject<PortablePath>): PortablePath;
contains(from: PortablePath, to: PortablePath): PortablePath | null;
}
I'm investigating template literals to remove the as syntax, but I'm not sure we'll be able to use them after all:
- They don't raise errors very well
- Interfaces are a pain to type (both declaration and implementation would have to be generics)
- More generally, we would have to migrate all our existing functions to become generics, and our users would have to do the same
The overhead sounds overwhelming, and makes it likely that there are side effects that would cause problems down the road - causing further pain if we need to revert. Ideally, the solution we're looking for would leave the code above intact, we'd just declare PortablePath differently.
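To make the current situation concrete, here is a minimal sketch of the tagged-string pattern described above, where literals need `as` and user input goes through a validator. The type definition, the validation rule, and the simplified `dirname` are assumptions for illustration, not the library's actual code.

```typescript
// Hypothetical sketch of the tagged-string pattern; not the library's actual definitions.

// The phantom property makes PortablePath incompatible with plain strings.
type PortablePath = string & { readonly __portablePath: true };

// Validator for user-provided strings (the rule used here is an assumption).
function toPortablePath(input: string): PortablePath {
  if (input.includes("\\")) {
    throw new Error(`Not a portable path: ${input}`);
  }
  return input as PortablePath;
}

// Simplified dirname, only to show a function that takes and returns the tagged type.
function dirname(p: PortablePath): PortablePath {
  const index = p.lastIndexOf("/");
  if (index === -1) return "." as PortablePath;
  return (index === 0 ? "/" : p.slice(0, index)) as PortablePath;
}

// Literals need a cast (or the validator) before they can be passed along.
const fromLiteral = dirname("/usr/local/bin" as PortablePath);
const fromInput = dirname(toPortablePath("/etc/hosts"));
// dirname("/usr/local/bin") is a type error: string is not assignable to PortablePath.
```

A regex-validated type would ideally let the last commented-out call compile for literals that match, without turning `dirname` into a generic.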
@arcanis it really sounds like you want nominal types (#202), since even if regex types existed, you'd still want the library consumer to go through the validator functions?
I have a strong use case for Regex-validated string types. AWS Lambda function names have a maximum length of 64 characters. This can be manually checked in a character counter but it's unnecessarily cumbersome given that the function name is usually composed with identifying substrings.
As an example, this function name can be partially composed with the new work done in 4.1/4.2. However there is no way to easily create a compiler error in TypeScript since the below function name will be longer than 64 characters.
type LambdaServicePrefix = 'my-application-service';
type LambdaFunctionIdentifier = 'dark-matter-upgrader-super-duper-test-function';
type LambdaFunctionName = `${LambdaServicePrefix}-${LambdaFunctionIdentifier}`;
const lambdaFunctionName: LambdaFunctionName = 'my-application-service-dark-matter-upgrader-super-duper-test-function';
This StackOverflow Post I created was asking this very same question.
With the continued rise of TypeScript in back-end related code, statically defined data would be a likely strong use case for validating the string length or the format of the string.
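For what it's worth, a length limit can be approximated today with a recursive conditional type that counts characters with a tuple. The sketch below is one possible shape of that workaround; `FitsWithin` and `fitsWithin` are hypothetical names, and it assumes a TypeScript version with tail-recursive conditional types (4.5 or later) so that a 69-character string does not hit the recursion limit.

```typescript
// Hypothetical sketch: FitsWithin and fitsWithin are illustrative names only.

// Consumes one character per step; fails once Max characters have been
// consumed and some of the string is still left over.
type FitsWithin<S extends string, Max extends number, Count extends unknown[] = []> =
  S extends ""
    ? true
    : Count["length"] extends Max
      ? false
      : S extends `${string}${infer Rest}`
        ? FitsWithin<Rest, Max, [...Count, unknown]>
        : false;

type LambdaServicePrefix = "my-application-service";
type LambdaFunctionIdentifier = "dark-matter-upgrader-super-duper-test-function";
type LambdaFunctionName = `${LambdaServicePrefix}-${LambdaFunctionIdentifier}`;

// Compile-time checks: each assignment only type-checks if the result is as stated.
const prefixFits: FitsWithin<LambdaServicePrefix, 64> = true;
const fullNameFits: FitsWithin<LambdaFunctionName, 64> = false; // 69 characters

// Runtime equivalent for names that are only known at run time.
function fitsWithin(name: string, max: number): boolean {
  return name.length <= max;
}
```

This produces a boolean rather than a type that rejects the assignment directly, which is part of why it feels like a workaround rather than a feature.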
TypeScript supports literal types, template literal types, and enums. I think a string pattern type is a natural extension that allows for non-finite value restrictions to be expressed.
I'm writing type definitions for an existing codebase. Many arguments and properties accept strings of a specific format:
- ❌ Formatted representation of a date, eg "2021-04-29T12:34:56"
- ❌ Comma-separated list of integers, eg "1,2,3,4,5000"
- ❌ Valid MIME type, eg "image/jpeg"
- ❌ Valid hex colour code, already mentioned several times
- ❌ Valid IPv4 or IPv6 address
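Without regex types, formats like these are usually handled with a runtime check that narrows to a tagged type. The sketch below shows that approach for two of the formats; `Branded`, `matching`, and the specific patterns are assumptions made up for illustration, and nothing here checks literals at compile time.

```typescript
// Hypothetical sketch: the brands, patterns, and helper names are assumptions.

type Branded<Name extends string> = string & { readonly __format: Name };

// Builds a type guard that narrows a plain string to a branded format type.
function matching<Name extends string>(pattern: RegExp) {
  return (value: string): value is Branded<Name> => pattern.test(value);
}

const isIntegerList = matching<"IntegerList">(/^\d+(,\d+)*$/);
const isHexColour = matching<"HexColour">(/^#([0-9a-f]{3}|[0-9a-f]{6})$/i);

// Only accepts strings that have already passed the integer-list check.
function sum(list: Branded<"IntegerList">): number {
  return list.split(",").reduce((total, item) => total + Number(item), 0);
}

const input: string = "1,2,3,4,5000";
const total = isIntegerList(input) ? sum(input) : NaN;
```

The guard has to run for every value, including literals whose format is obvious at a glance, which is the gap a pattern type would close.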
I'd like to argue against @RyanCavanaugh's claim in the first post saying that:
a large number of those use cases have been addressed, but possibly some still remain.
As it stands presently TypeScript can't even work with the following type literal:
type Digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9;
type Just5Digits = `${Digit}${Digit}${Digit}${Digit}${Digit}`;
Throwing an "Expression produces a union type that is too complex to represent.(2590)" error.
That's the equivalent of the following regex:
/^\d{5}$/
Just 5 digits in a row.
Almost all useful regexes are more complicated than that, and TypeScript already gives up with that, hence I'd argue the opposite of that claim is true: a small number of use cases have been addressed and the progress with template literals has been mostly orthogonal really.
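For completeness, the five-digit case can be checked without building the 100,000-member union by recursing over the string instead, at the cost of the generic-parameter pattern criticised earlier in this thread. This is a minimal sketch with hypothetical names (`IsDigits`, `isFiveDigits`); it uses string digits rather than the numeric `Digit` above.

```typescript
// Hypothetical sketch: IsDigits and isFiveDigits are illustrative names only.

type Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";

// Checks one character per step instead of materialising every combination.
type IsDigits<S extends string, Length extends number, Count extends unknown[] = []> =
  S extends `${infer Head}${infer Tail}`
    ? Head extends Digit
      ? IsDigits<Tail, Length, [...Count, unknown]>
      : false
    : Count["length"] extends Length
      ? true
      : false;

// Compile-time checks: each assignment only type-checks if the result is as stated.
const fiveDigits: IsDigits<"12345", 5> = true;
const fourDigits: IsDigits<"1234", 5> = false;
const notDigits: IsDigits<"12a45", 5> = false;

// The runtime regex this is standing in for.
function isFiveDigits(value: string): boolean {
  return /^\d{5}$/.test(value);
}
```

The contrast in size between the one-line regex and the recursive type is roughly the point being made above.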
What about validation of JSON schema's patternProperties regex in TypeScript interfaces for the parsed object? This is a PERFECT application of the regex-validated string feature.
Possible syntax using a matchof keyword:
import { IJSONSchema, IJSONSchemaMap } from 'vs/base/common/jsonSchema';
export const UnscopedKeyPtn: string = '^[^\\[\\]]*$';
export type UnscopedKey = string & matchof RegExp(UnscopedKeyPtn);
export const tokenColorSchema: IJSONSchema = {
properties: {},
patternProperties: { [UnscopedKeyPtn]: { type: 'object' } }
};
export interface ITokenColors {
[colorId: UnscopedKey]: string;
}
I just want to add to the need for this because template literals do not behave the way we think explicitly -
type UnionType = {
kind: `type1_${string}`,
one: boolean;
} | {
kind: `type1_${string}_again`,
two: string;
}
const union: UnionType = {
// ~~~~~ > Error here -
/**
Type '{ kind: "type1_123"; }' is not assignable to type 'UnionType'.
Property 'two' is missing in type '{ kind: "type1_123"; }' but required in type '{ kind: `type1_${string}_again`; two: string; }'.ts(2322)
*/
kind: 'type1_123',
}
This shows that template literal types are not mutually exclusive: one can be a subset of another even when that is not the intent. A regex would let us put a $ at the end to denote the end of the string, which would help discriminate clearly between the constituent types of this union.
(CC @Igmat) It occurs to me that there's a leaning towards using regex tests as type literals in #6579, i.e.
type CssColor = /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: CssColor = '#000000'; // OK
It seems that regexes are usually interpreted as values by the TS compiler. When used as a type, this usually throws an error that keeps types and values as distinct as possible. What do you think of:
- using a *of keyword to cast regex values into a regex-validated type (maybe matchof)
- having a keyword check for conditional types (maybe matches)
type CssColor = matchof /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: CssColor = '#000000'; // OK
~~Editing this to note something - the RegExp.prototype.test method can accept numbers and other non-string primitives. I think that's a neat feature. If people want to strictly validate strings, they can use an intersection type with string. 😄~~
TL;DR: regex literal types aren't intuitively and visibly types without an explicit regex-to-type cast. Can we propose that?
I'm not sure what the benefit of a separate keyword is here. There doesn't seem to be a case where it could be ambiguous whether the regex is used as a type or as a value, unless I'm missing something? I think https://github.com/microsoft/TypeScript/issues/6579#issuecomment-261519733 and the replies below it already sketch out a syntax that hits the sweet spot of being both succinct and addressing all the use cases.
Regarding the intersection, the input to RegExp.prototype.test is always turned into a string first, so that seems superfluous.
Good to know about RegExp.prototype.test.
The ambiguity seems straightforward to me. As we know, TypeScript is a JS superset & regex values can be used as variables.
To me, a regex literal is just not an intuitive type - it doesn't imply "string that matches this regexp restriction". It's common convention to camelcase regex literals and add a "Regex" suffix, but that variable name convention as a type looks really ugly:
export const cssColorRegex: RegExp = /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: cssColorRegex = '#000000'; // OK
// ^ lc 👎 ^ two options:
// - A. use Regex for value clarity but type confusion or
// - B. ditch Regex for unclear value name but clear type name
The original proposal does suggest JSON schemas, which would use the regex as both a type and a value (if implemented).
Perhaps I wasn't very clear, there doesn't seem to be a case where it would be ambiguous for the compiler whether a regex is a type or a value. Just as you can use string literals both as values and as types:
const foo = "literal"; // Used as a value
const bar: "literal" = foo; // Used as a type
The exact same approach can be applied for regex types without ambiguity.
My concern is that the regex means two different things in the two contexts - literal vs "returns true from RegExp.test method". The latter seems like a type system feature exclusively - it wouldn't be intuitive unless there's syntax to cast the regex into a type
There is also the issue of regex literals and regex types possibly being used as superclasses:
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/@@match
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/@@split
If all regex literals and type variables are cast into validators implicitly without a keyword, how do we use RegExp interfaces and regex literals with optional methods as an object type?
To me, context loss in https://github.com/microsoft/TypeScript/issues/41160#issuecomment-853419095 is enough reason to add a keyword, but this is another reason. I'm unsure of the name I suggested but I do prefer the use of an explicit type cast.
I would love this! I've had tons of issues that could be easily solved with RegEx types.
For example, a very basic IETF language tag type that accepts strings like "en-GB" or "en-US" but rejects strings that don't match the casing correctly.
Using template literals (doesn't work):
How it could be done easily with RegEx types:
export type CountryCode = /^[a-z]{2}-[A-Z]{2}$/;
(I know that technically you can represent this sort of type, but it's just a simple example)
I was thinking about this a bit, while working on another PR which implements an intrinsic utility function.
I have not read through this or the previous thread very thoroughly, so forgive me if this doesn't line up with the direction of the conversation, but I'd love to hear what people think of this proposal.
I believe the heart of the issue here is having the ability to validate string and number literals. This is a slightly different take, but here is a proposal for an intrinsic utility type which could provide that functionality.
(Note: One other advantage that this has over template literals by themselves is the ability to provide custom error messages.)
Features
- Supports generating regex using template literals (see example Email generic type)
- Allows constraint definition for number and/or string
- Custom error messages for one or more validation patterns
Example
// Utility Definition
/**
* Actual name TBD
* @param Regex - String or template literal for regex (in the same format as new RegExp(`<Regex>`))
* @param Flags - Optionally, provide regex flags (as in new RegExp('', '<flags>'))
* @param Constraint - Optionally, define initial constraint for value (cannot be a literal)
* @param ErrorMessage - Optionally, define a message to display to make the error more understandable
*/
type Validated<Regex extends string, Flags extends string = '', Constraint extends string | number = string | number, ErrorMessage extends string = never> = {
intrinsic;
}
// ----
// Example Usage
// Intentionally simple contrivance, for demo purposes
type Email<Domain extends string = '\\S+\\.\\S+', Message = 'Invalid email address format!'> =
Validated<`\\S+@${Domain}`, '', string, Message>
// Validate against default email pattern + specific domain (intersection applies both validators)
type InternalEmail = Email & Email<'mycompany\\.com', 'Must be company email address!'>
// Example Implementation
let email: InternalEmail;
email = 'bad [email protected]' as const; // Fails with "Validation Error: Invalid email address format!"
email = '[email protected]' as const; // Fails with "Validation Error: Must be company email address!"
email = 3; // Fails with "Validation Error: Must be string literal!" (Because Constraint doesn't match)
Notes
- Union of Validated means validate against any - functions like Array.some() in terms of logic
- Need to determine diagnostic here. Possibly simple single message like, "Did not match any of the possible validation schema"
- Intersection means validate against all
- Should probably check against all regardless of failure of one, in order to add diagnostics for each failure (except in the case where Constraint is violated, in which case, only one diagnostic is added)
- Resolved constraint is the widest union of the constraints of its constituents
- All members of intersection must be type generated by Validated utility
- Defining a Constraint allows it to bail out early if the wrong type of literal is provided. It also allows configuring the resolved type.
- It's possible that as const should be implied during assignment. Interested in hearing if there are any downsides.
- Need to determine how to handle a Regex param that is a union of strings. Presumably, we'd simply treat it as an array of regexes with the same error message and flags.
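To illustrate the union and intersection semantics in these notes, here is a small runtime sketch. It is only an emulation: `Validator`, `validateAll`, and `validateAny` are hypothetical names, the patterns are anchored as an assumption, and the addresses are made-up examples rather than the ones from the demo above.

```typescript
// Hypothetical runtime sketch of the proposed semantics; nothing here exists in TypeScript.

interface Validator {
  pattern: RegExp;
  message: string;
}

// Intersection: every validator must pass; all failure messages are collected.
function validateAll(value: string, validators: Validator[]): string[] {
  return validators.filter((v) => !v.pattern.test(value)).map((v) => v.message);
}

// Union: at least one validator must pass, like Array.prototype.some().
function validateAny(value: string, validators: Validator[]): boolean {
  return validators.some((v) => v.pattern.test(value));
}

// Patterns are anchored here (an assumption); the addresses below are made up.
const email: Validator = {
  pattern: /^\S+@\S+\.\S+$/,
  message: "Invalid email address format!",
};
const internal: Validator = {
  pattern: /^\S+@mycompany\.com$/,
  message: "Must be company email address!",
};

console.log(validateAll("someone@mycompany.com", [email, internal]));
console.log(validateAll("someone@example.org", [email, internal]));
console.log(validateAll("not an email", [email, internal]));
```

Collecting every failing message, rather than stopping at the first, mirrors the note about adding a diagnostic for each failure.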
Please discuss, and let me know of any problems or suggestions. If people see value to this, I'll write the PR.
For the compiler folks
Initial thoughts:
// Utility produces this
interface ValidatedLiteralType {
constraint: Type /* string | number, or one of the two */
regex: RegExp[] /* Array of compiled regex */
errorMessage?: StringLiteralType
}
The above proposal has good ideas in mind, but similar to some other discussions in this thread and the one prior, it seems to fall on the very verbose side.
type InternalEmail = Email & Email<"literal", ...>;
Comparing this to the existing literal value syntax, the additional intersection seems redundant.
type Foo = string & "literal"; // same as type Foo = "literal";
Likewise for the syntax, this comment by Ihor in the previous thread shows different use cases with the regular regex syntax which already covers both disambiguation and flags.
type CssColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
Perhaps I'm missing something, currently I don't see what the generic adds over this. No other types currently support custom error messages natively out of the box (though there are workarounds you can use), so that would probably need to be a separate proposal by itself.
Thanks for the reply! There are several significant differences:
1. Supports generating regex using template literals (see example Email generic type)
2. Allows constraint definition for number and/or string
3. Custom error message
4. Provides for multiple regex patterns within a single type (from an internal compiler perspective)
These are strong distinctions, and ones which I believe have a bit of advantage over what you've mentioned. I think 1 is the most pronounced in terms of advantage.
4 is good for overall compiler performance. I'm not entirely sure about it; however, it could open the door to DRY composite validated types, if you have string literals in regex format stored in separate types and you want to re-use them across different validators with a single message.
Regarding 2 (constraint), for example, in the proposal:
type CssColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
What is the actual constraint of CssColor? I assume that this pattern proposal must match a string constraint.
Consider:
type ThreeDigitCode = /^\d{3}$/
// Would this work? If it did, what resolved constraint would code be? In other words, would it be treated as number?
// Technically, if it supports string | number, it shouldn't be narrowed from assignment, so you'd be left with string | number
const code:ThreeDigitCode = 345 as const;
As for error messages in a different proposal, I actually believe that this helps greatly improve the strength and value of the proposal. Regex validation without a discernable message is going to prove frustrating for users. Especially when they are implemented in another library or piece of code that you've not personally written.
In terms of verbosity, it doesn't seem so bad to me, and the ability to provide the extra parameters seems worth it.
type ThreeDigitCode = Validated<'^\\d{3}$', '', number>
I suppose it's also worth mentioning:
- If it's green-lit, I'll actually write it 😅
Comparing this to the existing literal value syntax, the additional intersection seems redundant.
I missed this comment. I used the intersection to demonstrate using multiple validators and the power of generics with template literal support. See the Example Implementation in my demo code and note:
- Intersection means validate against all
This allows a specific error message based on which condition is violated
So one could implement an entire custom JSON schema validator in TypeScript? Interesting..
@arcanis it really sounds like you want nominal types (#202), since even if regex types existed, you'd still want the library consumer to go through the validator functions?
Frankly, even if it only worked with literal types I'd be fine with that. We already have nominal types (of sort) by using tagged strings. Our problem is more: "how can we accept literals as input", with an optional "and validate them".
Even something as simple as:
type PortablePath = TaggedPortablePath | literal_string;
That would still be better since at least we wouldn't have to write as PortablePath everywhere we use literals (which is a lot, especially inside our tests). Of course the best would be to also validate them:
type PortablePath = TaggedPortablePath | literal_string(/^[^/]*$/);
But that is secondary to being able to express types that specifically target literals (because being a literal somewhat encodes that the user intends to pass this value, so checking matters less than it does for arbitrary values - even if it would certainly be better to have both).
As for @nonara's proposal, it sounds like exactly what we'd need, both for literals and validation. I don't mind much about verbosity, since most of it would be abstracted in intermediary types anyway. The as const would be a bit annoying though - is it necessary? With the template string improvements in 4.3, shouldn't TS preserve the string type as static anyway?
The as const would be a bit annoying though - is it necessary?
Probably not necessary, unless anyone can provide reason for why it should be.
To me, i18n is a big reason to avoid custom error messages, at least until TypeScript adds some native consistent way to internationalise those for users of other languages.
@nonara Regarding constraints, personally I would expect regex validated literal types to always be strings. That's where the proposal originally started out (and why I'm following it), but that is highly subjective and I can see some arguments for the other side too.
The reason why I personally feel this way is the following. Natively, JavaScript only supports regex on strings. That can be worked around in one way or another if you'd like, but since JavaScript is the underlying language for TypeScript, matching its intuition can lower the number of foot-guns.
In addition to that, adding regex support for numbers creates a considerable amount of ambiguity that simply didn't exist before. A good example is non-decimal bases. Is const foo: NumberLiteral</\d{3}/> = 0b1111; valid? Would it be possible to only allow hex literals in a context where that makes sense? Or do you want to match whatever the number evaluates to instead? Likewise for floating-point errors, would you expect const foo: NumberLiteral</0.3/> = 0.1 + 0.2; to be an error or not?
Without taking a side on any of those questions, I hope you can see that numbers require far more consideration than boring old strings in this regard. Regex on strings is already a hard problem, but at least it's a fairly well-known problem, and that's why I'd prefer to have that in type checking.
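As a point of reference for those questions, this is what plain JavaScript does at runtime today: RegExp.prototype.test converts its argument to a string first, so a number is matched against its decimal string form. The helper name `testCoerced` is hypothetical, and this says nothing about what a type-level design should do.

```typescript
// RegExp.prototype.test converts its argument to a string before matching.
// The cast exists only because the TypeScript signature accepts string.
function testCoerced(pattern: RegExp, value: unknown): boolean {
  return pattern.test(value as string);
}

const binaryLiteral = testCoerced(/\d{3}/, 0b1111);  // matched against "15"
const floatSum = testCoerced(/0.3/, 0.1 + 0.2);      // matched against "0.30000000000000004"
const threeDigits = testCoerced(/^\d{3}$/, 345);     // matched against "345"

console.log({ binaryLiteral, floatSum, threeDigits });
```

At runtime the binary literal fails the three-digit pattern because only its evaluated value, 15, is visible, while the floating-point sum still matches the unanchored pattern.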
i18n is a big reason to avoid custom error messages
I hear you. However, something to consider. In the event it fails:
Without custom message:
Validation failed for YourType (in your language)
With custom message:
Validation failed for YourType: (in your language) + <Custom message> (single language)
In these scenarios, you lose nothing with the latter, as i18n translation is provided for base message. You do, however, gain some information. With respect, that argument is like prescribing not adding JSDoc documentation or comments due to lack of i18n. It's better to have information which may be marginally less than ideal in some scenarios than none at all.
If the proposal entirely replaced the base message, I'd agree, but given that it simply adds information, I don't see this being a negative.
Beyond that, i18n would actually still be possible if set up properly.
Don't get me wrong, I actually like the syntax and I feel that it's more TypeScript-ish than the current proposal, which still confuses me to some degree.