sdk
sdk copied to clipboard
(Dart.next) RegExp should support a `const` constructor
Too often in code reviews (just saw @lrhn's code review) we are instructing developers to avoid creating new RegExp('...') as a class-level or local field (I've even seen it in tight-loops!). Can the RegExp class support a const constructor that is canonicalized?
assert(identical(const RegExp('a-z'), const RegExp('a-z'));
(Even if the implementation for the VM is not-so, we'd like this in dart2js badly)
/cc @rakudrama
No, sadly it can't support that.
The problem is that the RegExp constructor parses the string and throws if it isn't valid.
That's not something a const constructor can do. If instead the constructor only stores the string, and using the RegExp parses it, we have the problem that a const object can't update its state, so we have to parse it every time we use the RegExp (or use an Expando to store the extra state, which is an option, but not a very nice one for something that's not a big problem).
RegExp originally had a const constructor, but it was cheating and not really being const, which was a problem for some implementations, so we fixed the glitch.
What would solve this problem is C-like local static variables. Then you could do:
String theFoo(String x) {
static re = new RegExp("flimflamflupper");
if (re.hasMatch(x)) return "Yey!";
return "Nay!";
}
and the re variable initializer would only be executed once, the first time the function is executed.
I would support that :)
It's equivalent to a lazily initialized global/static variable (well, almost, the initializer expression would be evaluated in the function's scope the one time it is evaluated).
We could at least make the cost of repeated regexps cheaper. I did a small experiment and added a one-element cache, and my small benchmark improved from 3secs to 1.5secs. (Moving the regexp out of the loop yields 1sec).
realRun(list) {
var counter = 0;
for (String str in list) {
var re = new RegExp(".*33.*");
var match = re.firstMatch(str);
if (match != null) counter++;
}
return counter;
}
run(list) {
try {
return realRun(list);
} catch(e) {
}
}
main(args) {
var list = [];
for (int i = 0; i < 10000; i++) {
list.add(i.toString());
}
var sw = new Stopwatch()..start();
for (int i = 0; i < 1000; i++) { run(list); }
print(sw.elapsedMilliseconds);
}
$ git diff
diff --git a/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart b/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart
index 35a42d1c..379fcb3 100644
--- a/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart
+++ b/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart
@@ -79,12 +79,29 @@ class JSSyntaxRegExp implements RegExp {
bool get _isMultiLine => JS("bool", "#.multiline", _nativeRegExp);
bool get _isCaseSensitive => JS("bool", "!#.ignoreCase", _nativeRegExp);
+ static String lastSource = "";
+ static String lastModifiers = "";
+ static RegExp lastRegExp = JS('', r'new RegExp("", "")');
+
static makeNative(
String source, bool multiLine, bool caseSensitive, bool global) {
- checkString(source);
String m = multiLine == true ? 'm' : '';
String i = caseSensitive == true ? '' : 'i';
String g = global ? 'g' : '';
+ String modifiers = "$m$i$g";
+
+ if (lastSource == source && lastModifiers == modifiers) {
+ return lastRegExp;
+ }
+ var result = makeCheckedNative(source, modifiers);
+ lastSource = source;
+ lastModifiers = modifiers;
+ lastRegExp = result;
+ return result;
+ }
+
+ static makeCheckedNative(String source, String modifiers) {
+ checkString(source);
// We're using the JavaScript's try catch instead of the Dart one to avoid
// dragging in Dart runtime support just because of using RegExp.
var regexp = JS('',
@@ -95,8 +112,8 @@ class JSSyntaxRegExp implements RegExp {
} catch (e) {
return e;
}
- })(#, # + # + #)''',
- source, m, i, g);
+ })(#, #)''',
+ source, modifiers);
if (JS('bool', '# instanceof RegExp', regexp)) return regexp;
// The returned value is the JavaScript exception. Turn it into a
// Dart exception.
/cc @sigmundch
@lrhn @floitschG - apologies that I don't have the full background, but have we considered letting the compiler fail with an error in cases where a const RegExp parsing would throw? It means duplicating the parsing/error checking of regexp in the compiler, but it's probably doable.
@rakudrama - could we treat new RegExp(constant) as a value we can GVN? In other words, in the example from Florian above, could we reach a point where we can do code motion and get the regexp out of the loop?
I think we should have RegExp literals. They are compact and can be compiled ahead of time. A compiler implementation can pattern match constructors and optimize them (with unpredictable reliability), but a literal gives a language level guarantee.
@lrhn static should be an expression modifier: static e. It reduces cognitive overload by putting the value next to its use. This relieves you of the hardest part: choosing a name. It is syntax for introducing a new name and lazy initializer in the nearest static scope (class or top level).
if (str.contains(static new RegExp('*.33.*'))) ...
var re = static new RegExp('hipp*o'); // I insist on a name!
@sigmundch We could try GVN but my hopes are not high - you can only hoist a throwing operation out of a loop if it is the first observable effect.
@floitschG Interesting experiment. There is a similar implementation neutral experiment where you import "const_regexp.dart" and it shadows core.RegExp and uses an expando to do the lazy init of the delegate. There will be some difficulty in the runtime (e.g. String.replaceAll(Pattern p, String replacement) special-cases p is RegExp.) A language feature that would help with this is a separate namepace for const and new constructors, so new const_regexp.RegExp is a factory redirecting to core.RegExp.
It's not a problem that String.replaceAll special-cases RegExp, you can just implement RegExp from dart:core. Example: https://dartpad.dartlang.org/3135557b3d3f6d000e6f9163709f6817
I'm personally opposed to baking RegExp grammar into the language grammar. It locks you into one specific RegExp dialect without really giving you anything except static checking. Adding static checking of RegExp constructor calls with String literals should be fairly easy for any source processing tool, the language doesn't need to do it. Dart has raw strings, which makes RegExp writing much easier because you don't have to escape backslashes, so building RegExp literals into the language isn't as big an advantage as it is in, e.g., JavaScript.
The actualy problem that needs to be solved here is a way to only create one RegExp, even if the constructor is called more than once. Static expressions or static declarations could do that.
The workaround is to declare the regexp somewhere else, or use a caching factory function instead of new RegExp.
RegExp doesn't do anything with side-effects, it's just doing some precomputation... exactly the kind of thing that would be perfect for a constexpr... if C++ can do it, why not Dart? :-)
Because Dart's const is not C++'s const. The accidental syntax similarity should not be taken to mean anything at all.
The C++ const is about not having side-effects. It allows some optimizations of the runtime execution (typically moving the computation before or after other computations).
Dart's const is about being compile-time computable. We cannot write Dart code that validates the RegExp grammar without using some constructs that are not allowed in const code (e.g., loops).
Obviously we could special-case the RegExp constructor in the compiler and check anyway, but we have chosen not to do that (any more). It's a feature that the analyzer might want to consider, but it's not something we're going to put into the language.
@lrhn it's not about checking, it's about achieving RegExp caching without manually caching it. Consider code like:
static final reFoo = new RegExp("...");
foo1(s) {
reFoo.match(s);
}
foo2(s) {
const RegExp("...").match(s);
}
Which of these two is easier to read? I would argue that foo2 is easier because regular expression is brought right next to the place which uses it.
Dart const expressions are severely limited to the point that they often can't be used for anything useful in compile time - contradicting their raison d'être to make things compile-time computable. We should learn from languages like D which just allow anything to run in compile time.
I don't disagree with @mraleph. The Dart const design is rather limited. It wasn't designed to make "things compile-time computable" in general, but to carve out a small subset of things that were so simple that making them compile-time computable was trivial. Within that design, we can't do the RegExp constructor because it actually checks the syntax. It is about checking - because const is about more than just ensuring that something is evaluated only once.
The const in Dart has three separate effects - compile-time/single computation, global canonicalization and immutability. In different cases it's being used for just one or two of these. This case is about the "compile-time/single computation". We don't really need (program wide) canonicalization, just that we don't compute the same value more than once at this particular point. If we had a feature that did just that, like the suggested static expr expression, there wouldn't be any need to make the RegExp constructor const, just (static new RegExp(r"...")).match(s).
(As a curiosity, JavaScript originally had a clause that allowed implementations to evaluate a RegExp literal only once and reuse the value. That was dropped in later versions because JavaScript RegExps are not actually immutable).
Dart only has const to address all these things, so that's why every problem is being expressed in terms of how const can solve it if we just extend it a bit more. I'm not generally inclined to extend const (because it's a never ending story unless we do go for "run anything at compile time" - and boy, and I'm going to do compile-time computed Uint8Lists then!).
I would prefer to look for alternatives that directly solve the actual problems, not ways to make more things fit into the const solution.
Another solution is to guarantee that the implementation optimizes simple uses of a new RegExp(string-literal) well enough, so that it doesn't matter that you're technically supposed to create a new object.
We cannot write Dart code that validates the RegExp grammar without using some constructs that are not allowed in const code (e.g., loops).
We can first allow loops in const code, which is something we can do if we only allow loops that have no side-effects, and then we can write the Dart code to validate the RegExp grammar.
Dart's const is about being compile-time computable.
Loops are compile-time computable.
The Dart const design is rather limited.
The whole point of issues like this one and https://github.com/dart-lang/sdk/issues/29276 (which both probably depend on https://github.com/dart-lang/sdk/issues/29277) is to change that design. :-)
I would prefer to look for alternatives that directly solve the actual problems
That would be fine too. :-)
Loops were an example. I'm pretty sure a RegExp parser will require recursive function calls too. At that point, there isn't really anything left. Also, if you think of non-termination as a "side"-effect, loops and recursion both have that, even if their bodies seem safe. So, we're back to either keeping the const subset small or not having a subset - allowing everything as a compile-time evaluated code, with the obvious problem of not being able to determine whether compilation terminates.
Well there are definitely things that shouldn't be allowed, e.g. anything that does system calls.
I agree that allowing features that result in allowing non-termination is unsatisfactory. On the other hand, so is the fact that there's tons of classes that we can't make const just because they happen to take a RegExp, or because in debug mode they try to do an assert that involves a loop, or whatnot.
Maybe I misunderstand the non termination bit but wouldn't the compile time equivalent of a stack overflow error fit ?
Stack overflows at compile-time would definitely be compile-time errors.
There would probably also have to be a "timeout" error if computation took too long, because a loop won't cause stack overflows. The problem is that you have to set a timeout duration that won't actually prevent compiling big correct programs. Determining whether a loop runs forever, or it's just slow, is an undecidable problem, so it will be a heuristic, and it will likely allow significant time (because some real computations might need it). It's really back to the same issue that caused browser to show an "This web page is unresponsive, wait or cancel?" prompt. (Man, those were annoying!)
Now, coming back to this issue, we may actually want to reconsider having a const constructor on RegExp.
It would be a different one from the eager constructor which checks the RegExp syntax up-front, say const RegExp.lazy, and it would just forward to a helper class like https://dartpad.dartlang.org/3135557b3d3f6d000e6f9163709f6817.
The reason to want that is that a const invocation of that constructor can be handled specially by AoT compilers, if provided with a known constant string. The compiler can actually compile the underlying RegExp into native code at compile-time (or just remember which FormatException to throw if the RegExp is invalid).
That said, we can do the same optimization for new RegExp(constantString), so that isn't new. We just can't use the existing constructor in a constant context.
Determining whether a loop runs forever, or it's just slow, is an undecidable problem, so it will be a heuristic, and it will likely allow significant time (because some real computations might need it).
In most cases giving every const call a second to compile should be enough. Additionally, the 1 second could be manually configurable by a parameter that's passed to the compiler so that the few cases where a programmer needs more than one second, they can tell the compiler.
It would be nice if match/hasMatch could also be used in const contexts, such as the following:
lib/env.dart
class Assert {
const Assert(bool condition, [Object? message]) : assert(condition, message);
}
@Assert(bool.hasEnvironment('HTTP_PORT'))
@Assert(RegExp(r'^\d+$').hasMatch(httpPort)) // <--- RegExp Here
const httpPort = String.fromEnvironment('HTTP_PORT');
debug.env
HTTP_PORT=80
For more context on this concept https://github.com/dart-lang/build/issues/3574#issuecomment-2196523217.
The const sub-language does not allow calling methods or members in general.
We could allow some functions like we allow String.length. A constant RegExp should be something that we can emulate at compile-time by having the same RegExp implementation available in the compiler, since the object is immutable and methods have no side effects.
(We'd have to make sure there are no platform differences between RegExp implementations, otherwise compilers/analyzer won't necessarily agree with the runtime behavior.)
That said, there are no current plans to make the RegExp constructor contant. The current behavior does input validation which isn't possible in a const constructor, so the constructor wouldn't actually be a real constant constructor. It would just a constructor which could be called using const, triggering some specialized compiler behavior to simulate the current runtime behavior at compile-time.
That's something we otherwise try to avoid. The Symbol constructor is no longer validating for precisely that reason.
There are a bazillion things that could be nice to be able to do at compile-time. We try to not add arbitrary exceptions and special cases that need individual compiler support to the constant language, to keep it simple and consistently implementable accross all the tools.
In this case, this could work:
@Assert(int.fromEnvironment('HTTP_PORT', defaultValue: -1) >= 0, 'Invalid port: $httpPort')
The const sub-language does not allow calling methods or members in general.
@lrhn Is it hard or any obstacle to declare functions/methods to behave like const constructors (or is it too confusing)?
Something like this:
// This syntax might colide with "return a constant boolean", allowing no-const operations.
// Here we are defining a function that can be called (or not) on const contexts, so only const operations are allowed
bool const equal(String foo, String bar) {
// here we might be in a const context, so we can only run const operations
// and const functions that can be resolved at compile time, like this one
noOp(); // we can call this because is `const`
return foo == bar; // allowed as a const operation
}
void const noOp() {}
void main(List<String> args) {
final notConstFoo = equal('bar', 'test'); // executing in a non const context, fine
final notConstFoo2 = equal(args.first, 'test'); // executing in a non const context, fine
const errorNotConst = equal(args.first, 'test'); // error, can't execute this as const since input is not const
}
const input = 'const';
const result = equal(input, 'test');
Current const constructor behavior for comparison:
class Foo {
// In const constructors we can only call const operations like the function `equals`
// const Foo(String bar) : bar = bar.toUpperCase() ; // error
const Foo(this.bar); // here we can only call const operations like the function `equals`
final bar;
}
void main(List<String> args) {
final notConstFoo = Foo('bar'); // executing in a non const context, fine
final notConstFoo2 = Foo(args.first); // executing in a non const context, fine
const errorNotConst = Foo(args.first); // error, can't execute this as const since bar is not const
}
const input = 'const';
const constFoo = Foo(input); // fine
We could also prohibit calling these functions on non-const contexts.
Edit: Change strawman syntax const returnType functionName() to returnType const functionName(). This avoid ambiguity with "return a constant value", it's the function that is const.
C++ has gone down that road of compile-time function evaluation with its constexpr (the equivalent of Dart's const). It started with a lot of restrictions and gradually became more powerful. There's also a consteval keyword to force an expression to be evaluated at compile time (even if the result is used in a runtime context, e.g. stored in a mutable variable).
to behave like
constconstructors
One argument against const-functions is that they can't have any side-effects or mutable state. Or statements, really.
That means that anything other than the return expression is useless, so we might as well limit it to:
const int add(int a, int b) => a + b;
No loops, don't want divergent compile-time evaluaton.
The one thing it may be able to do is call another const function. But maybe not itself, we don't want unbounded recursion either, that's as good as a loop.
With extension types, you can write "functions" that work like constructors, because they are constructors:
extension type const strDefault(String _) implements String {
const strDefault(String? input, [String defaultValue = ""]) => this._(input ?? defaultValue);
}
...
const String s = strDefault(maybeString, "or not");
The invocation of the constructor must be const, not potentially constant, so the arguments must be constant.
It's remarkably little you can do with constant constructors, that you can't just write directly.
What about a sensible recursion limit that is tweakable ? Is this an issue in c++ ?