wirefilter icon indicating copy to clipboard operation
wirefilter copied to clipboard

Engine: Regex: Avoid compiling and storing the same regex multiple times

Open arilou opened this issue 1 year ago • 10 comments

Previously, the engine was compiling the same regex multiple times during execution. This redundant compilation process led to unnecessary performance overhead and increased resource consumption.

With this change, we store the compiled regex in a HashSet. If the same regex is used more than once, the engine reuses the previously compiled instance.

arilou avatar Dec 19 '24 08:12 arilou

Hello and thank you for your proposal.

Unfortunately, we aren't going to merge this approach because we would prefer consumers of wirefilter to be in control of the cache they are using.

But the use case your perfectly reasonable :) We have been working internally on a huge refactor which would only parse regex patterns using regex-syntax during filter parsing. Regex would then be compiled during filter compilation which is customizable using the Compiler trait. This is our current plan but we don't have an ETA at the moment. It requires some preliminary steps:

  1. Storing spans for each AST node somehow (we have that part 90% done)
  2. Make the filter compilation fallible, so that regex compilation can return an error using the span of the regex pattern
  3. Improve the compiler trait to support customization of the regex object

marmeladema avatar Dec 19 '24 08:12 marmeladema

Thank you so much for the fast reply! I started going over all the changes in this last sync, I noticed you added wildcard and wildcard strict, is there perhaps some document that summarizes some of the major changes?

As for this PR, I understand and looking forward to work with what you have described. I was wondering if in the mean while if ill put this change under a feature flag will help? Something along the lines of this diff: https://github.com/cloudflare/wirefilter/compare/master...arilou:wirefilter:new_wiz

There are are few other changes we were hoping to present to you, we will be working on them in the coming week and send you a PR to review.

arilou avatar Dec 19 '24 09:12 arilou

I was wondering if in the mean while if ill put this change under a feature flag will help?

I am very sorry about my answer, but no, we aren't going to merge changes that we aren't going to support and use internally.

There are are few other changes we were hoping to present to you, we will be working on them in the coming week and send you a PR to review.

Probably best to open issues about what you want/need first so that we can discuss ahead of time what is the best way forward.

marmeladema avatar Dec 19 '24 09:12 marmeladema

Previously, the engine was compiling the same regex multiple times during execution.

I don't think it does. As long as you store and reuse the Filter instance, the compiled regex should already be cached as part of it.

RReverser avatar Dec 19 '24 13:12 RReverser

That's true but it's fairly common to use the same regex in different and unrelated filters.

marmeladema avatar Dec 19 '24 13:12 marmeladema

Ah yeah, in those scenarios caching is definitely up to the user.

RReverser avatar Dec 19 '24 13:12 RReverser

In our use cases we have many different rules each one is a filter, but from your conversation it sounds like I might be using Wirefilter wrong.

Is not it 1 rule to 1 filter? I mean one could do an OR between all the different conditions to build a single giant filter, but then how would you know which rules were triggered, and the moment a single "sub-rule" is triggered the evaluation would stop.

Am I missing something in the way I currently work with Wirefilter?

arilou avatar Dec 19 '24 13:12 arilou

Regardless a small issue I ran into few days ago is the signature of get_field_value, which uses the wrong lifetime (Not sending a PR since you guys are working internally on your git anyway) but the change I'm suggesting is:

diff --git a/engine/src/execution_context.rs b/engine/src/execution_context.rs
index 32a1907..a226cbe 100644
--- a/engine/src/execution_context.rs
+++ b/engine/src/execution_context.rs
@@ -132,7 +132,7 @@ impl<'e, U> ExecutionContext<'e, U> {
     }

     /// Get the value of a field.
-    pub fn get_field_value<'s>(&self, field: Field<'s>) -> Option<&LhsValue<'_>> {
+    pub fn get_field_value<'s>(&self, field: Field<'s>) -> Option<&LhsValue<'e>> {
         assert!(self.scheme() == field.scheme());

         self.values[field.index()].as_ref()

This allows doing stuff like "copying" a field to another field

arilou avatar Dec 19 '24 14:12 arilou

We ran into some performance issue in our end to end tests with the new wirefilter, I was wondering if it would be possible for you guys to publish a branch with all the git commits rather than a single squashed one. so we can run a bisec easier to try and spot what is causing the issue.

Thanks and happy new year!

arilou avatar Jan 02 '25 17:01 arilou

Unfortunately no, it's too tedious to analyze each commit and verify what information is contained in the commit and if it's publicly shareable.

Alternatively, if you can reproduce the problem with some standalone code / script etc, we can probably run it internally and try to figure out the problem.

That being said, if it's related to regexes, Wirefilter is probably not the problem since we are just a very simple consumer of it. The regex crate has been substantially rewritten in the last year and a half and we have found some performance regressions with it but never had time to reproduce in isolation in publicly shareable test cases.

marmeladema avatar Jan 02 '25 17:01 marmeladema