regexp-tree
regexp-tree copied to clipboard
Optimizer or minifier ?
This is more of a question than an issue.
Is the Optimizer more an optimizer or a minifier? Or both at the same time?
Currently, the Optimizer does some "optimizations" that actually increase the length of the regexp, like:
/aa/
(2) -> /a{2}/
(4)
/a|b/
(3) -> /[ab]/
(4)
/aa+/
(3) -> /a{2,}/
(5)
If it's supposed to be a minifier, we should try to avoid those cases. (Note that this could be tricky sometimes, since an optimization might not seem worth it at first, but work well with following optimizations)
Also, if we really want minification, we could add additional transforms like:
find the shortest representation of a char:
/\u0010/
(6) -> /\x10/
(4)
/\u0000/
(6) -> /\0/
(2)
convert all coded chars to their printable equivalent (even outside of basic printable ASCII range):
/\u{dd}/u
(6) -> /Ý/
(1)
reorder class ranges in a way to avoid escaping:
/[\^a]/
(5) -> /[a^]/
(4)
/[,\-.]/
(6) -> /[,.-]/
(5)
and so on.
What's your opinion on this?
I intend to start working on actual options for the optimizer. The whitelist is a good start, but it's kind of verbose and does not really allow for "optional" transforms. I think an object of options with default values would be more appropriate (something like the compress options in UglifyJS).
@Golmote, all good points. Initially I was thinking in terms of "minifier", however, "optimizer" might also mean whether it's faster to parse/execute a specific construct.
The things with printable chars, and making codes simpler also makes sense, would be great to have.
Potentially we can provide also --minimize
, unless it'll be confusing what to use in which use-case: --optimize
or --minimize
. So we should keep in mind that it should be easier for an end user (making less decisions/choice in this case). At the same time, I like the idea of passing options, and using specific transforms for advanced users.
@DmitrySoshnikov I think the default options should lead to an output that follows those principles:
- Optimize and minimize as much as possible
- Don't (ever?) make the optimized regexp longer than the original
- Maintain readability (By readability, I mean avoiding conversion to unicode chars, avoid ranges that don't "feel" right like
Z-a
, etc.) - Don't change the semantics (Too bad for the \s optimization... changing the semantics would lead to hard-to-find bugs and we don't want standard end users to run into this. We could provide a
--loose
option to perform additional loose optimisations but I don't think it should be on by default)
That should keep the standard end user happy. And options could provide a more granular approach.
I'll slowly start working on this. I'll have more visibility once I dig into it, I guess.
EDIT: Also, I would stay with just --optimize
for now. You're right it might get confusing having both --optimize
and --minimize
.
@Golmote, yep, agreed; all good points! I am fully supportive here.
@DmitrySoshnikov Hi, I'm encountering issues while trying to implement this.
-
With our current
yargs
behaviour, options would technically be global, and not specific to the Optimizer. I'm not sure if it's an issue. If we wanted to be more specific, I think we should use commands instead of args (See https://github.com/yargs/yargs/blob/master/docs/advanced.md#commands) but I'm not sure we want to take this path. -
I started to implement three options for now:
--loose-meta-s
which allows for[ \n\r\t]
to be optimized as\s
,--loose-meta-w
which allows for[\da-zA-Z_]
to be optimized as\w
even withiu
flags set, and--loose
wich is a shortcut to set both the previous options at once. Now the question is: which component should be responsible for expanding the--loose
option to--loose-meta-s --loose-meta-w
? It could be the Optimizer, but since the options are global (as explained in my first point), it makes little sense. Yet I don't know if we want to expose a function for this in the globalregexpTree
object. -
Are you OK with the options replacing the whitelist for the Optimizer entirely? That means there should be an option to enable/disable each transform. Like
--lowercase-insensitive
,--merge-class-ranges
, etc. (we'll have to find a nice option name for each transform...) -
Maybe all the points above could be solved by specifying all options in a string, as an optional value for the
-o
arg. It's actually the way Uglify went, apparently. -
For the loose options, they must be sent to the transforms somehow. I originally thought it could be added as a parameter to
shouldRun
andinit
. This means adding anoptions
parameter toregexTree.optimize()
,optimizer.optimize()
,transform.transform()
andtraverse.traverse()
... But thetraverse
function already has anoptions
parameter, and I'm worried it will be confusing to have two options params here, one to configure the behaviour of the handlers, and one to configure the behaviour of the traversal...
Any advice on those points?
@Golmote, I actually was thinking about command-based interface, since we're starting to have options which may collide. I think this can be appropriate approach long term (would be great to somehow preserve backward compatibilities, and not break previous versions).
The explicit --loose-meta-s
, --loose-meta-w
, as well as a bunch of --lowercase-insensitive
, --merge-class-ranges
doesn't seem to me the best approach: we want to have a concise CLI interface, especially when a user will run --help
option (so going with commands in this case sounds like a good idea). If we go with --loose
though, I think only one option --loose
is enough which enables all loose transformations.
I like the idea of passing the whole "config" as a string. All modules can use this explicit config, and treat it semantically as needed (i.e. optimizer may have its own values there, compat-transpiler -- others). And in this case we can propagate this config somehow to all needed modules. Hopefully it won't be too confusing with options
existing in some modules already.
Or (better) we could pass the --options='{"foo": 10, "bar": 20}'
, and use it to set/override the options parameter in all those modules -- this approach will make it generic using from Node, and form CLI.
@DmitrySoshnikov The problem is that even if we can have per-command/per-module options, they would still end up being kinda global on the programmatic side.
Consider a --loose
option for the Optimizer (let's not think about any more granular control for now). On the programmatic side, the Optimizer will have to pass this option on to the Transform module, which in turn will have to pass it on to the Traverse module, so that it can be finally sent to all the handlers which ccould react accordingly.
Since the modules use each other, how can they still have options that are specific to each of them?
@Golmote, yeah, makes sense. Alternatively we could have some global "options registry", where keys are the module names (optimizer, transform, etc), and values are the actual options for them.
In this case transforms may just require this registry, and get needed options. I can also be just global Options
object for all the global shared options. It can track a module-level storage, which is initialized when then CLI starts.
This however might not be ideal for usage from Node, and in this case propagating the transform options might be a good idea. The setTransformOptions
may still manipulate the shared module, and transforms may just require it.
I just want to share a minor oddity I saw in my own code using a eslint plugin to "optimize regexps":
/\.(eot|woff|woff2|ttf|otf|svg|png|jpg|jpeg|jp2|jpx|jxr|gif|webp|mp4|mp3|ogg|pdf|html|ico)$/
to:
/\.(eot|wof{2}|wof{2}2|t{2}f|otf|svg|png|jpg|jpeg|jp2|jpx|jxr|gif|webp|mp4|mp3|og{2}|pdf|html|ico)$/
The result is neither better readable and should probably only be slightly higher in performance (I am not deep into regexp tech). At least using this "optimization" inside a eslint rule makes it pretty worthless. In this case readability trumps even minor perf optimizations.
Probably it would be useful to disable a few specific optimizations e.g. in the mentioned case it seems to be related to repeating characters.
This is the eslint plugin I am using: https://github.com/BrainMaestro/eslint-plugin-optimize-regex
@swernerx, good and valid point, thanks. Yeah, the optimizer module already can accept a whitelist of the needed transforms.
So potentially eslint-plugin-optimize-regex
could expose an ability to provide this whitelist in the eslint plugin option.
But I agree ttf
-> t{2}f
might be a bit too aggressive "optimization", maybe we can exclude optimizing single characters. What do you think, @Golmote?
In any way, it might be still good for the eslint plugin to expose this the ability to provide the whitelist, and allow users of the plugin to exclude non-needed optimizations.
Hi! I totally agree. I got a bit worried of adding the options thing, because of all the repercussions it had on the codebase last time I tried. I'm not even sure I'm ready to give it another try anytime soon.
I could definitely go check and try to remove some of the bad optimizations I mentionned in the first post (https://github.com/DmitrySoshnikov/regexp-tree/issues/135#issue-278042051), though!
@Golmote,
I could definitely go check and try to remove some of the bad optimizations I mentionned in the first post (#135 (comment)), though!
Sounds good to me, yeah, we can probably exclude the one @swernerx mentions with the single chars.