Rust

Open JonathanTroyer opened this issue 6 years ago • 19 comments

Flavor Request

The syntax is similar to Perl, but I feel it has enough differences to justify a different flavor, especially when one considers the massive popularity of ripgrep (which is used by VSCode!) and the growth of Rust.

JonathanTroyer avatar Sep 30 '19 16:09 JonathanTroyer

Support this, it would be a great feature.

ksandvik avatar Sep 30 '19 21:09 ksandvik

I asked @BurntSushi, the author of the Rust regex library, to help bring a Rust flavor to regex101.com. He said:

Go's regex engine is pretty similar. The main differences are that this crate has much better Unicode support and supports more advanced character class notation (i.e., intersection, subtraction and symmetric difference).

So I think, it should be possible to just copy the Go flavor and give it the name Rust and this should be enough for now. The author @BurntSushi is prepared to help where he can. This is the issue on the Rust Regex repository: https://github.com/rust-lang/regex/issues/700#issuecomment-667065026

I created a playground gist that can be compiled and run online to check special cases where the flavors can differ: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=13975bff3879f843dc80d338091555b6

bestia-dev avatar Jul 31 '20 14:07 bestia-dev

So I think, it should be possible to just copy the Go flavor and give it the name Rust and this should be enough for now.

Please don't do this. It's one thing to say, "Go is very similar to Rust, so using that as a stopgap for most cases on ASCII text will work fine." But please don't officially label it as Rust because users will ultimately get quite confused when it differs from the actual Rust implementation. :-)

BurntSushi avatar Jul 31 '20 14:07 BurntSushi

To put it more precisely: the Rust regex flavor is very similar to the Go regex flavor. I would like to ask the regex101.com maintainers: do they just need a set of regex rules to create a new flavor, which they would then run on their own regex engine with a different configuration? Or do they need a real, functioning library written in Rust? That is something we can help write.
Rust also compiles well to WebAssembly (Wasm), if that is needed.

bestia-dev avatar Jul 31 '20 14:07 bestia-dev

@BurntSushi, I have a question about "flags/modifiers". Must they be part of the regular expression itself in Rust's regex, like (?m) or (?i)? It looks like other libraries can set these through configuration external to the expression. And which regex delimiters does Rust use? Some engines allow different delimiters (see image).

[images not preserved]

bestia-dev avatar Aug 01 '20 06:08 bestia-dev

I think special delimiters are not used in Rust. The regular expression is just a string. The normal string delimiter in Rust is the double quote ". But for a regex it is better to use the raw string syntax, like this: let s = r#"content"#; with the multi-character asymmetric delimiters r#" and "#. That way there is no need to escape the quote " or the backslash \ inside the raw string. They have no special meaning inside raw string syntax.
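A small sketch of the point about raw strings, using only the standard library. The function names are hypothetical; the pattern itself is just an example and is not compiled as a regex here.

```rust
/// The pattern written as an ordinary string literal:
/// every quote and backslash has to be escaped.
fn escaped_pattern() -> &'static str {
    "\"(\\d+)\\.(\\d+)\""
}

/// The same pattern as a raw string literal, written verbatim
/// between the r#" and "# delimiters.
fn raw_pattern() -> &'static str {
    r#""(\d+)\.(\d+)""#
}

fn main() {
    // Both literals produce exactly the same string contents.
    assert_eq!(escaped_pattern(), raw_pattern());
    println!("{}", raw_pattern());
}
```

When the pattern contains no double quote, the shorter form r"..." without the # is enough.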

bestia-dev avatar Aug 01 '20 07:08 bestia-dev

I have a question about "flags/modifiers". Must they be part of the regular expression itself in Rust's regex, like (?m) or (?i)? It looks like other libraries can set these through configuration external to the expression. And which regex delimiters does Rust use? Some engines allow different delimiters (see image).

These questions seem off topic for this thread, but they can be readily answered by the docs:

BurntSushi avatar Aug 01 '20 12:08 BurntSushi

Could the Rust regex engine be compiled into WASM and used on the website? If so, could someone create a PoC? That would speed up the process of actually getting this implemented.

firasdib avatar Feb 13 '21 14:02 firasdib

I've created a very rudimentary proof of concept following the wasm-bindgen guide. To test, clone the repo, then run npm install followed by npm run serve.

JonathanTroyer avatar Feb 13 '21 20:02 JonathanTroyer

@JonathanTroyer Thank you! Mind including a readme so I know how to run, build, etc?

firasdib avatar Feb 14 '21 18:02 firasdib

@JonathanTroyer Thank you! Mind including a readme so I know how to run, build, etc?

Done. Sorry for overlooking it, and thanks for working on this! Happy to help more in the future.

JonathanTroyer avatar Feb 14 '21 18:02 JonathanTroyer

@JonathanTroyer Thanks, I'll have a look this weekend most likely. Does this bundle the Rust regex engine into WASM, or are they just native bindings, relying on the user to have Rust installed locally?

firasdib avatar Feb 15 '21 15:02 firasdib

No bindings, it's fully compiled to WASM. I've got it hosted on Netlify for quick testing.

JonathanTroyer avatar Feb 15 '21 15:02 JonathanTroyer

Sweet! What size is it?

firasdib avatar Feb 15 '21 16:02 firasdib

In development mode with no optimizations, about 3MB everything included. The demo does not use all the features of Rust's regex package, so that size may grow depending on the final usage.

JonathanTroyer avatar Feb 15 '21 20:02 JonathanTroyer

@JonathanTroyer That is quite large, ideally we'd want it down to <500kb. I have followed their optimization guide, but I am unable to get index_bg.wasm under 1.1mb, and wasm_regex.wasm to below 610kb. Have you had any luck?

firasdib avatar Feb 16 '21 19:02 firasdib

Until they make it fully no_std + alloc, the size will likely stay around that.

cdecompilador avatar Feb 18 '22 02:02 cdecompilador

@firasdib How do the other implementations work? I'd assume there's less of a size restriction if you don't have to serve the binaries.

Assuming it is just something like a CLI program that runs locally, would you be able to specify the required interface? If so, somebody here could likely quickly build a working implementation.

tgross35 avatar Jul 26 '22 06:07 tgross35

@tgross35 They are compiled to WebAssembly and interfaced through JavaScript. The binaries will be downloaded from my server, so for the sake of both me and the users, they should be as small as possible.

firasdib avatar Jul 26 '22 11:07 firasdib

I also made a PWA (progressive web app) with Wasm/WebAssembly compiled from Rust, so it uses exactly the regex crate. https://bestia.dev/rust_regex_explanation_pwa/ https://github.com/bestia-dev/rust_regex_explanation_pwa

bestia-dev avatar Sep 23 '22 10:09 bestia-dev

That's pretty interesting @bestia-dev, what size were you able to get the wasm binaries down to? I think that is the main crux of support here.

tgross35 avatar Sep 25 '22 19:09 tgross35

The rust_regex_explanation_pwa_bg.wasm file is 1MB. That sounds like a big file for the web as we knew it before wasm. But in fact it has to be treated more like an installation file. Once you install it, it remains in the browser cache for a long time, and subsequent use of the PWA does not download it again. It is just like an installed native app, but without the hassle of having to think about installation. The "installation" is automagic.

bestia-dev avatar Sep 28 '22 10:09 bestia-dev

@bestia-dev are you building the regex crate with the perf features disabled? That might help reduce binary size. Not sure though.

BurntSushi avatar Sep 28 '22 11:09 BurntSushi

I took @JonathanTroyer's small example, modified the Cargo.toml a little, and rebuilt std with panic set to abort, on nightly.

Building from https://github.com/akarras/wasm-regex:

499031 Nov 18 15:23 wasm_regex.wasm

Just under 500KB.

The readme includes the exact wasm-pack command I used to create it.

akarras avatar Nov 18 '22 22:11 akarras

My bet is that you can disable some of the Unicode features too. Some are pretty arcane and not often used. I would recommend just using the following: unicode-bool, unicode-case, unicode-gencat, unicode-perl, unicode-script. In other words, disable unicode-age and unicode-segment. Probably not a huge win. If you wanted to go barebones, you could try just enabling unicode-case and unicode-perl.

BurntSushi avatar Nov 18 '22 23:11 BurntSushi

With @BurntSushi's suggestions, it's down to 445kb. I'm not sure what kind of API is needed, but I think that gives enough headroom to add a few things while staying under the 500kb goal.

akarras avatar Nov 18 '22 23:11 akarras

I wrote a quick manual JSON output and a replacer function to go with it: https://github.com/tgross35/wasm-regex. My binary size is even smaller at 427kB. Newer versions, maybe? I have npm LTS 8.19.2 and wasm-pack 0.10.3.

@BurntSushi is there a good way to match up capture group numbers and names? It seems like you can iterate names .capture_names() or get a single named group with .name(), but I can't figure out how to iterate all groups and optionally get a name for each (figure this might be needed to produce the regex101 output)

tgross35 avatar Nov 19 '22 01:11 tgross35

is there a good way to match up capture group numbers and names? It seems like you can iterate names .capture_names() or get a single named group with .name(), but I can't figure out how to iterate all groups and optionally get a name for each (figure this might be needed to produce the regex101 output)

Regex::capture_names is right. It yields unnamed capturing groups too. From the docs of CaptureNames:

An iterator over the names of all possible captures.

None indicates an unnamed capture; the first element (capture 0, the whole matched region) is always unnamed.

'r is the lifetime of the compiled regular expression.

BurntSushi avatar Nov 19 '22 01:11 BurntSushi

Awesome, got something working with named groups that is probably suitable enough for the site. Binary size is only 433kB when using serde.

@firasdib let us know exactly what output schema you want if you'd like me to tweak it for you.

tgross35 avatar Nov 19 '22 01:11 tgross35

wasm-opt -Oz -o out.wasm in.wasm reduces it a bit further for me. You can get wasm-pack to run it for you by adding this to the Cargo.toml:

[package.metadata.wasm-pack.profile.release]
wasm-opt = ["-Oz"]

wasm-pack seems to use a slightly older version of wasm-opt by default, though, and a newer version gives slightly smaller results for me. If you have wasm-opt in your PATH, wasm-pack will use that version instead.

benediktwerner avatar Nov 19 '22 02:11 benediktwerner