
Regular expression export/import

Open sverker opened this issue 6 months ago • 6 comments

Problem

Before OTP 28.0 it was possible to abuse the compiled format of regular expressions, as returned by re:compile, as if it were a serialized format that could be imported into other Erlang node instances. This abuse happened to work as long as the underlying hardware architecture and PCRE version were not too incompatible. But it was unsafe, as passing an incompatible compiled regular expression to re:run could result in any kind of unpleasant behavior.

In OTP 28.0 the compiled format has changed to not expose the internals of PCRE but instead return a safe (magic) reference to the internal regex structures. A compiled regex is now safe but can only be used in the node instance that compiled it.

Solution

This PR introduces a supported, safe way to export compiled regular expressions. The exported format is self-contained and can be stored off-node or sent to other nodes. If the importing node is compatible (same architecture and PCRE version), the compiled regex can be used directly with minimal overhead. If it is not compatible, the regular expression is recompiled from the original string and options, which are included in the exported format as a fallback.

Usage

% Use 'export' option to re:compile
{ok, Exported} = re:compile(RegexString, [export | OtherOptions]),

then, on a potentially different node, do

Imported = re:import(Exported),

re:run(Subject, Imported),
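The two steps above can be put together in one self-contained round trip. This is only a sketch, assuming OTP 28.1 or later where the export option and re:import/1 from this PR are available. The module name, the function name and the use of term_to_binary/binary_to_term to stand in for "sending to another node" are illustrative assumptions, not part of the PR.

```erlang
-module(re_export_demo).
-export([roundtrip/0]).

%% Sketch only: assumes OTP 28.1+ ('export' option and re:import/1).
roundtrip() ->
    %% "Sending" side: compile with 'export' and serialize the result.
    {ok, Exported} = re:compile("^h(el+)o", [export, caseless]),
    Payload = term_to_binary(Exported),

    %% "Receiving" side: deserialize, import once, then match as usual.
    %% In a real system the payload would arrive from storage or from
    %% another node instead of from the same process.
    Imported = re:import(binary_to_term(Payload)),
    re:run("HELLO world", Imported, [{capture, all, list}]).
```

Importing once and reusing Imported for many re:run calls keeps the import cost off the matching path.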

Exported format

The exported format is opaque, but it currently looks like this:

{re_exported_pattern, HeaderBin, OrigBin, OrigOpts, EncodedBin}

  • EncodedBin - binary containing the compiled regex as encoded by pcre2_serialize_encode()
  • HeaderBin - binary with some meta information including a CRC checksum over EncodedBin
  • OrigBin - original regular expression as a binary string
  • OrigOpts - options passed to re:compile/2.

Future optimization

Users who previously generated Erlang code with compiled regular expressions as literals would now instead compile with the export option and generate re:import(Literal) instead of just the literal. If done like that, the beam loader could be optimized to detect such calls to re:import with literal arguments, evaluate the calls at load time, and replace them with the returned compiled regular expression as a literal term.

sverker avatar Jun 18 '25 18:06 sverker

@josevalim What do you think about this?

sverker avatar Jun 18 '25 18:06 sverker

CT Test Results

    4 files, 228 suites, 1h 54m 13s ⏱️
    3 729 tests: 3 626 ✅, 103 💤, 0 ❌
    4 859 runs: 4 730 ✅, 129 💤, 0 ❌

Results for commit efd5ef0c.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

github-actions[bot] avatar Jun 18 '25 18:06 github-actions[bot]

I believe this is fantastic and simplifies many of the issues we had to tackle in Elixir. Thank you.

It would be fantastic if this could be used from Erlang too. Perhaps a pass in the compiler will rewrite re:compile into re:import?

Also, do you see this making it into 28.1 or would it be 29 only?

josevalim avatar Jun 18 '25 18:06 josevalim

The plan is to get this export/import functionality into 28.1, and then potentially do the loader optimization later, maybe as early as 28.2.

sverker avatar Jun 19 '25 09:06 sverker

@sverker making it part of 28.1 would help Elixir codebases migrate to latest OTP, so thank you.

I have one additional question: do you think it is reasonable for re:run to automatically import an exported regex? I am thinking about the multi-node scenario, where you would need to explicitly import messages across nodes (which could be arbitrarily nested), so having it just work is beneficial. Or are you worried about importing being expensive if we have to do it on every operation?

josevalim avatar Jun 19 '25 17:06 josevalim

I have one additional thought: what if the export is part of the existing tagged tuple? For example, you can add a new field to {re_pattern, _, _, _, _} that returns the export or the atom none. If exported, then you can transparently send it across nodes or run it locally with no performance cost. The receiving node can also run it transparently but it has the option of importing it to make sure it is optimised. What do you think would be the pros and cons of this approach?

josevalim avatar Jun 20 '25 12:06 josevalim

The "import" step was literally free but unsafe. It is now safe but not totally free. It has to:

  1. Check the CRC checksum of the imported binary.
  2. Allocate memory for the compiled regex.
  3. Do the "decoding" which seems to be basically a memory copy operation in current PCRE2.

I did some measurements, and the import seems to be at least a factor of 10 cheaper than compiling the corresponding expression. Compiling a large 20 kb regex took ~500μs while importing it took ~40μs.
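A comparison of this kind can be reproduced roughly as follows. This is a minimal sketch, not the benchmark used for the numbers above: it assumes OTP 28.1 or later, the module and function names are hypothetical, and a single timer:tc call per operation is a crude measurement that should be repeated and averaged for anything serious.

```erlang
-module(re_import_timing).
-export([compare/1]).

%% Sketch only: assumes OTP 28.1+ ('export' option and re:import/1).
%% Returns wall-clock times in microseconds for one compile and one import.
compare(RegexString) ->
    %% Prepare the exported form outside the timed sections.
    {ok, Exported} = re:compile(RegexString, [export]),

    %% Time a plain compile of the same expression.
    {CompileUs, {ok, _Compiled}} =
        timer:tc(fun() -> re:compile(RegexString) end),

    %% Time the import of the already exported pattern.
    {ImportUs, _Imported} =
        timer:tc(fun() -> re:import(Exported) end),

    #{compile_us => CompileUs, import_us => ImportUs}.
```

The gap is expected to matter mostly for large expressions; for short ones both numbers may be too small to compare meaningfully.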

Our idea was to keep the import as a separate step for performance reasons. At least to begin with. After all, the only reason to precompile regex is performance. If you don't care much about that, just send the regex across node instances uncompiled.

For example, if someone has existing generated code looking like this

choose_regex(foo) ->
    {re_pattern, ...};
choose_regex(bar) ->
    {re_pattern, ...}.

do_the_match(Subject, Mode) ->
    re:run(Subject, choose_regex(Mode)).

then the loader trick would probably not trigger, as the regex argument to re:run is not a compile-time literal.

If we keep the import separate, then the code generation could be changed simply by adding the export option to re:compile and wrapping the generated literals in re:import.

choose_regex(foo) ->
    re:import({re_exported_pattern, ...});
choose_regex(bar) ->
    re:import({re_exported_pattern, ...}).

do_the_match(Subject, Mode) ->
    re:run(Subject, choose_regex(Mode)).

The loader can detect the calls to re:import with literal arguments while the rest of the code can stay untouched.
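A generator that emits clauses of that shape could be sketched as below. This is an illustration under assumptions, not code from the PR: it assumes OTP 28.1 or later, the module name and the clause/2 helper are hypothetical, and the function name choose_regex is simply taken from the example above. The exported tuple is treated as an opaque term and printed with ~w so it appears in the generated source as a literal.

```erlang
-module(regex_codegen).
-export([clause/2]).

%% Sketch only: assumes OTP 28.1+ ('export' option in re:compile/2).
%% Returns the source text of one choose_regex/1 clause without a
%% terminator; the caller joins clauses with ";" and ends with ".".
clause(Name, RegexString) when is_atom(Name) ->
    {ok, Exported} = re:compile(RegexString, [export]),
    %% ~w prints the opaque exported tuple (binaries included) in valid
    %% Erlang syntax, so re:import/1 receives a literal argument.
    io_lib:format("choose_regex(~p) ->~n    re:import(~w)",
                  [Name, Exported]).
```

Because the argument to re:import in the generated code is a literal, it is the kind of call the proposed loader optimization could evaluate at load time.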

We can always add automatic import to re:compile and/or re:run later if we find it useful.

sverker avatar Aug 11 '25 14:08 sverker

Got it, thank you. I think I misunderstood it initially but it is now clear to me: I need to call re:export at compile time and have re:import({re_exported_pattern, ...}) in the Erlang AST. That's what will be seen and optimized by the loader. This way, exported regexes also won't show up anywhere else in the code, because they are converted into regular ones by the loader.

josevalim avatar Aug 11 '25 15:08 josevalim

Yes. Except, instead of a new re:export, you call re:compile with the export option. I don't remember now why we preferred an option over a separate export function. Summer vacation amnesia.

sverker avatar Aug 11 '25 15:08 sverker

Given export returns a completely different opaque type, it may be handy to gate it behind a separate function indeed. But from my side they both work the same.

josevalim avatar Aug 11 '25 16:08 josevalim

Thank you so much @sverker and everybody involved! Successfully integrated in Elixir (PR), looking forward to 28.1's release 🚀

sabiwara avatar Aug 30 '25 09:08 sabiwara