regex-benchmark

D: use ctRegex

Open kubo39 opened this issue 1 year ago • 6 comments

kubo39 avatar Apr 13 '23 10:04 kubo39

Thanks, my knowledge about D is limited, so I have some doubts.

  • What is the difference from the current implementation? And why do you want to change it?
  • Does it make sense to keep only one implementation or both?

mariomka avatar Apr 14 '23 06:04 mariomka

Hi,

ctRegex compiles the regular expression at compile time.

I expected three performance benefits:

  1. It avoids the runtime cost of constructing the regex, including the Unicode handling.
  2. It avoids heap allocations.
  3. It compiles to native code, which the compiler could optimize with specialized instruction sets.

See also this cool article about Rust's regex! macro (now deprecated, as described here, but still very useful!).
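
For readers less familiar with D, the difference between the two constructors can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the pattern, the function names `countRuntime` and `countCompileTime`, and the sample input are all assumptions made for the example.

```d
import std.range.primitives : walkLength;
import std.regex : ctRegex, matchAll, regex;
import std.stdio : writeln;

// Illustrative pattern only; assumed for this sketch.
enum emailPattern = `[\w\.+-]+@[\w\.-]+\.[\w\.-]+`;

// Runtime construction: the pattern is parsed each time regex() is called.
size_t countRuntime(string data)
{
    auto re = regex(emailPattern);
    return data.matchAll(re).walkLength;
}

// Compile-time construction: the matcher is generated while the program
// is compiled, so no pattern parsing happens at run time.
size_t countCompileTime(string data)
{
    auto re = ctRegex!(emailPattern);
    return data.matchAll(re).walkLength;
}

void main()
{
    // Hypothetical sample input.
    string data = "a@b.cd and x@y.zw";
    writeln(countRuntime(data), " ", countCompileTime(data));
}
```

Both functions should report the same number of matches; only where the construction cost is paid differs.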

However, on my local machine the benchmark shows no difference. (Sorry, I should have checked before sending the PR!)

I'm digging into it, and will close this once I've found the reason.

kubo39 avatar Apr 14 '23 07:04 kubo39

Thanks for the info!

I ran it on my computer, and there's a small change, not huge, but it's better.

mariomka avatar Apr 14 '23 07:04 mariomka

DMD - v2.103.0

  • slower than optimized branch.
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ git branch
* d-compile-time-regex
  master
  optimized
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ dmd -O -release d/benchmark.d
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
307.404800 - 92
300.025700 - 5301
4.375800 - 5
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ git checkout optimized
Switched to branch 'optimized'
Your branch is up to date with 'upstream/optimized'.
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ dmd -O -release d/benchmark.d
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
262.630300 - 92
269.145000 - 5301
5.823400 - 5
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
264.894800 - 92
268.622300 - 5301
5.635600 - 5
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ git checkout d-compile-time-regex
Switched to branch 'd-compile-time-regex'
Your branch is up to date with 'origin/d-compile-time-regex'.
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ dmd -O -release d/benchmark.d
(dmd-2.103.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
290.224200 - 92
283.388900 - 5301
4.662600 - 5

LDC - v1.32.0

  • much faster than optimized.
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ git checkout optimized
Switched to branch 'optimized'
Your branch is up to date with 'upstream/optimized'.
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ldc2 -O3 -release d/benchmark.d
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ git branch
  d-compile-time-regex
  master
* optimized
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ldc2 -O3 -release d/benchmark.d
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
167.561100 - 92
163.916900 - 5301
4.397100 - 5
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ git checkout d-compile-time-regex
Switched to branch 'd-compile-time-regex'
Your branch is up to date with 'origin/d-compile-time-regex'.
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ldc2 -O3 -release d/benchmark.d
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
88.026900 - 92
88.755400 - 5301
3.594100 - 5
(ldc-1.32.0)kubo39@hinoda:~/dev/kubo39/regex-benchmark$ ./benchmark input-text.txt
88.634500 - 92
88.571300 - 5301
3.624200 - 5

kubo39 avatar Apr 14 '23 17:04 kubo39

Just use:

import std.array;

auto m = data.matchAll(ctRegex!(pattern));
count = cast(int) m.array.length;

It is easy to read and runs faster than foreach.
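
One way to check this on your own machine is to time both counting styles side by side. The sketch below is only an assumption about how such a comparison could look: the pattern, the helper names `countWithForeach` and `countWithArray`, and the tiny sample input are hypothetical, and a meaningful measurement would need the real benchmark input and repeated runs.

```d
import std.array : array;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.regex : ctRegex, matchAll;
import std.stdio : writefln;

// Illustrative pattern only; assumed for this sketch.
enum pattern = `[\w\.+-]+@[\w\.-]+\.[\w\.-]+`;

// Count matches by iterating over the match range.
size_t countWithForeach(string data)
{
    size_t count;
    foreach (m; data.matchAll(ctRegex!(pattern)))
        ++count;
    return count;
}

// Count matches by collecting them into an array first.
size_t countWithArray(string data)
{
    return data.matchAll(ctRegex!(pattern)).array.length;
}

void main()
{
    // Hypothetical sample input; far too small for a real measurement.
    string data = "a@b.cd and x@y.zw";

    auto sw = StopWatch(AutoStart.yes);
    immutable viaForeach = countWithForeach(data);
    immutable foreachTime = sw.peek;

    sw.reset();
    immutable viaArray = countWithArray(data);
    immutable arrayTime = sw.peek;

    writefln("foreach: %s matches in %s", viaForeach, foreachTime);
    writefln("array:   %s matches in %s", viaArray, arrayTime);
}
```

Note that the array version allocates storage for every match, so which style wins may depend on the compiler and the input.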

cyrusmsk avatar May 31 '23 08:05 cyrusmsk

> Thanks, my knowledge about D is limited, so I have some doubts.
>
>   • What is the difference from the current implementation? And why do you want to change it?
>   • Does it make sense to keep only one implementation or both?

I propose keeping both. Currently ctRegex should be faster, but many people in the D community dislike this approach because it increases compilation time significantly. There has even been some talk of removing ctRegex from the standard library, but those are just rumors, and it is better to have both. It will be easy to remove one solution in the future if something changes.

cyrusmsk avatar May 31 '23 08:05 cyrusmsk