rx, a program for compiling sets of regular expressions

Open katef opened this issue 1 year ago • 0 comments

From the manpage:

Input files have one pattern per line. Each pattern has an associated id. ids are assigned depending on the number of input files. For a single file, ids are assigned per pattern (that is, the id is the line number within the file). For multiple files, ids are assigned per file (that is, the same id is shared by all patterns within a file).
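To make the id rule concrete, here is a small Python sketch of the assignment as the manpage excerpt describes it. This is not rx's code: `assign_ids` is a hypothetical helper, and numbering from 0 is an assumption, since the excerpt doesn't say whether ids start at 0 or 1.

```python
def assign_ids(files):
    """Assign pattern ids following the rule quoted from the manpage.

    files: a list of input files, each given as a list of pattern strings
           (one pattern per line).
    Returns a list of (id, pattern) pairs.

    Assumption: ids count from 0. The manpage excerpt does not state the base.
    """
    if len(files) == 1:
        # Single file: one id per pattern (the line number within the file).
        return [(line_no, pattern) for line_no, pattern in enumerate(files[0])]

    # Multiple files: one id per file, shared by every pattern in that file.
    return [
        (file_no, pattern)
        for file_no, patterns in enumerate(files)
        for pattern in patterns
    ]


single = assign_ids([["abc", "def"]])
multi = assign_ids([["abc", "def"], ["xyz"]])
print(single)  # [(0, 'abc'), (1, 'def')]
print(multi)   # [(0, 'abc'), (0, 'def'), (1, 'xyz')]
```

With several files, the file effectively acts as a category: every pattern in it reports the same id.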

Pattern ids are made available to the generated code when successfully matching a set of one or more patterns. You can see these with -l dot output. It is possible for a given text string to match patterns associated with different ids. There are several ways to deal with this; which one is appropriate depends on the application:

Report an error at compile time. This is the default for rx(1). To use this mode, ensure your patterns don't overlap. In particular, you can use rx -q as a lint to find conflicts.

Give conflicting patterns the same id in the first place. This would be the case for a lexer, where you might have multiple spellings that produce the same token.

Allow ambiguous patterns, and the generated API returns a set of ids. See -u.

Earliest line number (lower id) wins. This would suit a firewall-like application where it doesn't matter which pattern matched. See -t.

Longest match or most specific regex wins. This doesn't work for DFA and so is not provided by rx(1).
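The policies above can be sketched in a few lines of Python. This is only an illustration of the semantics, not how rx works: rx resolves conflicts at compile time over the DFA, whereas this sketch checks a single input string at match time using Python's `re` module. `matching_ids`, `resolve`, and the policy names are hypothetical, and whole-string matching via `re.fullmatch` is a simplifying assumption.

```python
import re


def matching_ids(patterns, text):
    """Return the sorted ids of all patterns that match the whole of text.

    patterns: list of (id, regex) pairs. Several patterns may share one id.
    """
    return sorted({pid for pid, regex in patterns if re.fullmatch(regex, text)})


def resolve(ids, policy):
    """Apply one of the conflict policies to a list of matching ids.

    Policy names are made up for this sketch:
      "error"    - more than one distinct id is a conflict (like the default)
      "set"      - return every matching id (in the spirit of -u)
      "earliest" - the lowest id wins (in the spirit of -t)
    """
    if not ids:
        return None
    if policy == "error":
        if len(ids) > 1:
            raise ValueError(f"conflicting ids: {ids}")
        return ids[0]
    if policy == "set":
        return set(ids)
    if policy == "earliest":
        return min(ids)
    raise ValueError(f"unknown policy: {policy}")


# "if" matches both the identifier pattern (id 0) and the keyword (id 1).
patterns = [(0, r"[a-z]+"), (1, r"if"), (2, r"[0-9]+")]
ids = matching_ids(patterns, "if")
print(ids)                       # [0, 1]
print(resolve(ids, "set"))       # {0, 1}
print(resolve(ids, "earliest"))  # 0

# Giving both spellings the same id removes the conflict up front.
spellings = [(0, r"colou?r"), (0, r"color")]
print(resolve(matching_ids(spellings, "color"), "error"))  # 0
```

The last case shows why sharing an id suits a lexer: overlapping patterns are harmless as long as they all report the same token.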

You can get some resource stats with -Q:

; ./build/bin/rx -Q -r literal -Fb -k str -l llvm /usr/share/dict/words > /tmp/w.ll
charset: [(none)]
reject: []
flags: 0x40
literals[0].count = 0
literals[1].count = 0
literals[2].count = 0
literals[3].count = 104334
literals (unanchored): 0 patterns, 2 states
literals (^left): 0 patterns, 2 states
literals (right$): 0 patterns, 1 states
literals (^both$): 104334 patterns, 238103 states
general: 0 patterns (limit 18446744073709551615)
declined: 0 patterns
fsm_count = 4 FSMs prior to union
nfa: 238111 states
dfa: 238104 states
rusage.utime: 8.37991
rusage.stime: 0.30796
rusage.maxrss: 524 MiB
;

There are a few small fixes on this branch that superficially have nothing to do with rx. That's because I originally had much more groundwork here, which I've pulled out to separate PRs (especially #485 and #486, but also others). I want to keep the history for rx itself intact, rather than rebase away the stuff I moved out to other PRs. So I've merged over from main, and left the few seemingly-unrelated fixes without rebasing them out.

rx was named by @averymcnab

Aug 19 '24 22:08 katef