Regex Library

Open CeleritasCelery opened this issue 2 years ago • 0 comments

Emacs regex is similar to PCRE regex, so we could use the fancy-regex crate (which implements a backtracking engine) once #84 is fixed. However, there are still several differences that would need to be handled.

meta characters

Emacs regex meta characters are backwards from what most regex engines use. For example, () represents literal parens and \(\) is a capture group. Likewise, | is literal and \| is alternation. This is easy enough to fix by pre-processing the regex.
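A minimal sketch of that pre-processing step, using only the standard library. The function name `emacs_to_pcre` is hypothetical, and the sketch only covers the three characters mentioned above. It copies bracket expressions through unchanged, which is an assumption: it does not handle POSIX classes like [[:alpha:]] or the fact that a backslash is literal inside Emacs brackets.

```rust
/// Hypothetical helper: swap the escaped/unescaped meaning of `(`, `)` and
/// `|` so an Emacs-style pattern reads like a PCRE-style one.
pub fn emacs_to_pcre(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next() {
                // `\(`, `\)`, `\|` are operators in Emacs: emit them bare.
                Some(op @ ('(' | ')' | '|')) => out.push(op),
                // Any other escape is passed through untouched.
                Some(other) => {
                    out.push('\\');
                    out.push(other);
                }
                // A trailing backslash is kept as-is.
                None => out.push('\\'),
            },
            // Bare `(`, `)`, `|` are literals in Emacs: escape them.
            '(' | ')' | '|' => {
                out.push('\\');
                out.push(c);
            }
            // Copy a bracket expression verbatim up to its closing `]`.
            '[' => {
                out.push('[');
                // A leading `^` and a leading `]` belong to the class.
                if chars.peek() == Some(&'^') {
                    out.push(chars.next().unwrap());
                }
                if chars.peek() == Some(&']') {
                    out.push(chars.next().unwrap());
                }
                for d in chars.by_ref() {
                    out.push(d);
                    if d == ']' {
                        break;
                    }
                }
            }
            _ => out.push(c),
        }
    }
    out
}
```

With this sketch, the Emacs pattern \(foo\|bar\) becomes (foo|bar), and a bare f(x) becomes f\(x\).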

syntax aware matches

Several of the regex patterns match on the syntax definition of characters.

  • \w: word character
  • \sC: any character whose syntax class is C (for example, \s_ for symbol constituents)

"Word" and "symbol" are defined by the major mode's syntax table. You could transform these into general character classes ([...]) for the Rust regex engine.
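One way this transformation might look, as a rough sketch. Both function names are hypothetical, and the list of word characters stands in for whatever the real syntax table lookup would return. A real version would probably emit ranges rather than individual characters, and would need to skip \w inside bracket expressions, which this sketch does not.

```rust
/// Hypothetical helper: build a `[...]` class from the characters a syntax
/// table marks as word constituents.
pub fn class_from_chars(chars: &[char]) -> String {
    let mut class = String::from("[");
    for &c in chars {
        // Escape characters that are special inside a bracket class.
        if matches!(c, '\\' | ']' | '[' | '^' | '-') {
            class.push('\\');
        }
        class.push(c);
    }
    class.push(']');
    class
}

/// Hypothetical helper: replace every `\w` in `pattern` with `class`.
/// An escaped backslash followed by `w` (`\\w`) is left alone.
pub fn expand_word_class(pattern: &str, class: &str) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('w') => out.push_str(class),
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            None => out.push('\\'),
        }
    }
    out
}
```

Because the class is rebuilt from the syntax table, the expansion would have to be redone (or cached per table) whenever the major mode changes.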

There is also the special construct \=, which matches at point. To handle this you could split the buffer into two parts, before point and after point, then match each half separately.
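A rough sketch of the pattern side of that idea. The name `split_at_point` is hypothetical, and the sketch assumes that \= is not nested inside a group or alternation and that neither half uses look-arounds that reach across point. It splits the pattern at the first \= and anchors each half with \z and \A, so the first half has to end at the end of the before-point text and the second half has to start at the beginning of the after-point text.

```rust
/// Hypothetical helper: split a pattern at its first `\=` and anchor each
/// half so it can be run against one side of point.
/// Returns `None` if the pattern contains no `\=`.
pub fn split_at_point(pattern: &str) -> Option<(String, String)> {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'\\' {
            if bytes[i + 1] == b'=' {
                // Must match up to the very end of the before-point text.
                let before = format!("(?:{})\\z", &pattern[..i]);
                // Must match from the very start of the after-point text.
                let after = format!("\\A(?:{})", &pattern[i + 2..]);
                return Some((before, after));
            }
            // Skip the escaped character so `\\=` is not misread as `\=`.
            i += 2;
        } else {
            i += 1;
        }
    }
    None
}
```

The buffer text itself could then be split at point (for a string, something like `text.split_at(point)`), with the first pattern run on the left half and the second on the right half.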

boundaries

Emacs defines regex constructs for the boundaries of words and symbols.

  • \<: beginning of word
  • \>: end of word
  • \_<: beginning of symbol
  • \_>: end of symbol

These will need to be implemented with look-arounds. You can’t even build them into the regex engine, because the definitions of word and symbol can change per major mode.
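A sketch of what those look-arounds might be, assuming the word or symbol constituents are already available as a character class string (for example one derived from the current syntax table). The function names are hypothetical. The look-behind and look-ahead syntax is the usual PCRE-style form, which a backtracking engine such as fancy-regex is expected to accept.

```rust
/// Hypothetical helper: look-arounds for "beginning of word/symbol".
/// `class` is a character class such as `[A-Za-z0-9]`.
/// Matches where the previous char is not in the class and the next one is.
pub fn boundary_start(class: &str) -> String {
    format!("(?<!{c})(?={c})", c = class)
}

/// Hypothetical helper: look-arounds for "end of word/symbol".
/// Matches where the previous char is in the class and the next one is not.
pub fn boundary_end(class: &str) -> String {
    format!("(?<={c})(?!{c})", c = class)
}
```

\< and \> would be replaced using the word class, and \_< and \_> using the symbol class, each time the pattern is compiled for a given major mode.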

Buffer Gap

Most performance-oriented regex libraries expect to operate on contiguous data. However, a gap buffer will have a gap of garbage data somewhere in the buffer. This becomes a problem when the span of the regex search crosses the gap. The simplest solution here is to move the gap outside of the range of the search. This could cause performance issues if the lines are really long. We also have to consider how to match multiline regex. Not sure of a good way to handle that. Here are some notes from the remacs project.
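To make the "move the gap outside of the range" idea concrete, here is a minimal sketch over a toy gap buffer. The `GapBuffer` type and its methods are hypothetical and are not the project's actual buffer. When the gap falls strictly inside the search range, the sketch moves it to whichever end of the range requires copying fewer bytes, then returns the range as one contiguous slice.

```rust
/// Toy gap buffer for illustration: `data[gap_start..gap_end]` is the gap.
pub struct GapBuffer {
    data: Vec<u8>,
    gap_start: usize,
    gap_end: usize,
}

impl GapBuffer {
    /// Build a buffer holding `text` with a gap of `gap_len` bytes at `gap_at`.
    pub fn new(text: &str, gap_at: usize, gap_len: usize) -> Self {
        let bytes = text.as_bytes();
        let mut data = Vec::with_capacity(bytes.len() + gap_len);
        data.extend_from_slice(&bytes[..gap_at]);
        data.resize(gap_at + gap_len, 0); // gap contents are garbage
        data.extend_from_slice(&bytes[gap_at..]);
        Self { data, gap_start: gap_at, gap_end: gap_at + gap_len }
    }

    /// Move the gap so that it begins at logical offset `pos`.
    fn move_gap(&mut self, pos: usize) {
        let gap_len = self.gap_end - self.gap_start;
        if pos < self.gap_start {
            // Shift text in pos..gap_start to the far side of the gap.
            self.data.copy_within(pos..self.gap_start, pos + gap_len);
        } else if pos > self.gap_start {
            // Shift text just after the gap down into the gap.
            self.data.copy_within(self.gap_end..pos + gap_len, self.gap_start);
        }
        self.gap_start = pos;
        self.gap_end = pos + gap_len;
    }

    /// Return logical range `start..end` as one contiguous slice, moving
    /// the gap only if it lies strictly inside the range.
    pub fn contiguous(&mut self, start: usize, end: usize) -> &[u8] {
        if start < self.gap_start && self.gap_start < end {
            // Move to the nearer end of the range to copy fewer bytes.
            let to_start = self.gap_start - start;
            let to_end = end - self.gap_start;
            self.move_gap(if to_start <= to_end { start } else { end });
        }
        if end <= self.gap_start {
            // Range is entirely before the gap.
            &self.data[start..end]
        } else {
            // Range is entirely after the gap: skip over it.
            let gap_len = self.gap_end - self.gap_start;
            &self.data[start + gap_len..end + gap_len]
        }
    }
}
```

The cost of each call is bounded by the length of the search range, which is where the concern about very long lines comes from.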

CeleritasCelery, Jan 14 '23 19:01