micro icon indicating copy to clipboard operation
micro copied to clipboard

enhance search API

Open matthias314 opened this issue 10 months ago • 27 comments

This PR is based on #3575 ~~and therefore a draft at present~~. It has the following components:

  • additional methods for searching text, for example methods that also return matched capturing groups,
  • a new type RegexpGroup that combines a regexp with its padded versions as used in #3575,
  • process Deltas in ExecuteTextEvent in reverse order. This makes replaceall easier to implement,
  • new functions LocVoid() and Loc.IsVoid() to deal with unused submatches.

The new types and functions are as follows (UPDATED):

// NewRegexpGroup creates a RegexpGroup from a string
func NewRegexpGroup(s string) (RegexpGroup, error)

// FindDown returns a slice containing the start and end positions
// of the first match of `rgrp` between `start` and `end`, or nil
// if no match exists.
func (b *Buffer) FindDown(rgrp RegexpGroup, start, end Loc) []Loc

// FindDownSubmatch returns a slice containing the start and end positions
// of the first match of `rgrp` between `start` and `end` plus those
// of all submatches (capturing groups), or nil if no match exists.
func (b *Buffer) FindDownSubmatch(rgrp RegexpGroup, start, end Loc) []Loc

// FindUp returns a slice containing the start and end positions
// of the last match of `rgrp` between `start` and `end`, or nil
// if no match exists.
func (b *Buffer) FindUp(rgrp RegexpGroup, start, end Loc) []Loc

// FindUpSubmatch returns a slice containing the start and end positions
// of the last match of `rgrp` between `start` and `end` plus those
// of all submatches (capturing groups), or nil if no match exists.
func (b *Buffer) FindUpSubmatch(rgrp RegexpGroup, start, end Loc) []Loc

// FindAllFunc calls the function `f` once for each match between `start`
// and `end` of the regexp given by `s`. The argument of `f` is the slice
// containing the start and end positions of the match. FindAllFunc returns
// the number of matches plus any error that occured when compiling the regexp.
func (b *Buffer) FindAllFunc(s string, start, end Loc, f func([]Loc)) (int, error)

// FindAll returns a slice containing the start and end positions of all
// matches between `start` and `end` of the regexp given by `s`, plus any
// error that occured when compiling the regexp. If no match is found, the
// slice returned is nil.
func (b *Buffer) FindAll(s string, start, end Loc) ([][]Loc, error)

// FindAllSubmatchFunc calls the function `f` once for each match between
// `start` and `end` of the regexp given by `s`. The argument of `f` is the
// slice containing the start and end positions of the match and all submatches
// (capturing groups). FindAllSubmatch Func returns the number of matches plus
// any error that occured when compiling the regexp.
func (b *Buffer) FindAllSubmatchFunc(s string, start, end Loc, f func([]Loc)) (int, error)

// FindAllSubmatch returns a slice containing the start and end positions of
// all matches and all submatches (capturing groups) between `start` and `end`
// of the regexp given by `s`, plus any error that occured when compiling
// the regexp. If no match is found, the slice returned is nil.
func (b *Buffer) FindAllSubmatch(s string, start, end Loc) ([][]Loc, error)

// ReplaceAll replaces all matches of the regexp `s` in the given area. The
// new text is obtained from `template` by replacing each variable with the
// corresponding submatch as in `Regexp.Expand`. The function returns the
// number of replacements made, the new end position and any error that
// occured during regexp compilation
func (b *Buffer) ReplaceAll(s string, start, end Loc, template []byte) (int, Loc, error)

// ReplaceAllLiteral replaces all matches of the regexp `s` with `repl` in
// the given area. The function returns the number of replacements made, the
// new end position and any error that occured during regexp compilation
func (b *Buffer) ReplaceAllLiteral(s string, start, end Loc, repl []byte) (int, Loc, error)

// ReplaceAllFunc replaces all matches of the regexp `s` with `repl(match)`
// in the given area, where `match` is the slice containing start and end
// positions of the match. The function returns the number of replacements
// made, the new end position and any error that occured during regexp
// compilation
func (b *Buffer) ReplaceAllFunc(s string, start, end Loc, repl func(match []Loc) []byte) (int, Loc, error)

// ReplaceAllSubmatchFunc replaces all matches of the regexp `s` with
// `repl(match)` in the given area, where `match` is the slice containing
// start and end positions of the match and all submatches. The function
// returns the number of replacements made, the new end position and any
// error that occured during regexp compilation
func (b *Buffer) ReplaceAllSubmatchFunc(s string, start, end Loc, repl func(match []Loc) []byte) (int, Loc, error)

// MatchedStrings converts a slice containing start and end positions of
// matches or submatches to a slice containing the corresponding strings.
func (b *Buffer) MatchedStrings(locs []Loc) ([]string)

// LocVoid returns a Loc strictly smaller then any valid buffer location
func LocVoid() Loc

// IsVoid returns true if the location l is void
func (l Loc) IsVoid() bool

The method FindNext is kept. ReplaceRegex is removed in favor of ReplaceAll. The latter is easier to use in Lua scripts.

Currently the simple search functions (FindDown etc.) take a RegexpGroup as argument to avoid recompiling the regexps. In contrast, FindAll, ReplaceAll and friends take a string argument. Many other variants would be possible. Also, the new search functions ignore the ignorecase setting of the buffer and don't wrap around when they hit the end of the search region. I think they are more useful this way in Lua scripts.

You will see that many new internal functions use callback functions. This avoids code duplication. (One has to somehow switch between (*regexp.Regexp).FindIndex() and (*regexp.Regexp).FindSubmatchIndex() in the innermost function that searches each line of the buffer.)

As said before, many details could be modified, but overall I think these functions are very useful for writing scripts. Please let me know what you think.

matthias314 avatar Feb 08 '25 03:02 matthias314

I've rebased the PR onto master and added NewRegexpGroup to the documentation.

matthias314 avatar Feb 09 '25 17:02 matthias314

The latest commit fixes a subtle bug related to the padding of the search region: In the presence of combining characters, one could end up with an infinite loop. (Try searching backwards for . in the line x⃗y⃗z⃗.)

This bug is also present in #3575, hence in master. If you want, I can backport 88f3cf5 to master. This would require some modification of the commit, so let me know if that's necessary.

matthias314 avatar Feb 09 '25 22:02 matthias314

I've force-pushed a polished version and updated the list of functions at the top of this page. It still fixes the bug mentioned above. Also, locations returned for matches and submatches are now guaranteed to include the runes that matched. The underlying Go regexp functions match runes, which may be part of combining characters like x⃗. The start and end locations now are such that the characters between them include all matching runes. This is not the case on master. (Search backwards for . in a row consisting of many x⃗ to see the difference.)

This PR introduces many new functions. Maybe we don't need all of them. For example, do we need a submatch version on top of each non-submatch search or replace function? If we keep only the submatch version, we wouldn't lose any functionality because the additional elements in the slice for each match can be ignored. I nevertheless added all functions to this PR to show what's possible.

matthias314 avatar Feb 22 '25 21:02 matthias314

@dmaluka Any chance to move forward with this PR?

Apart from a detailed review, a quick feedback would be helpful:

  • Do we need all search and replace functions I have defined? For example, we could drop the non-submatch functions (like FindAll) and rename the submatch versions (FindAll instead of FindAllSubmatch) without losing any functionality. (Is there a significant performance difference between the two?)
  • For the search functions accepting a regexp (or rather a RegexpGroup), should there be companion functions accepting strings? For example, FindDown could be renamed FindDownRegexpGroup, and a new function FindDown would accept a regexp given by a string. That could be convenient in Lua scripts. EDIT: In the latest version, all new search and replace functions allow the regexp to be either a string or RegexpGroup.
  • Currently the search and replace functions try to be smart, already the existing FindNext: If the end of the search region precedes the start, then they are swapped. This is convenient when searching/replacing within the cursor selection. However, should general-purpose functions better be simple and "dumb"?

matthias314 avatar Mar 01 '25 16:03 matthias314

Haven't had much time to look into it yet, sorry.

dmaluka avatar Mar 01 '25 19:03 dmaluka

To see some of the new search functions in action, check out my LaTeX plugin. The way the new ReplaceAll function works is also useful for my bookmark plugin because it allows the simplified bookmark handling implemented there to work with undoing replacement operations. (Both plugins are under development and need a custom version of micro.)

matthias314 avatar Mar 03 '25 20:03 matthias314

In the latest version, the regexp for all new search and replace functions can be specified either as a string or as a RegexpGroup. The latter is better for performance (because it avoids compiling the same regexp multiple times) while the former is easier to use in Lua scripts.

matthias314 avatar Mar 05 '25 00:03 matthias314

// NewRegexpGroup creates a RegexpGroup from a string
func NewRegexpGroup(s string) (RegexpGroup, error)

I'm worried that we are exposing such implementation details as padded regexps as a part of the API. I think we should try and make it a bit more abstract and future-proof, e.g. something like:

type RegexpSearch struct {
	// We want "^" and "$" to match only the beginning/end of a line [...]
	regex [4]*regexp.Regexp
}

func NewRegexpSearch(s string) (*RegexpSearch, error)

In the latest version, all new search and replace functions allow the regexp to be either a string or RegexpGroup.

And IMHO this implicit polymorphism is messy.

I'm thinking of something like:

func (b *Buffer) FindDown(s string, start, end Loc, useRegex bool, ignoreCase bool) []Loc
func (b *Buffer) FindUp(s string, start, end Loc, useRegex bool, ignoreCase bool) []Loc

which internally use:

func NewRegexpSearch(s string) (*RegexpSearch, error)
func (b *Buffer) FindRegexpDown(search *RegexpSearch, start, end Loc) []Loc
func (b *Buffer) FindRegexpUp(search *RegexpSearch, start, end Loc) []Loc

we could drop the non-submatch functions (like FindAll) and rename the submatch versions (FindAll instead of FindAllSubmatch) without losing any functionality.

Seems reasonable. The caller can ignore the returned submatches if it doesn't need them.

Currently the search and replace functions try to be smart, already the existing FindNext: If the end of the search region precedes the start, then they are swapped. This is convenient when searching/replacing within the cursor selection. However, should general-purpose functions better be simple and "dumb"?

Seems reasonable. The intuitively expected behavior would be: if start is greater than end, treat the range as empty and thus return no matches.

And I'm not even sure why exactly we currently swap them. From looking at the code it seems like we already make sure to always pass start less or equal to end (except findUp(), where we on the contrary always pass start greater or equal than end, so we might want to swap them just in findUp(), and unconditionally?).

dmaluka avatar Mar 09 '25 12:03 dmaluka

The intuitively expected behavior would be: if start is greater than end, treat the range as empty and thus return no matches.

Exactly. Another option would be to combine FindUp and FindDown into a single function Find(search string, start, end Loc). The search would be downwards if start is less than or equal to end and otherwise upwards. This would reduce the number of methods we define, but may be too "smart". What do you think?

matthias314 avatar Mar 09 '25 13:03 matthias314

Yes, it would be too smart.

dmaluka avatar Mar 09 '25 13:03 dmaluka

type RegexpSearch struct {

The struct is a good idea. Are you attached to the name RegexpSearch? I find that such a struct is not more related to searching than a single Regexp. I don't want to claim that RegexpGroup is the ideal name, but it conveys the idea that several regexps are grouped together.

func (b *Buffer) FindDown(s string, start, end Loc, useRegex bool, ignoreCase bool) []Loc

I wonder how convenient the arguments useRegex and ignoreCase would be in Lua scripts. (My general approach is that the API should be easy to use from Lua.) If one has an explicit repexp, then one can modify it directly. Moreover, ignoreCase may often just be the buffer setting. I have a draft PR where I use the new search functions in the rest of micro. (This makes the code simpler and shorter.) There I define the function

// RegexpString converts a search string into a string that can be compiled
// to a regexp. It can quotes special characters and switch to case-insensitive
// search if that is the setting for the buffer.
func (b *Buffer) RegexpString(s string, isRegexp bool) string {

Such a function might cover most uses of useRegex and ignoreCase. I'm asking myself whether these arguments to FindDown will be to be more of a help to Lua script writers or a burden.

matthias314 avatar Mar 09 '25 19:03 matthias314

I don't want to claim that RegexpGroup is the ideal name, but it conveys the idea that several regexps are grouped together.

That is exactly the kind of details that I'd prefer to hide, not expose.

dmaluka avatar Mar 09 '25 20:03 dmaluka

That is exactly the kind of details that I'd prefer to hide, not expose.

Fair enough. What about RegexpData? In the case of RegexpSearch one may wonder what the Search part means. Nobody would be puzzled about Data.

matthias314 avatar Mar 09 '25 21:03 matthias314

I've pushed my local git repository. All changes were made in March except for the last one. I haven't squashed the commits into one so far because this way it's easier to see what the intent of each change is. This new version would cleanly rebase onto master. The main changes are:

  • new function Expand to replace submatches in a template,
  • new search functions now always return submatches; "Submatch" not part of their name anymore,
  • some other functions were renamed,
  • fix #3700 and a minor bug in findCallback that prevented regexp error messages from being displayed.

I have to go over the previous again to see if there is any concern that isn't addressed in this version. If would be great if we could wrap this PR up now.

matthias314 avatar Jul 21 '25 03:07 matthias314

I haven't squashed the commits into one so far because this way it's easier to see what the intent of each change is.

And that is not exactly helpful for review. I just started reviewing the PR again from the beginning and started writing comments like the above one, before I noticed that the changes I'm commenting on are "outdated".

dmaluka avatar Aug 24 '25 13:08 dmaluka

Are you attached to the name RegexpSearch? I find that such a struct is not more related to searching than a single Regexp.

Not exactly attached, I just have no idea of a better name right now.

What about RegexpData? In the case of RegexpSearch one may wonder what the Search part means. Nobody would be puzzled about Data.

Everybody would be puzzled about what "data" we are talking about. (Maybe unless you clearly document it, but you haven't done that.)

RegexpSearch structure would represent an abstract state of a regexp-based search. For instance, now we only initialize this structure when creating it, i.e. when preparing for the search, and don't modify it afterwards during the search, but there is no reason why wouldn't we modify this structure in consecutive FindRegexpDown() calls etc if we had any reason to do so in the future, and that wouldn't require changing the API, and the API users wouldn't need to bother thinking about such details.

dmaluka avatar Aug 24 '25 13:08 dmaluka

I haven't squashed the commits into one so far because this way it's easier to see what the intent of each change is.

And that is not exactly helpful for review. I just started reviewing the PR again from the beginning and started writing comments like the above one, before I noticed that the changes I'm commenting on are "outdated".

Just noticed that you said "I haven't squashed the commits into one"... So, just in case, I'm not suggesting to squash them all into one commit. What I'm suggesting (as always) is: separate commits per separate functional or logical changes (as that makes PRs easier and faster to review), but without changes outdated by later changes in the same PR (as that makes PRs needlessly harder and slower to review).

dmaluka avatar Aug 24 '25 14:08 dmaluka

@matthias314: Can you please rebase your PR accordingly?

JoeKar avatar Oct 19 '25 11:10 JoeKar

OK. I'm busy at the moment. I'll do it soon.

matthias314 avatar Oct 21 '25 18:10 matthias314

I've rebased the PR, hopefully in the way you had in mind. Please have a look at it.

matthias314 avatar Oct 29 '25 01:10 matthias314

I haven't properly read this thread or tried to review this PR yet, but I think 362e10ba007ca3d9008ed032ce02be60c916e23d isn't like what @dmaluka requested. It might take a while to think and split changes like so on a PR at this size.

Some commits months ago may be different, but I think this is how the maintainers organized changes on large PRs:

  1. With one commit, attempt to show and achieve only one feature or small goal that could be remembered like a step
  2. Moving existing code is done alone under one individual commit, and changes on the moved code are done in later commits
  3. As much as possible, changes in one commit should be possible to summarize in the subject around the conventional limit of 50 characters, with enough detail to understand it in general

Although not good examples, I have 2 smaller unrelated experimental branches which are Clipper v2 and short-usage on Micro. I believe the above points can be observed on some commits there:

  • Like point 1, niten94/micro@a423c482d1616274a190512540b5c2c9e71eae6a applies consistent style used by text modified later in niten94/micro@6118bb65852c0f1acf9c7b1426b07b2dbff1f963
  • Like point 2, niten94/micro-clipper@cab6de97dee5b121367507486e46dc04015da1b5 moves code into a new type (that was supposed to be) updated in later commits without changing functionality yet

Additionally, I think it would be fine to add small functions like ones under util in the same commit that requires them.

niten94 avatar Oct 30 '25 18:10 niten94

I am sorry that my comment was unclear on how this PR should exactly proceed. To be clear, I'm only seeing if I can help with anything and decided to join in the review.

Since I have a lot more time at the moment, I'm currently looking more into this PR. I think the general idea is good, but first: it doesn't seem like there had been any pushes which more clearly separates the changes in search.go.

I posted the above comment to implicitly explain differently how the PR should be rebased, and suggest to @matthias314 to try rebasing in such way since it's not easy for others to see the thought process by step anytime.

But since I have time now, I am now trying to rebase the PR without any differences to the end result of the modifications. Anyone's rebase could also probably be used as a reference to how @matthias314 (as the author) should make commits here, and not only make review easier.

niten94 avatar Nov 03 '25 03:11 niten94

@niten94 Thanks a lot for looking into this!

I found it difficult to combine the existing commits into smaller groups because all those which are now under the umbrella of "modified search and replace methods" had the same goal. Taken alone, any smaller commit did not seem to make much sense. One indication of this is that the commits were interleaved, leading to merge conflicts when moving them around during a rebase.

matthias314 avatar Nov 03 '25 14:11 matthias314

I meant that the changes can be divided into smaller goals within the code, however I see the difficulty of it now.

I've pushed my attempt to rebase at wip/enhance-search for now. I haven't been able to include most changes yet, but I still believe (and hope) it can be done. Even though the diff doesn't seem displayed as expected on GitHub, the rest in search.go should be cleaner now as seen with a command like below if the remote names match:

git diff niten94/wip/enhance-search...matthias314/m3/find-func

niten94 avatar Nov 03 '25 17:11 niten94

Sorry for the long delay.

I've now rebased the whole PR at my branch and included small documentation changes. Without adding changes outdated later in the PR, I think it's impossible to produce a commit with a clearer diff at the part near ReplaceRegex even with more time than reasonable.

@matthias314 could "reset" (not sure if it's the right word) to my branch so that the review can be resumed, right?

It's fine to alter it by ways like moving commits around or rewording the messages for example, but hopefully it won't be too hard to work on.

niten94 avatar Nov 15 '25 10:11 niten94

so that the review can be resumed

Just in case, actually the main reason why the review (at least from my side) is currently in hold is just because I'm busy af lately. Sorry for that. I hope I'll get back to it any time soon.

dmaluka avatar Nov 15 '25 13:11 dmaluka