libfsm icon indicating copy to clipboard operation
libfsm copied to clipboard

Implementing replaceAll with generated matchers?

Open dkegel-fastly opened this issue 4 years ago • 3 comments

I'm trying to port a use of golang's https://golang.org/pkg/regexp/#Regexp.ReplaceAllString to tinygo, where regexp doesn't work just yet. libfsm works nicely, i.e. I can use

re -l go -k str  "ab"

to generate a nice matcher. But it isn't especially helpful for implementing ReplaceAllString, as it returns neither the starting nor the ending index.

Returning starting and ending indices seems easy, e.g.

--- a/src/libfsm/print/go.c
+++ b/src/libfsm/print/go.c
@@ -92,14 +92,14 @@ print_end(FILE *f, const struct dfavm_op_ir *op, const struct fsm_options *opt,
        enum dfavm_op_end end_bits, const struct ir *ir)
 {
        if (end_bits == VM_END_FAIL) {
-               fprintf(f, "{\n\t\treturn -1\n\t}\n");
+               fprintf(f, "{\n\t\treturn -1, -1\n\t}\n");
                return;
        }
 
        if (opt->endleaf != NULL) {
                opt->endleaf(f, op->ir_state->end_ids, opt->endleaf_opaque);
        } else {
-               fprintf(f, "{\n\t\treturn %lu\n\t}\n", (unsigned long) (op->ir_state - ir->states));
+               fprintf(f, "{\n\t\treturn int(start_idx), int(idx+1)\n\t}");
        }
 }
 
@@ -175,6 +175,9 @@ fsm_print_gofrag(FILE *f, const struct ir *ir, const struct fsm_options *opt,
                if (op->num_incoming > 0) {
                        print_label(f, op, opt);
                        fprintf(f, "\n");
+                       if (op->index == 0) {
+                               fprintf(f, "\tvar start_idx = idx + 1\n");
+                       }
                }
 
                fprintf(f, "\t");
@@ -257,7 +260,7 @@ fsm_print_go(FILE *f, const struct fsm *fsm)
 
        switch (fsm->opt->io) {
        case FSM_IO_PAIR:
-               fprintf(f, "(data []byte) int {\n");
+               fprintf(f, "(data []byte) (start, end int) {\n");
                if (ir->n > 0) {
                        /* start idx at -1 unsigned so after first increment we're correct at index 0 */
                        fprintf(f, "\tvar idx = ^uint(0)\n");
@@ -266,7 +269,7 @@ fsm_print_go(FILE *f, const struct fsm *fsm)
                break;
 
        case FSM_IO_STR:
-               fprintf(f, "(data string) int {\n");
+               fprintf(f, "(data string) (start, end int) {\n");
                if (ir->n > 0) {
                        /* start idx at -1 unsigned so after first increment we're correct at index 0 */
                        fprintf(f, "\tvar idx = ^uint(0)\n");

and would let me implement a ReplaceAllString.

Is there an easier way? I feel like I haven't yet found the appropriate doc...

dkegel-fastly avatar Jun 01 '21 16:06 dkegel-fastly

Here's a demo of what I'm trying to do. Not quite happy yet...

package main
  
import (
        "fmt"
        "regexp"
)

func main() {
        input := "aaa aaa"

        // Porting FindStringIndex() to fsm
        r := regexp.MustCompile(`a{2,3}`)
        match := r.FindStringIndex(input)
        fmt.Printf("regexp: r.FindStringIndex returns %d, %d\n", match[0], match[1])
        start, end := Match(input)
        fmt.Printf("fsm: Match returns %d, %d\n", start, end)

        // Porting ReplaceAll to fsm
        out := r.ReplaceAllString(input, "b")
        fmt.Printf("regexp: ReplaceAllString returns %s\n", out)
        out = myReplaceAll(input, "b")
        fmt.Printf("fsm: myReplaceAll returns %s\n", out)
}

// myReplaceAll finds all sequences in value that Match, and replaces them with repl.
// Any matches that might result from the replacement are ignored.
func myReplaceAll(value, repl string) string {
        remaining := value
        out := ""
        for remaining != "" {
                start, end := Match(remaining)
                if start == -1 {
                        out = out + remaining
                        remaining = ""
                } else {
                        println("match", remaining[start:end], "start", start, "end", end)
                        out = out + remaining[0:start] + repl
                        remaining = remaining[end:]
                }
        }
        return out
}

Save that as main.go, then create match.go with something like

re -k str -l go -r pcre 'a{2,3}' | sed 's/fsm_fsm/main/' > match.go

and run with

go run main.go match.go

Currently, it outputs

regexp: r.FindStringIndex returns 0, 3
fsm: Match returns 0, 2
regexp: ReplaceAllString returns b b
match aa start 0 end 2
match aa start 2 end 4
fsm: myReplaceAll returns ba ba

Examining the generated code, fsm expands the quantifier a{2,3} to return immediately after the 2nd a, without even trying to consume a 3rd one. (I filed https://github.com/katef/libfsm/issues/358 for that earlier, but may not have provided enough context.)

This is a fine optimization for detecting a match, so evidently my hack to return the end offset of the match is breaking an assumption libfsm made, and/or I'm misusing libfsm in some other way.

dkegel-fastly avatar Jun 02 '21 01:06 dkegel-fastly

I think you need to call Match again after you replace to find the next index. Also a good idea to use a slice instead of string here so you avoid a bunch of allocations.

Looks like you're at Fastly, so may be worth looking at the Varnish implementation if that still exists. In particular you need to decide what to do if the replacement results in another match.

On Tue, Jun 1, 2021 at 20:24 dkegel-fastly @.***> wrote:

I'm flailing around a bit trying to figure out how to port that regexp to libfsm. Using

re -k str -l go -r pcre 'a{2,}[^a]' | sed 's/fsm_fsm/main/' > match.go

outputs

regexp: r.FindStringIndex returns 0, 3 fsm: Match returns 0, 3 regexp: ReplaceAllString returns b b fsm: myReplaceAll returns b aaa

Using

re -k str -l go -r pcre 'a{2,}' | sed 's/fsm_fsm/main/' > match.go

yields

regexp: r.FindStringIndex returns 0, 3 fsm: Match returns 0, 1 regexp: ReplaceAllString returns b b fsm: myReplaceAll returns bba bba

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/katef/libfsm/issues/357#issuecomment-852638614, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABJFR36WPKJXSDQ2JZDHW3TQWB6JANCNFSM455AH5UA .

dhobsd avatar Jun 02 '21 01:06 dhobsd

Thanks. I updated the problem description to address your last suggestion about induced matches. I'm not worried about the inefficiency of using string for the moment.

I am very interested in suggestions for coaxing libfsm into actually finishing the match of a{2,3} instead of calling it a day after aa.

dkegel-fastly avatar Jun 02 '21 16:06 dkegel-fastly