libfsm
libfsm copied to clipboard
Implementing replaceAll with generated matchers?
I'm trying to port a use of golang's https://golang.org/pkg/regexp/#Regexp.ReplaceAllString to tinygo, where regexp doesn't work just yet. libfsm works nicely, i.e. I can use
re -l go -k str "ab"
to generate a nice matcher. But it isn't especially helpful for implementing ReplaceAllString, as it returns neither the starting nor the ending index.
Returning starting and ending indices seems easy, e.g.
--- a/src/libfsm/print/go.c
+++ b/src/libfsm/print/go.c
@@ -92,14 +92,14 @@ print_end(FILE *f, const struct dfavm_op_ir *op, const struct fsm_options *opt,
enum dfavm_op_end end_bits, const struct ir *ir)
{
if (end_bits == VM_END_FAIL) {
- fprintf(f, "{\n\t\treturn -1\n\t}\n");
+ fprintf(f, "{\n\t\treturn -1, -1\n\t}\n");
return;
}
if (opt->endleaf != NULL) {
opt->endleaf(f, op->ir_state->end_ids, opt->endleaf_opaque);
} else {
- fprintf(f, "{\n\t\treturn %lu\n\t}\n", (unsigned long) (op->ir_state - ir->states));
+ fprintf(f, "{\n\t\treturn int(start_idx), int(idx+1)\n\t}");
}
}
@@ -175,6 +175,9 @@ fsm_print_gofrag(FILE *f, const struct ir *ir, const struct fsm_options *opt,
if (op->num_incoming > 0) {
print_label(f, op, opt);
fprintf(f, "\n");
+ if (op->index == 0) {
+ fprintf(f, "\tvar start_idx = idx + 1\n");
+ }
}
fprintf(f, "\t");
@@ -257,7 +260,7 @@ fsm_print_go(FILE *f, const struct fsm *fsm)
switch (fsm->opt->io) {
case FSM_IO_PAIR:
- fprintf(f, "(data []byte) int {\n");
+ fprintf(f, "(data []byte) (start, end int) {\n");
if (ir->n > 0) {
/* start idx at -1 unsigned so after first increment we're correct at index 0 */
fprintf(f, "\tvar idx = ^uint(0)\n");
@@ -266,7 +269,7 @@ fsm_print_go(FILE *f, const struct fsm *fsm)
break;
case FSM_IO_STR:
- fprintf(f, "(data string) int {\n");
+ fprintf(f, "(data string) (start, end int) {\n");
if (ir->n > 0) {
/* start idx at -1 unsigned so after first increment we're correct at index 0 */
fprintf(f, "\tvar idx = ^uint(0)\n");
and would let me implement a ReplaceAllString.
Is there an easier way? I feel like I haven't yet found the appropriate doc...
Here's a demo of what I'm trying to do. Not quite happy yet...
package main
import (
"fmt"
"regexp"
)
func main() {
input := "aaa aaa"
// Porting FindStringIndex() to fsm
r := regexp.MustCompile(`a{2,3}`)
match := r.FindStringIndex(input)
fmt.Printf("regexp: r.FindStringIndex returns %d, %d\n", match[0], match[1])
start, end := Match(input)
fmt.Printf("fsm: Match returns %d, %d\n", start, end)
// Porting ReplaceAll to fsm
out := r.ReplaceAllString(input, "b")
fmt.Printf("regexp: ReplaceAllString returns %s\n", out)
out = myReplaceAll(input, "b")
fmt.Printf("fsm: myReplaceAll returns %s\n", out)
}
// myReplaceAll finds all sequences in value that Match, and replaces them with repl.
// Any matches that might result from the replacement are ignored.
func myReplaceAll(value, repl string) string {
remaining := value
out := ""
for remaining != "" {
start, end := Match(remaining)
if start == -1 {
out = out + remaining
remaining = ""
} else {
println("match", remaining[start:end], "start", start, "end", end)
out = out + remaining[0:start] + repl
remaining = remaining[end:]
}
}
return out
}
Save that as main.go, then create match.go with something like
re -k str -l go -r pcre 'a{2,3}' | sed 's/fsm_fsm/main/' > match.go
and run with
go run main.go match.go
Currently, it outputs
regexp: r.FindStringIndex returns 0, 3
fsm: Match returns 0, 2
regexp: ReplaceAllString returns b b
match aa start 0 end 2
match aa start 2 end 4
fsm: myReplaceAll returns ba ba
Examining the generated code, fsm expands the quantifier a{2,3} to return immediately after the 2nd a, without even trying to consume a 3rd one. (I filed https://github.com/katef/libfsm/issues/358 for that earlier, but may not have provided enough context.)
This is a fine optimization for detecting a match, so evidently my hack to return the end offset of the match is breaking an assumption libfsm made, and/or I'm misusing libfsm in some other way.
I think you need to call Match again after you replace to find the next index. Also a good idea to use a slice instead of string here so you avoid a bunch of allocations.
Looks like you're at Fastly, so may be worth looking at the Varnish implementation if that still exists. In particular you need to decide what to do if the replacement results in another match.
On Tue, Jun 1, 2021 at 20:24 dkegel-fastly @.***> wrote:
I'm flailing around a bit trying to figure out how to port that regexp to libfsm. Using
re -k str -l go -r pcre 'a{2,}[^a]' | sed 's/fsm_fsm/main/' > match.go
outputs
regexp: r.FindStringIndex returns 0, 3 fsm: Match returns 0, 3 regexp: ReplaceAllString returns b b fsm: myReplaceAll returns b aaa
Using
re -k str -l go -r pcre 'a{2,}' | sed 's/fsm_fsm/main/' > match.go
yields
regexp: r.FindStringIndex returns 0, 3 fsm: Match returns 0, 1 regexp: ReplaceAllString returns b b fsm: myReplaceAll returns bba bba
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/katef/libfsm/issues/357#issuecomment-852638614, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABJFR36WPKJXSDQ2JZDHW3TQWB6JANCNFSM455AH5UA .
Thanks. I updated the problem description to address your last suggestion about induced matches. I'm not worried about the inefficiency of using string for the moment.
I am very interested in suggestions for coaxing libfsm into actually finishing the match of a{2,3} instead of calling it a day after aa.