
Initial experiments with serialized testing

Open ratmice opened this issue 6 months ago • 2 comments

This is mostly just for collecting feedback, as this patch is fairly incomplete. It implements the serialization stuff, and it adds a -s switch to nimbleparse to output the serialized format.

It also adds some design docs giving a basic overview of what I intend to do, but these are necessarily incomplete, and at times speculative based upon how I hope or imagine it might be made to work.

I don't think it is too far from actually deserializing tests and doing the pass/fail checking, but before going through the trouble of implementing that, it seemed worth doing a fairly complete proof of concept of the serialization format stuff.

The basic goals here have been to keep the syntax fairly noise free while being well typed (i.e. instead of dynamically distinguishing tests from vectors of tests, using a separate extension for the vector of tests), and to work well with both raw strings and escaped strings, for grammars containing quotation marks, newlines, raw strings, etc.

While doing this, I noted that test_files currently only allows a single glob, but when we start dealing with multiple file extensions, we are probably going to need a way to specify multiple globs.

ratmice avatar Jun 13 '25 20:06 ratmice

Let me start with a dumb question: I think what this PR is aiming to do is a way of saying "for a grammar G and an input I, let me write a test for it". And, I think, furthermore it's trying to allow users to say "I'm only going to write a test for a subset of I w.r.t. G"?

ltratt avatar Jun 13 '25 22:06 ltratt

I'm not certain I understand the second part, specifically w.r.t. subsets, but I believe the answer is yes; the overall testing process is below. Edit: unless you mean subset with regards to the recent clone_and_change_start_rule function, in which case the answer is no, this doesn't attempt to do anything with passing input to subsets of grammars with regard to start rules. I don't think that is what you meant, because you say subset of I rather than subset of G.

  1. Build an RTParserBuilder from the parser.
  2. Read the .grmtest or .grmtests file.
  3. For each GrammarTest, pass the input field to the RTParserBuilder from step 1. Compare the output of parse_map building an ASTRepr to the fields of the GrammarTest deserialized in step 2, using the following:
    • If GrammarTest.pass is true and GrammarTest.ast.is_none(), check that the returned errors.is_empty().
    • If GrammarTest.pass is false and GrammarTest.errors.is_none(), check that the returned errors is non-empty.
    • If GrammarTest.pass is true and GrammarTest.ast.is_some(), check that the ASTRepr built by parse_map matches Some(GrammarTest.ast).
    • If GrammarTest.pass is false and GrammarTest.errors.is_some(), check that the errors returned by parse_map match GrammarTest.errors.

So essentially there are 4 kinds of tests supported: it parses successfully; it should fail to parse; and, more specifically, it should parse successfully and produce this exact AST; it should fail to parse and produce this exact error. One of these tests gets run for each input.

There is a default value of pass: true when just (input: "text") is specified.

I guess I would say this more formally as:

"for a grammar G, an input I, and an expected result E, the test is the comparison (G(I) == E) "
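The four checks above can be sketched as a single decision table. A minimal sketch; all names here (GrammarTest, AstRepr, TestError, ParseOutcome) are illustrative stand-ins of this sketch, not the actual types in the patch:

```rust
// Illustrative stand-ins; the real patch presumably defines its own types.
#[derive(Debug, PartialEq)]
struct AstRepr(String); // stand-in for a generic parse-tree representation

#[derive(Debug, PartialEq)]
struct TestError(String); // stand-in for a parse-error representation

struct GrammarTest {
    input: String,
    pass: bool, // defaults to true when only `input` is given
    ast: Option<AstRepr>,
    errors: Option<Vec<TestError>>,
}

// The result of actually running the parser on `input`.
struct ParseOutcome {
    ast: AstRepr,
    errors: Vec<TestError>,
}

// The four test kinds from the list above, as one decision table.
fn check(test: &GrammarTest, got: &ParseOutcome) -> bool {
    match (test.pass, &test.ast, &test.errors) {
        // Expected to pass, no AST given: just require no errors.
        (true, None, _) => got.errors.is_empty(),
        // Expected to pass with an exact AST.
        (true, Some(ast), _) => got.errors.is_empty() && got.ast == *ast,
        // Expected to fail, no errors given: just require some error.
        (false, _, None) => !got.errors.is_empty(),
        // Expected to fail with exact errors.
        (false, _, Some(errs)) => got.errors == *errs,
    }
}

fn main() {
    let t = GrammarTest {
        input: "2+3".to_string(),
        pass: true,
        ast: None,
        errors: None,
    };
    let ok = ParseOutcome { ast: AstRepr("Start".into()), errors: vec![] };
    assert!(check(&t, &ok));
    println!("pass/fail check ok");
}
```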

ratmice avatar Jun 14 '25 04:06 ratmice

I quickly started looking at wildcards, got distracted, and dropped the ball. Sorry!

Let me take a step back and say what I think we've got here:

  1. The ability to take a grammar and input in and produce a stable, textual, version of the AST.
  2. The ability to compare two stable, textual, versions of the AST. (Do we compare the text or the ASTs?)

Have I got that right?

Next question: if things don't match in (2), do we just get "fail" or is the aim to give a detailed diff of some sorts?

ltratt avatar Jun 25 '25 06:06 ltratt

No worries; this is, I guess, the kind of thing I'd prefer to spend time on and get done right rather than get done quickly, even if that means it misses the next release.

For what I currently have implemented, (1) is almost right, except we do not have a stable textual version of the AST. We have the ability to produce an AST from a string, and a string from an AST, but the string formats are not stable.

For point (2) then: serialization is not used for the actual testing, but only to encode ASTs into tests. We compare ASTs directly (i.e. PartialEq for AST): this generates an AST from an input string, then calls Deserialize on a string representation of an AST encoding the expected result (producing an expected AST), then does the AST comparison.

Most of my thoughts have aimed to preserve this basic structure.

e.g. If we have

enum AST {
    Term { /* ... */ },
    NonTerm { /* ... */ },
}

Then for the wildcards/matching:

enum ASTPattern {
    AST(AST),
    Wildcard,
    Hole,
}

That is to say we would Deserialize an ASTPattern, and then using a custom method compare the AST produced from the input with the deserialized ASTPattern. I think it is possible to do that without having to deal with string escaping, or requiring a stable format (i.e. where minor things like whitespace changes won't start producing string comparison failures).
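A minimal sketch of what that custom comparison could look like. These types are the shapes from this discussion, not the patch's actual definitions, and I've omitted `Hole` since its semantics aren't pinned down above:

```rust
#[derive(Debug, PartialEq, Clone)]
enum Ast {
    Term { name: String },
    NonTerm { name: String, children: Vec<Ast> },
}

enum AstPattern {
    // An exact subtree: falls back to the derived PartialEq on Ast.
    Exact(Ast),
    // Matches any single subtree.
    Wildcard,
    // A named node whose children are themselves patterns.
    NonTerm { name: String, children: Vec<AstPattern> },
}

// Compare a deserialized pattern against the AST produced from the input.
// No string escaping is involved: wildcards live in the pattern type,
// never in the tree data itself.
fn matches(pat: &AstPattern, ast: &Ast) -> bool {
    match pat {
        AstPattern::Wildcard => true,
        AstPattern::Exact(expected) => expected == ast,
        AstPattern::NonTerm { name, children } => match ast {
            Ast::NonTerm { name: n, children: cs } => {
                n == name
                    && cs.len() == children.len()
                    && children.iter().zip(cs).all(|(p, c)| matches(p, c))
            }
            Ast::Term { .. } => false,
        },
    }
}

fn main() {
    let ast = Ast::NonTerm {
        name: "Plus".into(),
        children: vec![
            Ast::Term { name: "Int".into() },
            Ast::Term { name: "Int".into() },
        ],
    };
    let pat = AstPattern::NonTerm {
        name: "Plus".into(),
        children: vec![AstPattern::Wildcard, AstPattern::Wildcard],
    };
    assert!(matches(&pat, &ast));
    println!("pattern match ok");
}
```

Because wildcards here are variants of the pattern type rather than strings, a token named "..." in the tree can never be confused with a wildcard.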

The downside is that I'm not super familiar with the state of the art of these tree/pattern matching algorithms, and in many cases they seem to be NP-hard. Anyhow, that is at least where my head has largely been with this patch.

For your next question: my hope was that we'd have a few forms of test output. For cases where we just have an input string and a boolean which says whether we expect the test to pass/fail, we'd just report pass/fail.

Once we have an encoded AST we can either do something dumb (probably a good starting point) and just dump both ASTs, or in the future we could do something smarter, like producing a diff of the ASTs. My hope was to have diffing added in a follow-up, but keep it in mind so that we can reasonably implement it in the future.

I think for things with patterns the easiest thing to do is to output both the AST and the pattern; maybe there is some sensible diff algorithm based on partial matches, but it isn't clear to me.

ratmice avatar Jun 25 '25 18:06 ratmice

Quick dumb question, which might just be terminological: when you say "AST" I think I would say "parse tree"? In other words, we mean a tree that still has "all the stuff in it from the grammar", much of which a compiler might later choose to discard because it has no (subsequent) semantic effect.

I think the fundamental question in terms of effort then becomes:

  1. Do we just aim to facilitate "pure text matching" i.e. that grammar G and input I always give output T. We could even do that, I think, by just using nimbleparse's current output and simple textual equality.
  2. But pure text matching means you can't change G without affecting T, which makes it of limited use.

So because (2) is so limited, it makes us think "how do we introduce some sort of fuzziness" to matching so that not every test is invalidated as soon as G changes? There are various tree-based algorithms which could help with the fuzziness. I have seen someone implement those in the past, and it's doable, but not a 60 minute job by any means!

Another question is "how do we want users to encode these tests?" Do we want them to express them directly in Rust? Or some sort of DSL? Or both?

I'm actually starting to wonder if what we're slowly groping our way towards is a "grammar testing DSL". I don't know if anything like this exists, but we can fairly easily imagine something. Let's imagine I have a calculator grammar G; I might then write multiple tests in our imaginary DSL (which, because I'm unimaginative, I'm thinking of in lang_tester-esque syntax) like this:

input: 2+3
tree:
  Start
    Plus
      Int 2
      Int 3

and:

input: 2 +
error: Insert "," at line 1 position 5

What about fuzzy matching? Well, I've deliberately used nimbleparse-ish parse-tree-indented syntax above. I think that lends itself well to very basic fm-textual wildcarding:

input: 2 + 3 * 4
tree:
  Start
    ...
      Mul
        Int 3
        Int 4

where I could change input to 2 / 3 * 4 and the test would still succeed.

You may notice, though, that I've still had to specify my indentation level very carefully in that test -- Mul is 1 level of indentation deeper than whatever ... matched. If I change my grammar and it has more nesting, should my test break?

I think the easy way around this would be to say that indentation levels in this DSL are relative, not absolute. So when Mul is "1 level of indentation" below ... we would interpret that as "at least 1 level of indentation below ...". That then allows me to change the structure of my grammar quite a bit and my test to succeed -- but not, I think, for my test to be so weak that it always succeeds.
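One way to read "relative indentation" is as descendant-chain matching: each named step must appear somewhere strictly below the previous one. A rough sketch under that reading; `Node` and the flat chain-of-names pattern representation are assumptions of this illustration, not the proposed format:

```rust
struct Node {
    name: String,
    children: Vec<Node>,
}

// Does `pat` match as a chain of descendants starting at or below `node`?
// Each step may sit arbitrarily deeper than the previous one, which is the
// "at least 1 level of indentation below" reading described above.
fn chain_matches(node: &Node, pat: &[&str]) -> bool {
    if pat.is_empty() {
        return true;
    }
    if node.name == pat[0]
        && (pat.len() == 1 || node.children.iter().any(|c| chain_matches(c, &pat[1..])))
    {
        return true;
    }
    // Otherwise try to start the whole chain deeper in the tree.
    node.children.iter().any(|c| chain_matches(c, pat))
}

fn main() {
    // Start -> Plus -> Mul -> (Int, Int), roughly the 2 + 3 * 4 example.
    let tree = Node {
        name: "Start".into(),
        children: vec![Node {
            name: "Plus".into(),
            children: vec![Node {
                name: "Mul".into(),
                children: vec![
                    Node { name: "Int".into(), children: vec![] },
                    Node { name: "Int".into(), children: vec![] },
                ],
            }],
        }],
    };
    // "Start ... Mul Int" still matches even though Plus sits in between,
    // so restructuring the grammar's nesting doesn't break the test.
    assert!(chain_matches(&tree, &["Start", "Mul", "Int"]));
    assert!(!chain_matches(&tree, &["Start", "Sub"]));
    println!("chain match ok");
}
```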

This is a long morning ramble, and I might have completely failed to understand subtle points here, so buyer beware :)

ltratt avatar Jun 26 '25 06:06 ltratt

Quick dumb question, which might just be terminological: when you say "AST" I think I would say "parse tree"? In other words, we mean a tree that still has "all the stuff in it from the grammar", much of which a compiler might later choose to discard because it has no (subsequent) semantic effect.

Yeah, when I say "AST" I mean "parse tree", or "serializable generic parse tree". (I have a genetic issue with my hands that makes most of my fingers not actually bend, so I tend to fall back on names that are as few characters as possible, and shorten them by frequency of use, sorry.)

I think the fundamental question in terms of effort then becomes:

1. Do we just aim to facilitate "pure text matching" i.e. that grammar G and input I always give output T. We could even do that, I think, by just using `nimbleparse`'s current output and simple textual equality.

2. _But_ pure text matching means you can't change G without affecting T, which makes it of limited use.

So because (2) is so limited, it makes us think "how do we introduce some sort of fuzziness" to matching so that not every test is invalidated as soon as G changes? There are various tree-based algorithms which could help with the fuzziness. I have seen someone implement those in the past, and it's doable, but not a 60 minute job by any means!

Yeah indeed. Let me also take a few steps back and lay out some more basic thoughts; first let me just show the original nimbleparse-lsp testing mechanism this evolved from, and its limitations. The main toml file is here, which specifies globs akin to `%grmtools{test_files: "*.test"}`, and it only supports the boolean options:

https://github.com/ratmice/crimson/blob/main/nimbleparse.toml https://github.com/ratmice/crimson/tree/main/test

One of the big limitations I was hoping to lift with this was the inability to encode multiple tests into a single file, using a vector of inputs. I found myself having a lot of really small input files with silly names like "struct1.test", "struct2.test"; it would be nice to have fewer files, combined into one file containing a vector of test inputs named "struct.test".

From there, the next logical step is also encoding the expected parse tree, and the expected error if any. But I would expect that, while the grammar G is still changing rapidly, the basic pass/fail is the primitive but primary use case when developing a grammar, because it's still useful but isn't so subject to change. In that mode it basically specifies the shape of the input, without specifying the shape of the output except as a very coarse boolean.

For embedding inputs/outputs, my concern here is more that we aren't making any format choices that preclude us from encoding the expected outputs in a nice way. But it was not really my focus so much as being the primary mode of use.

Another question is "how do we want users to encode these tests?" Do we want them to express them directly in Rust? Or some sort of DSL? Or both?

For something like nimbleparse_lsp, testing occurred at (within reason) every key press changing the test files or the grammar/lexer. Essentially there was a timeout: when you stopped editing for a few seconds it would generate an RTParserBuilder, and start feeding that RTParserBuilder the test input.

I had also experimented with using catch_unwind and using an AtomicBool to communicate with ActionFn and a custom parse_actions, so that when the grammar was modified, the bool would be flipped, and after that the test run would panic/abort parsing the currently running tests.

So I would definitely prefer it not be in Rust, because that requires a whole compilation cycle, which isn't really feasible at interactive speeds. Generating an RTParserBuilder is already pushing it for large grammars. What I would do was wrap the ActionFn using parse_actions, using an AtomicBool to specify if we should abort testing due to an out-of-date grammar.

So I definitely was going for DSL rather than anything directly testing rust user actions, I think the latter is likely to need a significantly different design. (At the very least for nimbleparse_lsp which intends to be run in the browser, so we're trying to avoid porting the rust compiler to run in the browser and generate web-assembly which will subsequently get run in the browser)

So my focus has definitely been on DSL over rust for expressing tests.

I'm actually starting to wonder if what we're slowly groping our way towards is a "grammar testing DSL". I don't know if any thing like this exists, but we can fairly easily imagine something. Let's imagine I have a calculator grammar G; I might then write multiple tests in our imaginary DSL (which, because I'm unimaginative, I'm thinking of in lang_tester-esque syntax) like this:

input: 2+3
tree:
  Start
    Plus
      Int 2
      Int 3

So I guess my difficulty with this syntax is that it isn't really explicit about newlines: is that "2+3" or "2+3\n"? I assume that for multi-line inputs it's whitespace sensitive and relies on the indentation level of the input string. Presumably people may need to specify either. It is nice that it doesn't seem to induce any quoting requirements around quotation marks.

This explicit input is why I went with Ron for the first try, because "2+3\n", "2+3", and

r#"2 + 3
"#

are all expressible. So it has both the power of raw strings for longer multi-line examples, and explicit quoting requiring escapes, which can be used depending on whichever is most convenient.

What about fuzzy matching? Well, I've deliberately used nimbleparse-ish parse-tree-indented syntax above. I think that lends itself well to very basic fm-textual wildcarding:

input: 2 + 3 * 4
tree:
  Start
    ...
      Mul
        Int 3
        Int 4

Yeah, here is where I start worrying about the case where the parse tree has an embedded ... like [0...5], and this accidentally triggering the wildcard match. This feels like a similar issue to me as the one for explicitness and escaping of input; because the names of tokens can contain pretty much any quoted thing, it seems like we'll also have to deal with an output tree like:

input: [0...5]
tree:
   Start
      ...
         "..."
             Int 0
             Int 5

It just seems to me like it'll end up with special cases where we need to escape to avoid considering output to be a wildcard, which we can avoid with tree matching in the case of something like:

input: "[0...5]",
tree:

Because wildcards are part of the tree format, rather than being a parameter to a terminal or non-terminal node, we no longer have to consider any potential matches occurring within the tree data that could conflict.

I'm still using ron with explicit types here; using the implicit types this would reduce to something like:

input: "[0...5]",
tree: 
   ("Start", 
         Wildcard(
              ("...", [
                   ("Int", 0),
                   ("Int", 5)
              ])))

Using a DSL we can replace Wildcard with the ellipses while retaining the same typing discipline which keeps wildcards out of the data stream.

   ("Start", 
         ...
              ("...", [
                   ("Int", 0),
                   ("Int", 5)
              ]))

I guess my concern is that if we treat tree: as a serialized string and do fuzzy matching on that string, we now have to deal with all the co-mingling of data and match signal. There are then a bunch of corner cases, since it isn't well specified whether wildcards are allowed or disallowed inside/outside the quotes anymore:

input: "[0...5]",
tree: r#"
   Start
      ...
         "..."
             Int 0
             Int 5
"#

Hopefully that goes into more detail at least about my hang-up with this whole approach. Sorry that it is extremely long :cry:

ratmice avatar Jun 26 '25 19:06 ratmice

Presumably people may need to specify either. It is nice that it doesn't seem to induce any quoting requirements around quotation marks?

Sorry, you're the victim of my laziness and haste. I was assuming that we format terminals quoted (and escaped as necessary). Then we can allow the user to specify their own wildcard if we want, though I'm not even sure if ... is/should be a legal production name in yacc/grmtools. We could for example use curly brackets to express nesting levels to avoid problems with indentation. We could have one operator for "skip but stay on the same level of nesting" and another for "skip and ignore any increases in indentation". And I'm definitely a big fan of having multiple tests in one file too!

ltratt avatar Jun 28 '25 21:06 ltratt

Makes sense. Initially, I guess, my hesitation about doing a custom format was just that the rust raw string format isn't something we can easily replicate with lrlex, because it's very context sensitive (even the limited context sensitivity given by start states is not enough); so if we want raw strings that will require a manual lexer.

Above and beyond that, I'm not certain we can rely on lrpar without introducing any bootstrapping issues. But since it is lrlex (CTLexerBuilder) and nimbleparse that need this, and both already depend on lrpar, we can probably depend on an lrpar-defined grammar for the test DSL without any actual problems; that is my hope at least!

Anyhow I think I have a good enough idea now that I should probably be able to start playing with implementations.

ratmice avatar Jun 29 '25 00:06 ratmice

Bootstrapping introduces all sorts of fun and games. I'm not sure I'd advocate for it TBH if there's even the slightest whiff of problems! For such a simple DSL it isn't too difficult to write a recursive descent parser (a la cfgrammar) if necessary.

ltratt avatar Jun 29 '25 16:06 ltratt

Yeah, indeed. I also realized that even if we did avoid raw string context sensitivity we probably still couldn't use lrlex, just because of bootstrapping: it would introduce a circular dependency.

ratmice avatar Jun 29 '25 18:06 ratmice

So coming back to this: I think I am still strongly in favour of the DSL route. The exact syntax we could endlessly bikeshed if we wanted to :) I think the crucial thing is (1) to allow multiple tests in a file (fortunately this is easy!) (2) some concept of wildcarding so that users can have tests that don't require full rewriting for every minor tweak of a grammar.

ltratt avatar Jul 03 '25 07:07 ltratt

Sorry, I have basically been chewing on the yaml-like syntax. I really am basically totally unfamiliar with yaml; I assume we would use the yaml-like vector syntax for (1), multiple tests:

- input: 1+2
  tree:
    ...
       Plus
          Int 1
          Int 2

What I have a bit of difficulty wrapping my head around is that we've now got multiple layers of whitespace sensitivity: to know the end of a vector entry and the end of the map fields, but also in the tree: data for wildcard matching.

One of the things that bothers me a bit is that it seems to require you to reindent the tree data to conform to the yaml syntax, which could inhibit just copy/pasting tree data from nimbleparse to a test file.

Anyhow, one of my thoughts has been just doing a first pass, trying to reuse the %grmtools section parser in grmtools_section.test, adding a [...] vector syntax or something (I think it may already be wanted for specifying multiple globs anyway). The hard part there, I suppose, is figuring out how to embed the tree data within a field.

ratmice avatar Jul 03 '25 08:07 ratmice

I am not sure I would recommend yaml TBH. It's a very horrible syntax when it comes to newlines. I think we can just use our own (e.g. in something approaching lang_tester format).

ltratt avatar Jul 03 '25 08:07 ltratt

Ahh, sorry, I thought the lang_tester format was using yaml.

ratmice avatar Jul 03 '25 08:07 ratmice

Anyhow, I guess my question of how we want to handle multi-test files remains, because I don't think the normal comma-separated sequence of elements within [] works very well for this, which is why I was asking about the yaml-style vectors, e.g.

- input: ()
  tree:
    Start
- input: a
  tree:
    Start
      A

otherwise perhaps we could throw curly braces around it and use the normal vector style?

[{
  input: ()
  tree:
    Start
}, {
  input: a
  tree:
    Start
      A
}]

ratmice avatar Jul 03 '25 10:07 ratmice

Curly brackets are fine with me!

ltratt avatar Jul 04 '25 19:07 ltratt

So, sorry I haven't gotten back to this one yet; I haven't been in a place where I could page it all back into my head, nor reasonably flush all the other stuff out. That's largely done now, and I just need to find the time/motivation to get started again on this (I always have trouble getting started). Sorry, and I appreciate the patience.

ratmice avatar Jul 15 '25 14:07 ratmice

Totally understood, and I for one am grateful for whatever time you're able to find, whenever that is!

ltratt avatar Jul 15 '25 14:07 ltratt

@ratmice Any luck with this one?

ltratt avatar Oct 15 '25 21:10 ltratt

Sorry, kind of been side tracked with other projects.

My plan was to leave this one for a future release, and do all the stuff needed so we could add this compatibly when the time comes. PR #596 was the first step in that; adding arrays to the grmtools section was the last (I don't think there is an issue or PR for that), which was discussed in https://github.com/softdevteam/grmtools/issues/595#issuecomment-3289995522

So perhaps we can close this one for now, and I'll make a PR for that other piece, which we will need in order to complete this in the future?

ratmice avatar Oct 15 '25 21:10 ratmice

So perhaps we can close this one for now, and I'll make a PR for that other piece, which we will need in order to complete this in the future?

I think that would then allow us to make a 0.14.0 release that doesn't have an obvious "this has to be finished and will be a breaking change" issue? If so, it sounds like a good idea: I think a new release is due!

ltratt avatar Oct 16 '25 06:10 ltratt

Yeah, that is definitely the intent.

In a future release we'll just replace the "test_files extensions beginning with grm are reserved." error from #596 with the implementation of the serialized testing (for some specific extension I'd suggest .grmtest, we can leave the error in place for other extensions beginning with .grm in case we don't get it right the first time).

Until then, arrays in the %grmtools section won't be very useful, since we only have one type of test_files behavior (raw input with pass/fail), which generally won't need multiple globs. But adding them now keeps us from having to change the test_files key to allow multiple globs once we do have multiple test_files behaviors.

That pretty much covers the user-visible aspects, and there isn't really any API surface since it's all implementation behavior within the CTParserBuilder and nimbleparse, so I think we're in the clear in avoiding any breaking changes when adding it in the future.

I'll try to sit down and work on the array syntax over the next few days.

ratmice avatar Oct 16 '25 07:10 ratmice

Going to close this one for now, and will create a new PR w/ the stuff we discussed when I revisit this in the future.

ratmice avatar Oct 22 '25 15:10 ratmice