
New testsuite organization

Open andreaTP opened this issue 5 months ago • 17 comments

Hello 👋

I’m following up here as suggested in this conversation.

Before the PR landed, we didn’t notice that a new testsuite folder structure was introduced. As runtime developers, we now face a challenge: it’s much harder to implement proposals one at a time.

I’m looking for guidance on how to proceed and what to expect going forward.

In chicory, we generate unit tests based on .wast files. Previously, we relied on the folder structure to include only the proposals implemented at each stage. Now that everything is under wasm-3.0, we either need a way to separate files by proposal, or we have to provide at least a minimal implementation of all proposals' opcodes upfront just to get parsing to pass.

I also noticed that in some cases (e.g. binary.wast and binary0.wast) files have been duplicated. Would it be an acceptable middle ground to keep proposals separated at the file level and use more descriptive names, such as binary_eh.wast, to indicate which proposal they cover?

Thanks!

andreaTP avatar Jul 23 '25 16:07 andreaTP

The wasm-3.0 branch (which will be merged into main very soon in one form or the other) doesn't actually change the test structure, it merely has additional proposals merged. Some are in their own directory, some aren't. Unfortunately, the CG did not create any guidelines regarding that, so it was left to the proposal champions to organise themselves. It would make sense to me, though, to e.g. move all tests related to exception handling into a new subdirectory.

In general, however, it is unrealistic that we can divide everything strictly by proposal — for example, because there are tests that only live in the cross product of multiple proposals. Moreover, proposals are ultimately just historical accident and less relevant or useful as a structuring mechanism for future implementers wanting to implement a Wasm engine from scratch. Structuring by proposal in fact makes it much more difficult to find all tests for a given instruction, for example (one reason why I'm not happy with the duplicated tests under multi-memory).

For any incremental implementation work you're gonna need a more fine-grained mechanism anyway. In the past we have briefly discussed some ways of naming or marking tests with annotations, but that discussion didn't go far. Maybe it's time to restart such a discussion.

rossberg avatar Jul 23 '25 18:07 rossberg

It might be possible to add "guards" to the wast, e.g. (if-supports "multi-memory" (module ...)). In SpiderMonkey we generate JS from the wast, and then patch many of the JS files afterwards to add e.g. if (wasmMultiMemorySupported()) around anything that can fail in certain modes.

I think this approach might scale better than organizing by files, since the overlap of proposals makes it all but impossible to organize all the tests into a nice tree.

bvisness avatar Jul 23 '25 18:07 bvisness

I moved exceptions and memory64 tests to their own directory in #1951.

rossberg avatar Jul 23 '25 18:07 rossberg

 it is unrealistic that we can divide everything strictly by proposal

Right, and we have mechanisms in place to exclude even single assertions, but having cross-cutting concerns isolated in fewer files (as opposed to scattered possibly everywhere) is still immensely helpful.

proposals are ultimately just historical accident

This is absolutely right, but from a pragmatic point of view it's the only viable separation we can leverage today.

It might not be optimal or even correct, but it makes it possible to easily split large chunks of work while building a runtime.

Unless we have an (any!) alternative, I'm worried that giving up on this point is going to significantly raise the bar for new runtimes to be implemented.

At this historical point, and in the absence of real alternatives, I believe that "proposals" are a "good enough" way to draw lines and ease the work of implementors.

At the end of the day, they are single bodies of work from the spec standpoint and the connection clearly shows.

The fact that it maps 1:1 to the evolution of the reference interpreter has also proven very helpful during implementation.

For any incremental implementation work you're gonna need a more fine-grained mechanism anyway. 

Again, very correct, but we are (often) able to deal with real-world intricacies given a baseline.

I'm worried about losing an "emerged" baseline while aiming for a perfect solution. On this particular aspect I'd favor any quick-and-dirty pragmatic approach over long-term plans.

andreaTP avatar Jul 23 '25 19:07 andreaTP

@bvisness, guards would make sense to me, though since it's meta-information, I'd use annotation syntax for that: (@if-supports "multi-memory") (module ...).
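A preprocessor for such guards could be quite small. Here is a sketch in Python; note that the (@if-supports "...") annotation is only a proposal from this thread, not part of the official .wast grammar, and the balanced-paren scan below is a simplification that ignores parentheses inside string literals:

```python
import re

# Hypothetical preprocessor for the proposed (@if-supports "...") annotation.
# It drops any top-level form guarded by a feature the target engine doesn't
# support. Simplification: the paren scan ignores parens inside strings.
GUARD = re.compile(r'\(@if-supports\s+"([^"]+)"\)')

def filter_wast(source: str, supported: set[str]) -> str:
    out, i = [], 0
    while i < len(source):
        m = GUARD.search(source, i)
        if not m:
            out.append(source[i:])
            break
        out.append(source[i:m.start()])
        j = source.index("(", m.end())  # s-expression following the guard
        depth, k = 0, j
        while True:                     # find its matching close paren
            if source[k] == "(":
                depth += 1
            elif source[k] == ")":
                depth -= 1
                if depth == 0:
                    break
            k += 1
        if m.group(1) in supported:     # keep the guarded form only if supported
            out.append(source[j:k + 1])
        i = k + 1
    return "".join(out)
```

An engine's test harness could run this over each script before handing it to its wast runner, passing the set of features it implements.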

But yeah, marking up the entire test suite would be a substantial effort. Somebody would have to volunteer to do that.

@andreaTP, the current organisation structure is primarily by construct, although there are historical exceptions to that. If we could clean up some of the problematic exceptions, then I think that's an even better and more flexible approximation. Since most constructs can be grouped by proposal, it should at least be no worse.

The tricky bits are the cross-cutting extensions, as you mention. These are primarily things like multi-value, multi-memory/table, memory/table-64, and new value types. The former three are perhaps best treated as fundamental at this point, with no attempt of separating them out. But new value types and their use in various constructs will be a repeated issue that I'm not sure about how to organise best, because it's an NxM problem. (And we already suck badly at filling that NxM test matrix!)

rossberg avatar Jul 24 '25 08:07 rossberg

Adding annotation syntax (@if-supports "multi-memory") (module ...) to the wast files seems appropriate; it can probably be 90% generated by an LLM after a few initial examples. Guiding the LLM implementation with what is already in SpiderMonkey will be smooth for someone who knows where to look.

andreaTP avatar Jul 24 '25 09:07 andreaTP

I've actually also been thinking about this lately, in the context of importing spec tests into WPT. WPT also likes to have functionality organized by "feature" and of course browsers typically do implement features one proposal at a time, because they aren't usually starting from scratch. In the absence of annotating the whole test suite, I was noting that most of the individual wast files target a specific proposal, and was thinking that encoding that file->proposal association somewhere, perhaps combined with duplicating some tests in case of overlap (and handling one-offs where tests are superseded) would be sufficient for that use case.

An alternative to annotation syntax could be some kind of out-of-band file (JSON or something) that associates whole files and individual tests with constructs and/or proposals. That could be less work than annotating each module, but it would be less robust to e.g. insertion of new tests, and easy to forget to update.
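To make the out-of-band idea concrete, here is a sketch of what such a manifest and its consumer might look like. The JSON shape, file names, and proposal labels are purely illustrative, not an agreed format:

```python
import json

# Hypothetical manifest: a JSON file mapping each .wast file to the
# proposals it exercises (file names and labels are invented examples).
MANIFEST = """{
  "memory_grow.wast": [],
  "multi-memory/load.wast": ["multi-memory"],
  "memory64/table_fill.wast": ["memory64", "bulk-memory"]
}"""

def runnable(manifest: dict, supported: set[str]) -> list[str]:
    # A file is runnable only if every proposal it needs is supported.
    return sorted(f for f, needs in manifest.items() if set(needs) <= supported)

files = runnable(json.loads(MANIFEST), {"multi-memory"})
print(files)  # baseline tests plus multi-memory, but not the memory64 file
```

The same mapping could in principle carry per-test (rather than per-file) entries, which is where the "easy to forget to update" concern really bites.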

So all that is to say, I'd be in favor of reorganizing and annotating the tests by proposal, and willing to help out. Having a directory for each proposal would be nice, but if there's a way to annotate each module, that seems fine to me too.

Some specific thoughts:

Multi-memory seems easy enough to separate (it is relatively recent, still missing from many implementations, and other tests and proposals tend not to depend on it), so I think we should just separate it, even if it's within the same file as single-memory tests.

Likewise for memory64, actually.

Multi-value seems more fundamental. For the cases where one proposal actually depends on another, then it's obviously necessary for the tests to do so too, but ideally there wouldn't be tests that use another proposal where there's no actual dependence. I'm willing to be pragmatic on this though and if there are lots of unrelated tests that use multi-value I would be ok with keeping it as fundamental.

If we're going way back to the beginning, mutable globals is another one that's very fundamental.

For the NxM cases, maybe it would be sufficient to add multiple @if-supports annotations to a module that tests the intersection of 2 proposals (and maybe put it in its own file and/or in a common directory or something)? For cases where there is a true dependence between proposals, I don't think it's really necessary to do anything (e.g. no need to annotate all of the GC tests with @if-supports "reference-types").

dschuff avatar Jul 24 '25 21:07 dschuff

Thought I'd add my two cents as chasm is also butting up against this problem

I would like to add wasm 3.0 to the test suites I test against, but chasm doesn't support memory64 (in the execution phase). The memory64 proposal has updated legacy test scripts like table_file.wast and added table64 tests rather than creating say table64_fill.wast which largely has been the convention with proposals in the past. This makes it difficult for testsuite automation: ideally I would just exclude a glob of files **64**, whereas now I'd have to build tooling around per-assertion exclusion and hope all the tests have a common string I could ignore. Would it be possible (as a short-term solution) to pull these tests out into files with table64 in the name, or a separate dir?

I also like the idea of adding more context to wast syntax (@if-supports "multi-memory"), this would be an ideal long term solution 👍🏼

CharlieTap avatar Aug 09 '25 19:08 CharlieTap

The memory64 proposal has updated legacy test scripts like table_file.wast and added table64 tests rather than creating say table64_fill.wast which largely has been the convention with proposals in the past.

I'm not sure about the "largely has been the convention with proposals in the past" part. When proposals are cross-cutting (as memory64 is), isn't it more common to update the existing tests than to make copies (i.e. duplicate tests) and add a suffix?

If we look at the list of finished proposals at https://github.com/WebAssembly/proposals/blob/main/finished-proposals.md, I can't find very many that have their own test file suffixes, i.e. I don't see any .._non_trapping.wat or _extended_const.wat files in the test directory.

sbc100 avatar Aug 11 '25 14:08 sbc100

Yeah, that's fair. I guess the proposals I've implemented most recently are front of mind, and those (like gc) have featured that separation.

CharlieTap avatar Aug 15 '25 18:08 CharlieTap

I took a shot at separating the memory instructions, because not all implementations support memory64 and/or multi-memory yet (even major ones), and I could easily imagine use cases where an implementation wouldn't want one or both of those at all. It's not perfect yet (bulk memory isn't totally separated... although I myself wouldn't mind considering that fundamental, since the use cases I can think of that wouldn't want the other 2 could still use bulk memory).

Maybe for NxM cases we could both take advantage of the order of integration of the proposals and use an @if-supports attribute. In other words, each new proposal would include tests that cover its interaction with previous features, guarded by @if-supports as appropriate. Imposing an ordering (even if it's arbitrary) would provide a canonical test location/directory for tests that include each new feature F as it goes in, along with combinations with existing features. Then as following proposals come in, those would include tests for F as well, even if the new features don't strictly depend on F. This could avoid an explosion of separate files for each NxM combination, or stuffing everything in a common directory.

For cases like a new value type though, it still wouldn't be great (e.g. you'd have to duplicate tests for existing constructs into a new file to apply them to the new type). But maybe that's still better than having a new proposal just add new test cases (which would in any case likely be duplicates) to an arbitrary number of existing test files?

dschuff avatar Aug 23 '25 00:08 dschuff

For an MxN situation, I figure you have to either do one file per M or one file per N, so either way a test author will be frequently jumping from file to file. (Or put everything in a single file.) I don't think it's that big a deal, and I like what you're suggesting so far.

bvisness avatar Aug 23 '25 01:08 bvisness

#1964 is (I think) now complete for splitting bulk-memory, memory64, and multi-memory, which are related enough that it seems like a reasonable test. The PR is fairly straightforward, although it does touch a lot of files. In this case, I put bulk memory "lower" on the dependency chain, so even though e.g. memory64 doesn't strictly depend on bulk memory, the tests that cover both are just in the memory64 directory. That decision was based partly on intuition about how "fundamental" the feature is, and partly on the idea that even an implementation that might not want memory64 or multi-memory would likely still want bulk memory.

In some cases it made sense to just split test files apart, and in some cases it made sense to build on tests (e.g. the generated bulk-memory tests have a "base" version and then a memory64 version that is a superset of the base version), but it's not too much duplication. And there were a few cases of existing duplication, which I deleted.

Also, I discovered that we don't have any tests that cover the intersection of multi-memory and memory64, so it seems like that would make sense to add. Those two proposals in particular don't have any true dependency between them, so the test could go in either one of the directories.

dschuff avatar Aug 28 '25 20:08 dschuff

Sounds okay as a low-key solution. To be honest, anything more formal (like annotating tests with "features") would only make sense to me if we adopted respective profiles that properly spec out the precise feature subsets.

rossberg avatar Sep 02 '25 09:09 rossberg

I totally agree that from a purely spec-focused perspective, the order of feature standardization and the order of implementation (in engines that do intend to support them) are somewhat arbitrary. But because the proposal is in practice the granularity of implementation, and features are implemented slowly over time, there is still real value, both for implementers and for developers targeting them, in having the test suite reflect that granularity to as great an extent as is reasonable (while leaving some room to debate how much is reasonable).

For embeddings that don't intend to support all features, it gets murkier. We should probably start up the profiles discussion again; I suspect that there will be some differences in opinion about, for example, how many different subsets there should be.

For now I'll get #1964 landed and start looking at where else it would be practical to move existing tests around.

dschuff avatar Sep 08 '25 20:09 dschuff

@dschuff Thank you for the reorg! Whilst migrating my runtime to the latest testsuite, I noticed some stragglers that still mix in 64-bit address spaces:

  • Failure(command=ModuleCommand(filename=binary-leb128.73.wasm, line=881, name=null, type=Binary, binaryFilename=null), reason=Failed to instantiate module: ExecutionError(error=UnsupportedMemory64Module))
  • Failure(command=ModuleCommand(filename=table.23.wasm, line=50, name=null, type=Binary, binaryFilename=null), reason=Failed to instantiate module: ExecutionError(error=UnsupportedMemory64Module))
  • Failure(command=ModuleCommand(filename=table_copy_mixed.0.wasm, line=2, name=null, type=Binary, binaryFilename=null), reason=Failed to instantiate module: ExecutionError(error=UnsupportedMemory64Module))
  • Failure(command=ModuleCommand(filename=table_grow.0.wasm, line=1, name=null, type=Binary, binaryFilename=null), reason=Failed to instantiate module: ExecutionError(error=UnsupportedMemory64Module))

I'm using the following pattern to exclude memory64 wast test scripts: "**/*64.wast". It would be good if we could continue the "64" suffix convention, as it makes things super simple! There was one memory file that didn't match this convention, 'memory64-imports.wast', but it's not the end of the world as I've just written another condition for that.
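For illustration, the exclusion described above can be sketched with Python's fnmatch standing in for whatever glob engine a given harness uses; the extra patterns (a memory64/ directory entry and the one off-convention file) are the workarounds a harness currently needs:

```python
from fnmatch import fnmatch

# Sketch of the memory64 exclusion discussed above. fnmatch's "*" matches
# across "/", so "*64.wast" catches 64-suffixed files at any depth; the
# other two entries cover the memory64/ directory and the one file that
# breaks the "64" suffix convention.
EXCLUDE = ["*64.wast", "memory64/*", "memory64-imports.wast"]

def included(path: str) -> bool:
    # A test script is run only if it matches no exclusion pattern.
    return not any(fnmatch(path, pat) for pat in EXCLUDE)
```

If the suffix convention held everywhere (or all memory64 tests lived in one directory), the list would collapse to a single pattern, which is the simplicity being asked for here.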

also happy to lend a hand if you're busy!

CharlieTap avatar Sep 18 '25 11:09 CharlieTap

Thanks for flagging this! I probably won't be able to get to this myself until next week, but I should be able to review if you want to send a PR (on the main branch now 🎉 ). Probably for memory64 now it's best to filter on tests in the memory64 directory rather than with a particular name suffix. I don't necessarily mind renaming the one file that doesn't match the rest, though.

dschuff avatar Sep 18 '25 16:09 dschuff