wabt icon indicating copy to clipboard operation
wabt copied to clipboard

`--no-check` produces malformed module

Open Janrupf opened this issue 4 months ago • 4 comments

This is effectively moved from https://github.com/WebAssembly/testsuite/issues/132 to here.

In short, compiling the following module with wat2wasm and --no-check:

(module 
  (func (export "test") 
    (data.drop 0))) 

leads to a WASM module that contains no data count section. According to the official specification, if a module uses data.drop or memory.init, it needs to contain a data count section.

Arguably this entire situation is bogus simply because we need to compile with --no-check for this to even compile. Interestingly enough though this seems to be only situation where --no-check leads a malformed and not an invalid module. In this case that causes some confusion when strictly interpreting the testsuite test cases w.r.t. if a module should fail to parse or fail to validate.

Janrupf avatar Aug 31 '25 15:08 Janrupf

Short answer: WABT's wasm2wat and wat2wasm generally enforce validitiy (they both assemble an IR in RAM, attempt to validate it, and fail the whole operation if validation fails). The --no-check is basically a "yolo" mode of operation -- I don't think there's much of a promise about the output. If you want the better behavior on invalid modules, you can use wasm-tools's wasm2wat/wat2wasm/wast2json.

Long answer: I agree that (as you originally suggested at https://github.com/WebAssembly/testsuite/issues/132) wast2json is behaving incorrectly on this example: the written-out module should have a DataCount section and be well-formed. WABT's BinaryWriter is handling this incorrectly because it only writes a DataCount section if the module has at least one data segment (which is the correct condition if the module is valid). The reason these kinds of test pass in the testsuite is that although WABT's testsuite is careful to make sure that assert_malformed tests are detected only by the reader (without requiring validation), it's not careful in the opposite direction: it doesn't require that assert_invalid test cases first parse as well-formed. So you're right that wast2json is writing out a malformed module here and yet spectest-interp is passing it as an assert_invalid test, and that's not pedantically correct.

wasm-tools is better about this; wasm-tools's wasm2wat and wat2wasm work on all well-formed modules (valid or invalid) and don't validate. Like WABT, the wasm-tools testsuite makes sure that assert_malformed tests are detected only by the reader (without requiring validation), but wasm-tools is also checked to make sure that every well-formed test case (valid or not) successfully roundtrips through the writer and reader, and it's also careful to make sure that assert_invalid tests first parse as well-formed. wasm-tools's wast2json tool doesn't have the same problem; the module gets written out with a DataCount section.

On the other hand, WABT's parser is the only one of the two that detects this particular kind of malformedness as malformedness. wasm-tools relaxes this requirement and catches it at validation-time instead (see https://github.com/bytecodealliance/wasm-tools/blob/main/src/bin/wasm-tools/wast.rs#L210-L254 and https://github.com/bytecodealliance/wasm-tools/pull/2134 for the discussion). So I'm not sure there's anything testing that wasm-tools's wast2json actually writes well-formed output for this assert_invalid test either.

So my summary would be:

  • yes, there is a bug in WABT's wast2json that comes from an issue in the binary writer when writing out invalid modules (modules that don't have any data segments but do have a
  • if you just want the correct behavior, you can use wasm-tools's wast2json
  • if you want to fix the bug in WABT, I think we have to modify the IR so that has_data_segment_instruction is tracked at read-time and stored in the IR (because right now it's only the binary writer that tracks has_data_segment_instruction_ and it only learns the answer after writing the code section, which comes too late)
  • if you want to fix the bug in WABT and actually test for the pedantically correct behavior, we have to modify spectest-interp to confirm that AssertInvalid tests first parse as well-formed
  • in practice, most pracitioners generally consider this (not just invalid vs. malformed, but the binary DataCount well-formedness requirement specifically) to be among the most ignorable parts of the Wasm spec. As you see both in WABT and wasm-tools, there seems to be a sense that a little mild coloring outside the lines is tolerable or at least that DataCount is a bit unfortunate. For the major engines (or anybody not particularly interested in generating, parsing, or roundtripping through the text format), I think there's even less attention to this area because I don't think those testsuites feel a need to differentiate between malformed and invalid at all.

keithw avatar Aug 31 '25 17:08 keithw

Thanks for the detailed response, really appreciate it!

To be frank, I was quite confused why the spec handles it this way to begin with. At least on paper there is nothing that would prevent the module from parsing on a binary level, thus I would have expected this to be a validation stage step either way. But oh well. I guess I was just really pedantic when implementing the test harness for my engine

Janrupf avatar Sep 01 '25 21:09 Janrupf

The basic reason is to permit single-pass validation of the binary format. The DataCount section was added by the bulk-memory-operations proposal; it comes before the code section, which comes before the data section. Without knowing the DataCount, a validator can't validate a dataidx in an instruction in the code section.

It has to be a requirement of the binary format syntax (not of validation) because there's no representation of the DataCount in the abstract syntax (beyond... the count of data segments), and validation is specified to operate on the abstract syntax.

keithw avatar Sep 02 '25 05:09 keithw

What @keithw said. The data count section is a detail of the binary format with no equivalent in the text format. As such, it cannot be represented in the AST or handled during validation. The data count section is to the data section what the function section is to the code section. In both cases, their numbers are matched up as part of decoding.

The difference between malformed and invalid doesn't really matter for most implementations, though. Only tools that want to round-trip between binary and text format or otherwise handle unvalidated code, need to care about the distinction.

rossberg avatar Sep 02 '25 06:09 rossberg