binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

Non-UTF string builtins constants?

Open kripken opened this issue 7 months ago • 9 comments

Discussing with @tlively , we thought it would be good to take the string builtins feature part of #7601 and land that, with the idea that the feature would add validation of string constants: non-utf8 ones would error.

I have code for that in sb.onlyfeat in my fork, however, I see that we can't parse such non-utf8 constants anyhow, e.g.

(module
  (import "\'" "unpaired high surrogate \ED\A0\80 " (global $bad (ref extern)))
)
$bin/wasm-opt test/lit/validation/string-builtins.wast 
Fatal: test/lit/validation/string-builtins.wast:2:52: error: expected import name

So I think that would need to be fixed first? (meanwhile, no one is at risk of trying to use such code, at least...)

kripken avatar May 27 '25 22:05 kripken

Ah, I had in mind that we would validate string.const expressions, which should be parsable with invalid unicode strings.

tlively avatar May 27 '25 22:05 tlively

I see, so to validate before lowering instead of after...

But now string.const may validate with --enable-strings but fail if the user adds --enable-string-builtins - that is, doesn't this break the rule that enabling features enables more things?

kripken avatar May 28 '25 18:05 kripken

The strings feature should make invalid unicode constants valid whether or not string-builtins is also enabled.

tlively avatar May 28 '25 18:05 tlively

I see. And either of the two features would enable string.const?

That seems possible, though two features for one IR construct is also odd. I suppose strings is just in an odd position atm, given the spec status, so no solution really feels natural.

I lean towards not changing anything here for now, and waiting until we get concrete user feedback on things they'd like improved.

kripken avatar May 29 '25 17:05 kripken

I see. And either of the two features would enable string.const?

Right. This reflects reality strictly better than the status quo, so I don't see why we wouldn't want to do it.

tlively avatar May 29 '25 18:05 tlively

Not sure what you mean by "reality" here, sorry?

I would lean to strings being the only thing that validates string consts. Then conceptually lifting is "string builtins => stringref" and lowering is the reverse, with a clean separation of features. That reflects that the two proposals are separate.

(But, again, I prefer to wait for a concrete user need here.)

kripken avatar May 29 '25 18:05 kripken

Btw, related to string opts, https://github.com/WebAssembly/binaryen/pull/7631 adds links from wasm-opt's help to the wiki, and I updated the wiki to mention string opts,

https://github.com/WebAssembly/binaryen/wiki/GC-Optimization-Guidebook#strings

If that current process is burdensome for users I am very open to changing things, but hopefully for now it is ok, and discoverable.

kripken avatar May 29 '25 18:05 kripken

Not sure what you mean by "reality" here, sorry?

I should have been more specific :) In browsers, string builtins is a different feature than stringref, so modeling them as separate features in Binaryen is more precise. When a browser supports stringref, it has no problem with non-unicode strings, but when it only supports string builtins, it cannot directly represent non-unicode strings. Reflecting that difference in our validation is more precise, and therefore more useful both to users and to us, to make sure our optimizations do not incorrectly introduce non-unicode strings. Since these changes exactly match how these features work in browsers, I don't see why we wouldn't want to implement them.

tlively avatar Jun 02 '25 11:06 tlively

In browsers, string builtins is a different feature than stringref, so modeling them as separate features in Binaryen is more precise.

But you are also saying

either of the two features would enable string.const

(that is a quote of myself, but you confirmed it summarized your position iianm).

If the features are separate, having that overlap seems quite odd. And specifically, string.const is not part of string builtins.

make sure our optimizations do not incorrectly introduce non-unicode strings.

We do have this now, in the string-lowering-asserts pass? (also the non-asserts pass will emit only unicode imports, and leave the others for the string section, which will then not validate if the user does not integrate with, so there is no risk of shipping bad strings afaict)

kripken avatar Jun 03 '25 18:06 kripken