Tuples and unions have invalid names, but are the restrictions on names really justified?
Per the explainer, words may only contain ASCII letters and digits, and must start with a letter. Also per the explainer, tuples map to records whose field names are unprefixed integers, and similarly for unions. So that's just inconsistent. But more broadly, are these restrictions a good idea?
I may have missed something, but the only stated justification for them I can find is that binding generators almost always need to re-case the words, and "[t]he highly-restricted character set ensures that capitalization is trivial and does not require consulting Unicode tables." I think it's highly unlikely that bindings generators will be running on platforms which are not capable enough to consult Unicode tables, or need to re-case enough distinct labels that the overhead of doing so becomes significant. (Most re-casings are trivial to memoize at the word level, at least if you also note whether the word in question is the first in the identifier or not. Some don't even require that information.) The only scenario I can think of in which bindings generators need to run on (moderately) low-powered devices is for JavaScript on mobile, but Unicode tables are already required there for the Internationalization API.
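To illustrate what I mean about memoizing at the word level, here's a minimal sketch in Rust (the names and the PascalCase target are purely illustrative): the Unicode-aware case mapping from the standard library runs at most once per distinct word, and every repeat is a hash lookup.

```rust
use std::collections::HashMap;

struct Recaser {
    // word -> PascalCase form; a real generator might also key on whether
    // the word is first in the identifier (e.g. for camelCase targets).
    cache: HashMap<String, String>,
}

impl Recaser {
    fn new() -> Self {
        Recaser { cache: HashMap::new() }
    }

    // Re-case one word for a PascalCase target, consulting the standard
    // library's Unicode case mappings (no ICU) at most once per distinct word.
    fn pascal_word(&mut self, word: &str) -> String {
        self.cache
            .entry(word.to_string())
            .or_insert_with(|| {
                let mut chars = word.chars();
                match chars.next() {
                    Some(first) => first
                        .to_uppercase()
                        .chain(chars.flat_map(char::to_lowercase))
                        .collect(),
                    None => String::new(),
                }
            })
            .clone()
    }
}

fn main() {
    let mut r = Recaser::new();
    assert_eq!(r.pascal_word("identifier"), "Identifier");
    assert_eq!(r.pascal_word("straße"), "Straße"); // non-ASCII words work too
    assert_eq!(r.pascal_word("identifier"), "Identifier"); // served from the cache
}
```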
There is also, of course, the fact that many programming languages do not support non-ASCII identifiers. But many do—JavaScript, of course, but also Rust, Python, Java, C#... There are active efforts to run all of these languages in WebAssembly. And while non-ASCII identifiers are perhaps not yet in common use due to the dominance of English in computing, enforcing that dominance on a technical level is, frankly, wrong. (Or, perhaps, they're more common than I'm aware, as a native English speaker?)
If these restrictions are maintained into release, there is a chance that bindings generators will come up with a separate, likely informal convention for expressing incompatible identifiers, as well.
This argument extends to the requirement that words start with a letter—almost any language permits digits after an underscore, and Rust uses bare integers for tuple fields, which I bet was the inspiration for the CM doing the same. Dealing with these issues is probably less important (modulo the inconsistency around tuples and unions), but they're close enough to the point of this issue that I think it's worth raising them at the same time.
The mapping from tuples and unions into records and variants with integer names should probably be clarified to be suggestive and not a literal desugaring. E.g., because a string and list<char> will have very different language bindings in most languages, I think the two must ultimately be considered completely distinct types (in terms of type checking). Thus, a tuple would not literally turn into a record with integer-valued field names, it would simply be a tuple.
The broader question you ask is a good one, but the two issues you identified do seem like potentially serious problems whose extent is hard to judge at this point. Here and in other cases in the component model, we need to take the intersection of a bunch of different language and tooling concerns. This also seems like a situation where we can start conservatively in the MVP (where there are already many other problems to consider) and relax the validation rules later based on more experience and consideration.
I'll add, though, that I'm just not that familiar with what all the major programming languages accept as identifiers these days; maybe the situation is better than I'm assuming? I'd definitely be interested to see a survey of what the state of identifier character sets is and also what, e.g., other IDLs are doing.
Some other questions that would be super-helpful to answer:
- what's the right bigger-than-ASCII Unicode Set/Category that matches the largest number of the languages that do support non-ASCII identifiers?
- does an "all lower case" / "all upper case" distinction still make sense or should it be "all upper case" / "does not contain any upper case" or something else based on Unicode semantics?
- is there by chance a good character set choice for which smaller libraries (than ICU) would suffice to validate a component?
Ah, if the mappings of specialized value types to fundamental ones are meant to be illustrative, that's a different matter.
As for what languages support what identifiers? The short version is that there's a Unicode standard for identifier characters, but nobody except C++ follows it unmodified. Also, many languages have special semantics that depend on names, whether by convention or enforced. And then there are languages with overlapping namespaces to consider. Here's the long version.
Personally, I think the best option, at least on the standards end, is to drop name mangling altogether. Instead of having restrictions on names, just say anything goes, and leave the generators to sort out whether and how they want to transform names. (It's worth noting that they would need to do so anyway, because every language has a different set of keywords, and some have semantics on capitalization, etc.)
This would have the disadvantage that imported names wouldn't necessarily line up aesthetically with idiomatic local names, but I think there is value in things having the same name everywhere, even if that name isn't idiomatic for the language you're using it from. It makes it easier to search documentation, for one thing.
It would also mean a new way to distinguish functions from instance methods from static methods from constructors would be necessary, as the current system relies on name mangling for that. If nothing else, I think it would be better (both more elegant and more practical) to define that kind of structure in a more, well, structured manner.
However, to answer your questions more directly:
"what's the right bigger-than-ASCII Unicode Set/Category that matches the most number of these languages that do support non-ASCII identifiers?" – XID_Start and XID_Continue are probably the best approximation, but again, no language other than C++ uses them without modification.
"does an "all lower case" / "all upper case" distinction still make sense or should it be "all upper case" / "does not contain any upper case" or something else based on Unicode semantics?" – Unicode has five general categories grouped under "Letter"; namely, "Letter, uppercase", "Letter, lowercase", "Letter, titlecase" (for ligatures), "Letter, modifier", and "Letter, other" (for scripts without case). The most rigid distinction is probably whether or not a word contains any uppercase or titlecase characters, but if you want to continue the current restriction that all characters in a word must have the same case, you could ban titlecase letters and count uncased letters as a third, neutral case. There are probably other reasonable systems.
"is there by chance a good character set choice for which smaller libraries (than ICU) would suffice to validate a component?" – Probably not, but I can't imagine a scenario in which component validation needs to happen on a device that wouldn't have ICU already and where resources are too constrained for adding it to be practical. Not to mention, there are probably libraries other than ICU which could be used, or you could write a specialized implementation that consumes exactly the relevant Unicode tables directly.
Thanks for doing all that work to produce that great survey! It's hard to draw any black-and-white conclusions given the mix of plenty of languages that do support a wide range of Unicode characters vs. some key languages that don't, including, notably, Protobuf (which is the most similar in nature to components w.r.t. bindings generation). How to evaluate these tradeoffs is tricky and probably deserves more discussion.
I do think that the currently-proposed casing and static/method transformations will go a long way to making Wit APIs feel nice to use out of the box (without anyone needing to manually write wrapping glue code for each language). Moreover, if casing is totally free-form, then what we'd surely see in practice is each component using its own language's native casing scheme, leading to a rather incongruous experience when you're reusing components (and probably some unnecessary flamewars as well-meaning folks attempt to establish uniformity). So I think we shouldn't give up on the current scheme.
To be clear, I don't disagree with any of that. With respect to casing, what I think is that, given the diversity of naming conventions in use in the real world, it would be better to say importers should handle it on a best-effort basis than to say that exporters must handle it precisely according to some preset rule that tries to accommodate every language, as such a rule will inevitably be a lowest common denominator.
One thing I'm concerned about is that if we don't acknowledge that importers have to detect invalid names (which is true either way), they probably won't notice it themselves, and then they'll silently generate syntactically invalid bindings when someone tries to import a function named `box` into Rust, for example. But if the specification does instruct importers to check every name for validity, it then becomes a little weird to also require exporters to preprocess names.
As for the static/method/constructor transformations, I'm certainly not proposing to drop them. What I'm saying is that instead of encoding whether a function is in fact part of a type as part of the function's name, we just attach them to the type itself directly.
I think you're right that, inevitably, bindings generators are going to have to handle things like keyword collisions in a regular way. (I'm hoping naming collisions with global scopes can be addressed by namespacing of bindings.) I think our most effective way to encourage bindings generators to do this is to define a very precise input domain of component-level names so that the bindings generator can do an exhaustive case analysis and think holistically about the problem. If instead we leave names wide open, then I expect most bindings generators, out of rational pragmatism, will end up taking an ad hoc as-needed approach (just work out the cases they see in practice) which I expect would lead to net worse support for components that step out of the well-trodden path.
I'm afraid I don't follow. All languages in common use only have ASCII in keywords, so how does forbidding non-ASCII identifiers help bindings generator authors handle keyword collisions? And if you want to "encourage" them to do things properly, surely it would be better to explicitly document best practices and perhaps include a test suite that covers common keywords, rather than leaving them to notice the problem on their own.
I'm referring to the case conversion when talking about defining a precise input domain, not just keyword collisions.
Ah, sorry. I took "this" (in "encourage bindings generators to do this") to mean "handle keyword collisions" for some reason. A few additional thoughts:
- I would still say explicitly documenting best practices for bindings generator authors to implement would be better than artificially forcing the problem to be both easier and obligatory to deal with, especially in a way that unjustly disadvantages real people.
- The bindings generation stage is where we have the most information about the naming conventions and requirements of the target language specifically, and about the conventions of the source language and intent of the author (if we make names freeform, and barring very unusual naming schemes like Julia's), so it makes sense to do as much name processing as possible there.
- Much of this conversation is predicated on my assumption (which I hadn't noticed until just now) that the component model intends to support generating Wit from other languages' source code. I think that's a worthwhile goal—it'll make bringing existing libraries to the CM much faster, for one—but if the intent is for the Wit to always be handwritten, then the casing part of the question might be safe to ignore, although issues remain with non-English characters and with name mangling. (I think this assumption came from an interop framework concept I was playing with before I came across CM, in which the canonical description of an API is in the source language, not the IDL.)
That said, here's a rough idea of what those best practices might look like:
- Capitalize according to your language's conventions. [if kebab-case is required] Most Wit identifiers contain hyphen-dashes, which most languages don't allow in identifiers. This is intentional; you should adjust identifiers in your bindings to match your language's conventions. For example, `this-identifier` in Wit might become `this_identifier` in Python, `ThisIdentifier` in C#, and so on. [if casing is freeform] Different languages have different rules and patterns for letter case in identifiers. You should detect word boundaries in identifiers and change the casing to match your language's conventions. Look for underscores between words, as well as capitals on the first letter of each word (possibly except the first word), and for strings of capital letters (indicating acronyms). Don't forget that many writing systems don't have case, [if we allow titlecase characters] and a few characters are both uppercase and lowercase. (A sketch of this and the keyword bullet follows the list.)
- Check your language's identifier syntax. Many languages forbid characters which Wit accepts, such as those from languages other than English, or some punctuation. If you come across a name your language can't express, you should report the issue to the user, and prompt for another name to use instead.
- Check for identifiers that are keywords in your language. Every language has a different set of keywords, so eventually you'll see something whose original name is one of yours. If this happens, you should prompt the user for what to do next. Some languages have a way to escape identifiers so they don't get confused with keywords; if available, this should be an option. However, many developers aren't familiar with those features even when they exist. You should also allow the user to choose a different name for the imported item.
- Check for duplicate identifiers. Even though Wit doesn't allow duplicates, some names which were distinct in Wit might become the same after processing for your language. If this happens, prompt the user on which item(s) to rename (or perhaps skip), and what name(s) to use instead.
- Allow humans to replace any name if they need to. Inevitably, you'll miss something. Be prepared. And if the user does override any names, make sure to record those overrides somewhere you can find them again, so that if the Wit gets updated, the same bindings can be regenerated without additional human work (unless of course there are new incompatibilities).
- If a name changed, record the original name in the documentation for the binding. This helps humans find the original documentation. It's easy enough to notice that what you're calling `this_identifier` might originally have been `ThisIdentifier` [adjust example if casing is not freeform], but it might not be obvious that what you're calling `kind` was originally `class`.
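As a sketch of the capitalization and keyword bullets, assuming kebab-case Wit names and Rust as the target language (the helper names are made up and the keyword list is abbreviated):

```rust
fn to_snake(name: &str) -> String {
    // `this-identifier` -> `this_identifier`
    name.replace('-', "_")
}

fn to_pascal(name: &str) -> String {
    // `this-identifier` -> `ThisIdentifier`
    name.split('-')
        .map(|word| {
            let mut chars = word.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect()
}

// Rust can escape keyword collisions with raw identifiers, so a Wit function
// named `box` can become `r#box` instead of requiring a rename.
fn escape_keyword(ident: String) -> String {
    const KEYWORDS: &[&str] = &["box", "impl", "loop", "move", "type"]; // abbreviated
    if KEYWORDS.contains(&ident.as_str()) {
        format!("r#{ident}")
    } else {
        ident
    }
}

fn main() {
    assert_eq!(to_snake("this-identifier"), "this_identifier");
    assert_eq!(to_pascal("this-identifier"), "ThisIdentifier");
    assert_eq!(escape_keyword(to_snake("box")), "r#box");
}
```

A language without an escape hatch like raw identifiers would have to fall back on the prompt-or-rename path instead.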
I think documenting best practices for bindings generators is a great idea. Having a precisely-defined grammar for Wit supports this by allowing the documentation to enumerate all the relevant/interesting cases and point out gotchas. To be clear, I'm not saying that having a precisely-defined character set implies ASCII -- it could mean XID_Start/XID_Continue -- just that it requires spending some time to work through the options and understand what all the implications are for all the places that Wit identifiers can appear.
but if the intent is for the Wit to always be handwritten
While handwritten Wit is currently the primary workflow (e.g., written as part of WASI or platform-specific SDKs), folks have definitely talked about cases in which Wit could be inferred from (possibly annotated) source code. This would imply performing the casing in reverse (e.g., turning `this_identifier` in Python into `this-identifier`) and seems doable too (not for every possible source-code identifier, but for a meaningful subset of them that could also be precisely defined).
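To illustrate, a minimal sketch of that reverse mapping (Rust purely for illustration; the exact subset of accepted source identifiers would itself need to be precisely defined, and runs of capitals/acronyms aren't special-cased here):

```rust
// Map a snake_case or PascalCase/camelCase source identifier to a kebab-case
// name; anything outside that subset is rejected rather than guessed at.
fn to_kebab(ident: &str) -> Option<String> {
    if ident.is_empty() || ident.chars().any(|c| !c.is_ascii_alphanumeric() && c != '_') {
        return None;
    }
    let mut out = String::new();
    for (i, c) in ident.chars().enumerate() {
        if (c == '_' || (c.is_ascii_uppercase() && i != 0)) && !out.ends_with('-') {
            out.push('-'); // word boundary
        }
        if c != '_' {
            out.push(c.to_ascii_lowercase());
        }
    }
    (!out.starts_with('-') && !out.ends_with('-')).then_some(out)
}

fn main() {
    assert_eq!(to_kebab("this_identifier").as_deref(), Some("this-identifier"));
    assert_eq!(to_kebab("ThisIdentifier").as_deref(), Some("this-identifier"));
    assert_eq!(to_kebab("_private"), None); // outside the subset this handles
}
```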
Check your language's identifier syntax. Many languages forbid characters which Wit accepts, such as those from languages other than English, or some punctuation. If you come across a name your language can't express, you should report the issue to the user, and prompt for another name to use instead.
On this topic, I would think that when this case is hit (which happens automatically as part of a build process), the user wouldn't want an error or to have to take any action but, rather, the bindings generator would have to use some escaping scheme to automatically map into the valid identifier space. I like your idea about documenting such cases somewhere the user can find them, though.
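For example, one entirely hypothetical escaping scheme (not something that's specified anywhere today) that maps any name deterministically into the ASCII identifier space:

```rust
// Replace anything an ASCII-only target can't accept with a `_uXXXX_` escape
// of the code point, so bindings generation never has to stop and ask.
fn escape_for_ascii_target(name: &str) -> String {
    let mut out = String::new();
    for c in name.chars() {
        if c.is_ascii_alphanumeric() {
            out.push(c);
        } else if c == '-' {
            out.push('_');
        } else {
            out.push_str(&format!("_u{:04x}_", c as u32)); // e.g. 'é' -> "_u00e9_"
        }
    }
    out
}

fn main() {
    assert_eq!(escape_for_ascii_target("read-file"), "read_file");
    assert_eq!(escape_for_ascii_target("café-menü"), "caf_u00e9__men_u00fc_");
}
```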
I think "precisely defined" is the wrong phrase for this situation. We are not starting from nothing and creating a character set from scratch, we're using Unicode, which is already very precisely defined, and debating the limits to put on our subset of it. As such, I think we should be starting from "everything goes" and tightening the rules, rather than starting from "nothing goes" and loosening the rules. It's a subtle difference, though.
...by allowing the documentation to enumerate all the relevant/interesting cases and point out gotchas.
We will never be able to enumerate "all" the relevant/interesting cases, because every target language is different and has different rules. What we can do is list common categories of pain points, such as keywords, punctuation, and casing style. Of course it's also impossible to complete that list, for the same reason, but we will do much better saying "watch out for keywords" than "watch out for `class`" specifically, and having a limited syntax for identifiers doesn't especially help with the former, only the latter.
I would think that when this case is hit (which happens automatically as part of a build process), the user wouldn't want an error or to have to take any action but, rather, the bindings generator would have to use some escaping scheme to automatically map into the valid identifier space.
I'm envisioning a system where developers would run the bindings generator either manually or as part of a build process directed by humans, such that they would be there to deal with any issues that arise. Then, as they do so, their decisions are recorded both in the documentation and in a machine-readable format which the bindings generator can later reuse. This is what I meant by "make sure to record those overrides somewhere you can find them again, so that if the Wit gets updated, the same bindings can be regenerated without additional human work (unless of course there are new incompatibilities)." Later, in more automated contexts like CI where prompting a human for input isn't possible, the bindings generator could be run with a flag that turns such issues into hard errors if they've not already been dealt with.
That said, a regular escaping scheme would be a valid option for a bindings generator to include, to encourage a standard way to handle such cases without requiring it in an absolute sense. It would also be valid for a bindings generator to offer an "always take the default" setting. I'm all for flexibility here.
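Roughly, the override mechanism I have in mind looks like this (a sketch: the struct and flag names are invented, and serializing the map to an actual file is elided):

```rust
use std::collections::BTreeMap;

struct NameOverrides {
    // original Wit name -> name chosen for the target language; this is the
    // part that would be persisted alongside the generated bindings.
    map: BTreeMap<String, String>,
}

impl NameOverrides {
    // Resolve a problematic name: reuse a recorded decision if there is one,
    // otherwise prompt (interactive runs) or fail (strict/CI runs).
    fn resolve(&mut self, wit_name: &str, strict: bool) -> Result<String, String> {
        if let Some(chosen) = self.map.get(wit_name) {
            return Ok(chosen.clone());
        }
        if strict {
            return Err(format!("no recorded override for `{wit_name}`"));
        }
        let chosen = prompt_user(wit_name);
        self.map.insert(wit_name.to_string(), chosen.clone());
        Ok(chosen)
    }
}

fn prompt_user(wit_name: &str) -> String {
    // Stand-in for an interactive prompt (or an "always take the default" policy).
    format!("{}_renamed", wit_name.replace('-', "_"))
}

fn main() {
    let mut overrides = NameOverrides { map: BTreeMap::new() };
    let first = overrides.resolve("box", false).unwrap(); // interactive run records a choice
    assert_eq!(overrides.resolve("box", true).unwrap(), first); // CI run replays it
}
```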
I agree that "a list of USVs" is a much better starting point than anything more open-ended. The point I was getting at is that if we subset "list of USVs" down to either the current set or, say, the XID_Start/XID_Continue set, then we are better able to provide a list of meaningfully-different cases for the characters (e.g. { lower-case-chars, upper-case-chars, no-case-chars, numbers, ??? }) and, using this, give general language recommendations for all the automatic kebab-name-to-source-language-name bindings, so that a random component C that doesn't know anything about language L has a good chance of working when consumed via automatic bindings generated for language L, even if C is using names off the beaten path (this being a special case of the general "composability" goal of components).
As for automatic-vs-manual: while there should be nothing stopping anyone from building a manual/interactive name-mapper, my understanding is that the vast majority of consumers of components or Wit interfaces will be using an automatic bindings generator. See, e.g., the developer workflows supported by cargo component and jco based on wit-bindgen.
I'm not sure what you mean by "anything more open-ended"; I was talking about more restrictive starting points, not less restrictive ones. In any case, what I'm suggesting is to draw up a list of common reasons an entire identifier might be problematic, rather than reasons individual characters might be, as at that level we can't talk about keyword collisions or duplicates.
I'm certainly not suggesting that it would be practical to prompt the user to replace every identifier. But in cases where an identifier cannot be unambiguously mapped into the target language, I think it would be cleanest to prompt the user what to do, or at least let them specify a general policy. After all, the only alternative I can see is to error out and leave the user to specify an override in a configuration file, which is basically the same thing, but slightly slower (as the user will have to reinvoke the generator, and the generator will have to start from scratch).
Also, even I've managed to forget about name mangling for the last few comments. Should I open another issue for that? I initially raised it here because loosening requirements on identifiers too much makes the current mangling scheme impractical, but on the other hand, it seems somewhat tangential outside of that.
By "anything more open-ended" I was agreeing with you that starting with Unicode is better than starting with "a sequence of bytes". I don't think interactively prompting users for what to do with identifiers can be our primary/default answer; Wit is an IDL like Protobuf or OpenAPI or MIDL and, like these other IDLs, needs to provide automatic bindings generation as a default option which means the bindings generator needs an automatic way to produce a valid identifier in the source language for any valid Wit name (likely using some variant of escaping).
I still think you're missing my point on both topics, but they're also both tangential, so oh well. Here's a revision of the proposed best practices from before:
- Capitalize according to your language's conventions. [if kebab-case is required] Most Wit identifiers contain hyphen-dashes, which most languages don't allow in identifiers. This is intentional; you should adjust identifiers in your bindings to match your language's conventions. For example, `this-identifier` in Wit might become `this_identifier` in Python, `ThisIdentifier` in C#, and so on. [if casing is freeform] Different languages have different rules and patterns for letter case in identifiers. You should detect word boundaries in identifiers and change the casing to match your language's conventions. Look for underscores between words, as well as capitals on the first letter of each word (possibly except the first word), and for strings of capital letters (indicating acronyms). Don't forget that many writing systems don't have case, [if we allow titlecase characters] and a few characters are both uppercase and lowercase.
- Check your language's identifier syntax. Many languages forbid characters which Wit accepts, such as [adjust as needed] those from languages other than English, numbers, or some punctuation. You'll need a strategy for dealing with these.
- Check for identifiers that are keywords in your language. Every language has a different set of keywords, so eventually you'll see something whose original name is one of yours. You'll need a strategy for these, too. Some languages have a way to escape identifiers so they don't get confused with keywords; if available, this should be considered. However, many developers aren't familiar with those features even when they exist.
- Check for duplicate identifiers. Even though Wit doesn't allow duplicates, some names which were distinct in Wit might become the same after processing for your language. If this happens, [I don't know what to recommend here] (a sketch of the detection half follows this list).
- Allow humans to replace any name if they need to. Inevitably, you'll miss something. Be prepared. And if the user does override any names, make sure to record those overrides somewhere you can find them again, so that if the Wit gets updated, the same bindings can be regenerated without additional human work (unless of course there are new incompatibilities).
- If a name changed, record the original name in the documentation for the binding. This helps humans find the original documentation.
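And since I punted on the duplicate-identifiers bullet, here's at least a sketch of the detection half (Rust, illustrative only; the example assumes a loosened name syntax, and what to do once a collision is found is still the open question):

```rust
use std::collections::HashMap;

// Group Wit names by their mapped target-language form and report any
// collisions; the mapping itself is supplied by the bindings generator.
fn find_collisions(
    wit_names: &[&str],
    map: impl Fn(&str) -> String,
) -> Vec<(String, Vec<String>)> {
    let mut by_mapped: HashMap<String, Vec<String>> = HashMap::new();
    for &name in wit_names {
        by_mapped.entry(map(name)).or_default().push(name.to_string());
    }
    by_mapped
        .into_iter()
        .filter(|(_, originals)| originals.len() > 1)
        .collect()
}

fn main() {
    // Distinct in a freeform-casing world, identical under a (deliberately
    // naive) lowercasing map.
    let collisions = find_collisions(&["HTTPServer", "HttpServer"], |n| n.to_lowercase());
    assert_eq!(collisions.len(), 1);
    assert_eq!(collisions[0].1.len(), 2);
}
```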