Dictionaries
Are dictionaries on the roadmap for the component model? At the moment it seems this requires representation via list<tuple<string, ...>>, where a dictionary could be a relatively straightforward host-level representation over such a datatype.
That's a good question and indeed this is one of those types on the edge. Some reasons why it seems like we should hold off for now:
- We have
recordtypes which subsume a lot of what would otherwise be dictionary use cases, particularly withoptionfields, while providing much more declarative interface information and a more-efficient representation. To wit, the Web IDLdictionarytype is effectively arecordand used extensively in Web APIs, while the Web IDLrecordtype is effectively a dictionary and used relatively infrequently. - It would be a bunch of tricky design and implementation work, since there's a wide design and implementation space of options (esp. around sortedness/uniqueness).
Thus, as with #56, while I can see the use case, I'd like to see if we can get by without these in an MVP release.
I would of course tend towards JS semantics here, so records having unique strings and ordering supported (ie exactly a list<tuple<string, ...>> representation underneath). In theory uniqueness could be managed by the host, in the same way valid USV strings are managed by the host without requiring a check everytime, since it's a strong property of the nature of the data it returns.
I can certainly get by with a list, but my concern is that it's a wide enough pattern that there's a natural temptation to want to decorate the interface with a JS object to give a nice representation - which then forces me back outside of the component model to a space of component model + sprinkles of JS, which seems non-ideal.
I suppose records & tuples alignment in JS would imply unordered record objects these days, so perhaps sorting might be preferable, it's hard to know.
Either way, agreed as with #56 that this is nice-to-have but not necessarily MVP.
Agreed. And just explaining a bit more about the design challenges:
It makes sense that when passing in a host value, the host can already have guarantees of uniqueness; the complexity comes when the Canonical ABI lifts a dictionary from linear memory since the host can't trust anything and thus must verify things dynamically every time.
For ordering: requiring ordered keys (in the list<tuple<string, ...>>) is great for lifting performance (each string can be compared with its predecessor, fused with the copy loop) but JS objects and maps (and, in general, any hash-table-based dictionary) aren't lexicographically ordered, so this would add a non-trivial sorting operation when lowering these into wasm (negating any win from knowing uniqueness). But if we drop the ordering requirement, detecting dupes when lifting linear memory is rather more expensive.
Lastly, we could have dictionary<K,V> simply mean list<tuple<K,V>> and not impose any uniqueness/ordering requirements, which would achieve your goal of being able to derive a nice idiomatic JS interface automatically. But then the worry is that dupes end up leading to corner case bugs.
I'm not sure what the right answer is here, just that it'd be nice to punt on this in the MVP.
I did not made this text pretty with ai. Main idea is how to find types which are maybe_goodtoaddtowit.
There are https://en.wikipedia.org/wiki/Refinement_type and https://en.wikipedia.org/wiki/Dependent_type which ultimately allow to encode something like JSON Shema or XSD and prove things about composability. I have not seen further more typed way to define types. And exiting example https://www.jolie-lang.org/index.html .
What types/properties for these most modern languages can do up to some degree?
It is ordered and unique.
type maybe_map = list<tuple<string, ...>> is not ordered no unique, nor there is a way to verify that. why?
because parsing maybe_map is O(size), while verifying unique is O(size * log size) which is dos attack.
That is https://protobuf.dev/programming-guides/proto3/#maps which is barely usable. What about O(const) of comparison?
Some people tend to reject idea of ordered to be any relevance to unique so https://www.ietf.org/archive/id/draft-devault-bare-11.html#section-2.2 , does not feels good. Specifically he asks to raise error if non unique, but that is dos and cannot be part of general standard.
So from this, may be need to think about ordered first, which allows unique.
We can have key ordered maps, which are not binding to list of tuples, because there is specific key value, and just ordered lists, basis for sets.
So what to do with set<tuple<MyType>>, which would imply that any primitive type as defined has canonical order, and each record has canonical order too. In general can define something. But not sure if I know all alignments. And is order ascending or descending? What order of fields in record used as key?
UPDATE: May be rule that order decided by first 2 items will work.
So having ordered can be hard, hence unique too.
So what can be done? I can document that my ``;
/// ordered
/// unique
type ordered_set<T> = list<T>
or I can make docs structured:
/// refinements:
/// - ordered
/// - unique
type ordered_set<T> = list<T>
or I can think of attributes
#[refinements(ordered, unique)]
type ordered_set<T> = list<T>
Is safe type subset possible?
Possible safe subset which can be done is ascending ordered set where element is fixed primitive type or fixed size list or tuple of such types or when key is like element in set.
How to handle not fixed elements like strings? Some trustful(by "admin"/"sudo") operation can be initialized with something to be validate by O(n log n) to be set/map and stored. And after that set or map of index elements(index of stored element) to be send as key of map/set. So
So may having structured tags or custom attributes is way to have dictionary?
https://github.com/WebAssembly/component-model/issues/58
I agree with the solution of using refinement type to enhanc dict<K,V>.
The guest language can choose the mapping based on its support for refined dict.
type dict<K, V>; // HashMap<K, V>
@predicate(key_sorted)
type sorted_string_dict = dict<string, string> // BTreeMap<String, String>
We have record types which subsume a lot of what would otherwise be dictionary use cases.
This is the wrong way to think about dictionary types. If the key space is finite, records are always preferred. Dictionaries should support the separate use case where the key space is "infinite", e.g. "a phone book", HTTP headers, JSON blobs, etc. Using the latter to emulate the former always turns into tech debt (I know JS does exactly that, but that's exactly why TS and JSON schemas are so popular), so we should think of them as disjoint use cases.
type maybe_map = list<tuple<string, ...>>is notorderednounique, nor there is a way to verify that. why? because parsing maybe_map is O(size), while verifying unique is O(size * log size) which is dos attack. That is https://protobuf.dev/programming-guides/proto3/#maps which is barely usable. What about O(const) of comparison?
This assumes B-tree or similar implementation. Hash tables are amortized Θ(n) to lift and lower and can enforce uniqueness, and they're more broadly support in standard libraries.
The canonical ABI for map<K, V> could be equivalent to list<tuple<K, V>> with overwrite on key collision. Lifting would involve, effectively:
fn lift(entries: Vec<(K, V)>) -> HashMap<K, V> {
let dictionary = HashMap::with_capacity(entries.len());
for (key, value) in entries.iter() {
dictionary.insert(key, value); // Overwrite on key collision. Uniqueness enforced. 👍
}
return dictionary;
}
I suppose the alternative would be to trap on key collisions, but I don't see why that's necessary.
Guest languages could lift them into B-trees but that would involve sorting. Sorted maps are not as standardly-supported as general associative maps and would be better represented as lists anyway, so I think the basic map type should be semantically unsorted and potentially generalize to non-string keys, like the JS Map, but could be restricted to string keys at first.
Having the capacity known ahead of time (as per canonical list ABI) means no resizing during lifting, and hashing tends to be very performant (especially for strings). If the hash function is silly, a malicious component could engineer hash collisions for unequal keys to induce O(n^2), but even MD5 would be difficult to exploit in that way at scale.
Happy to help push this boulder up the hill any way that I can.
I've also been thinking that list<tuple<K, V>>-semantics is probably the best we can do (leaving it up to the lowering wasm code to catch/handle duplicates... or not). One thing we should probably say, spec-wise, is that, just like how we non-deterministically allow but do not require NaN payloads to be scrambled (allowing, but not requiring, hosts to canonicalize NaNs at cross-component boundaries), we could likewise non-deterministically allow-but-not-require duplicate keys to be merged (with well-defined last-value-overwrites semantics when such merging occurs).
One thought in the meantime is that, K=string may (1) cover a large percentage of the use cases, (2) be much simpler to implement in bindings generators in the wide variety of associative maps out there, so it might make sense to start with just string keys.
Another thought is that the lazy lowering ABI may end up having a significant impact on the implementation, so to avoid churn, we might want to sequence this feature after that? Although if anyone wanted to do any prototyping in wit-bindgen in the short term, that'd be great too. Having heard a number of use cases and interest in this since the issue was opened, I do think this feature makes sense to include in 1.0 if folks are willing to implement it.
I'd like to propose an update to the grammar for defining map types. I can take ownership of with implementation of the proposal across the ecosystem if required.
The proposal is inspired by the Protocol Buffers (Protobuf) map type, as it is widely used, well-known, and has been tested over many years across multiple programming languages. However, while the protocol may serve as a superset, it does not strictly adhere to the Protobuf map type.
Below is the suggested format and a brief explanation of the proposed changes:
map ::= 'map' '<' map-key-type ',' ty '>'
map-key-type ::= 'string'
| 'u8' | 'u16' | 'u32' | 'u64'
| 's8' | 's16' | 's32' | 's64'
| 'char'
| 'bool'
# May be extended to support more types in the future; additions are non-breaking changes
Key Aspects:
-
Key Types: The keys of the map are limited to hashable types. This avoids the use of complex key types that would make it difficult to compare keys efficiently.
-
Value Type (ty): The value type can be any supported type, which provides flexibility for the values stored in the map. This may require some extra observation, at least in the case of protobuffer, they do not allow
mapto be a direct value. -
Key Uniqueness: Each key in the map must be unique. If the same key is inserted multiple times, the last provided value will replace any earlier values associated with that key.
-
No Ordering Guarantees: Maps do not maintain the order of elements. The key-value pairs will not be ordered, and users should not assume any specific iteration order when iterating over the map.
-
Nested Maps: Nested maps are supported, meaning the value type (
ty) can itself be a map, allowing for multi-level key-value associations. (Unsure)
- https://protobuf.dev/programming-guides/proto3/#maps
Extra
I would propose to avoid the word Dict it is quite difficult to pronounce, easy to mistype, one letter away from horrible misunderstanding, 1 more letter than map.
@yordis Thanks for posting and I'm excited to hear you're eager to prototype/implement!
The design you're proposing sounds right to me. Initially I assumed that we'd have to have the CABI enforce (via runtime checks) the absence of duplicate keys, but it just doesn't seem like there's any efficient way to do so that doesn't replicate the work already being done by the receiving guest code to construct the source-language object/dictionary/map. I also like the idea of allowing the more general map<K,V> but, at least initially, restricting via validation K to simple scalar types.
One nice thing about this approach is that we can simply say that map<K,V> is a specialization of list<tuple<K,V>>, giving map<K,V> the same ABI and thus allowing map<K,V> to be almost no work to implement in the runtime; all the work would be in the bindings generators to bind to something idiomatically map-y.
I'll wait for some more thoughts here (and take a short vacation in the interim) and then I'd be happy to write up a PR for this repo that can be used to base an implementation on.
Worth to know, https://stackoverflow.com/questions/48201347/protobuf3-why-repeated-map-is-not-allowed
Not sure if that's relevant. Repeated fields in Protobuf are different from lists in the component model. The CM supports lists of lists of lists, etc. I don't see a reason why we can't have lists of maps.
Ok, I wrote up #554; lmkwyt!