motoko icon indicating copy to clipboard operation
motoko copied to clipboard

Don’t hash field names in IDL?

Open nomeata opened this issue 5 years ago • 8 comments

Now that we have the type destription table, it is worth considering to not hash field names.

Andreas writes:

The main thing that bothers me about using strings for field names in the actual transport format is that it would require yet another indirection through a string table to avoid possible blow-up. Besides basic size increase, that would complicate encoders further by requiring an additional pass over the type to extract and dedup field names.

Joachim replies:

The main thing that bothers me about using strings for field names in the actual transport format is that it would require yet another indirection through a string table to avoid possible blow-up.

Agreed.

Besides basic size increase, that would complicate encoders further by requiring an additional pass over the type to extract and dedup field names.

In terms of size, we are probably still much better off than all those people who use JSON or CBOR (where field names are just repeated every time).

As long as the field name table is not required to be sorted, it could be created in the same pass that creates the type table.

With my IDL-implementer hat on, I think I’d say the extra complexity isn’t out of proportion to what we have right now (such as preparing the type table, including recursion and deduplication there). And I wouldn’t mind dealing with that to get the benefits of this change (better debugability, no worries about IDL hash collision affecting the Motoko type system, no worries about hash collisions in higher-order use).

If the field name table is not sorted, the question arises whether we want records to be sorted by field name, field name index, or nothing. I’d prefer “by field name”, as that makes the decoding easier (I know in which order I have to fill my fields; easier to spot and reject duplicates).

nomeata avatar Jan 07 '20 09:01 nomeata

Another minor point is that field label is either string or nat. We will need another extra bit to distinguish these two cases.

chenyan-dfinity avatar Jan 07 '20 19:01 chenyan-dfinity

Ah, right. I believe we have nat there in the first place for two reasons

  1. field names may be hashes
  2. ordered, unnamed fields, i.e. tuples

If the first reason goes away, is the second strong enough to warrant that complication? There are various options:

  1. Indeed have fields labels be of type nat + string.
  2. Encode tuple fields using strings (e.g. "1" or "field1" or something like that) instead of naturals.
  3. Don't treat tuples like records (and maybe instead make the sequences that we have as function arguments/results first-class). Additional benefit: We no longer have these odd “record/tuple” mixes that are (I believe) quite unnatural in almost all host langauges.

nomeata avatar Jan 07 '20 22:01 nomeata

BTW, the size may actually be smaller in some cases.

For example, if you have a field name name that appears in n types. The leb128-encoding of the 32bit number H("name") is likely 5 bytes long, so that’s n * 5. But the leb123 encoding of the string table index is likely only 1 byte. The label increases the string by 5 bytes. So 5 + 2*n. So if n ≥ 2, the string table based variant is actually smaller.

nomeata avatar Jan 07 '20 22:01 nomeata

I want to argue in favor of using the field names, not their hashes, in the Candid data streams that we produce and consume.

(Thanks to @nomeata for encouraging me to contribute these thoughts.)

Today, we need a Candid definition file to reverse the hashes and decode the field names of raw Candid data.

However, it would be much simpler to write powerful tools that work (generically) over Candid data if this step were not necessary: The solution of having the Candid definition file doesn't help tools that see raw data in contexts where those definitions are missing (e.g., a log of all messages in a complex, multi-canister system).

To be maximally useful, such tools need to:

  • display human-readable field names to the user, to print in logs, in errors, etc,
  • run without additional context about the service type associated with this data (in fact, there may be several).

Further, if we move in this direction (extra bonuses that become reachable):

  • make the use cases for Candid more like those of JSON today, where JSON is arguably misused to represent structured data that could be represented in Candid, especially if we make this change.
  • future Candid tools could use these names for generic queries (in the style of GraphQL, or other query languages for JSON-like data), especially when we begin to generate corpus of messages from tests.

matthewhammer avatar Jul 22 '20 16:07 matthewhammer

Just another occurrence of this. In https://github.com/dfinity-lab/dfinity/pull/4812 I ran into

Public Spec acceptance tests
  canister lifecycle:                    FAIL (9.93s)
    Is running? (3.12s)
      src/IC/Test/Spec.hs:1091:
      Candid decoding error: Unexpected variant tag 3099235807

because the implemenation was using the wrong case (Stopped rather than stopped). I guess correctly right away, but the debugging experience would be nicer if the error message could say which tag was actually there.

nomeata avatar Aug 05 '20 11:08 nomeata

~~Not against the idea, but if the decoder takes the Candid type, it can output the expected field name, see the JS implementation: https://github.com/dfinity-lab/sdk/blob/master/src/agent/javascript/src/idl.ts#L802~~

Never mind, this doesn't work for variant.

chenyan-dfinity avatar Aug 05 '20 18:08 chenyan-dfinity

I wrote that idea down somewhere else a while ago, but probably should be here too:

We could include field names in a backward compatible way, by introducing a convention like this: We reserve a future type which we will not use otherwise, and include a value of that type as the last argument. Old decoders will skip it (because it’s a future type), and even if they extend their parameter list using opt t, it will do the right thing. In that value we give the list of all field names present. This allows tools to build a map from hash to field name, and print things nicely.

nomeata avatar Apr 30 '21 18:04 nomeata

Pretty sure this ship has sailed. Candid records/variants use hashes as field names.

@crusso Good to close?

christoph-dfinity avatar Apr 15 '25 07:04 christoph-dfinity