plutus icon indicating copy to clipboard operation
plutus copied to clipboard

Syntax of builtins

Open effectfully opened this issue 2 years ago • 1 comments

Describe the feature you'd like

Once #4738 gets merged, the classic syntax of builtins will look as follows:

  1. built-in Integer:
(con integer)
  1. built-in [Integer]:
(con (list integer))
  1. 5 of built-in Integer:
(con integer 5)
  1. [1, 2, 3] of built-in [Integer]:
(con (list integer) [1, 2, 3])

Describe alternatives you've considered

There have been two proposals to improve the syntax:

  1. render Integer as (con (integer)) instead of (con integer) and render 5 as (con (integer) 5) instead of (con integer 5). This is for consistency with rendering of constants of polymorphic built-in types
  2. make a more radical change and render [Integer] as (con list(integer)) and (Integer, Bool) as (con pair(integer,bool))

Kenneth explains:

What we're looking for is a language for writing down built-in types. We want to write down (con <type> <value>) and we want <type> to be unambiguous. To keep the parsing simple I think

(a) we should have one and only one way of writing down each type (b) white space should not be part of the syntax.

I don't like list integer because then we need to write list (list integer) and then maybe we want to allow list (integer) as well and we have two ways of writing down the same type. This seems like trying to re-use Haskell syntax, but that's making things a bit too complicated.

My proposal (and coincidentally what I use in the specification) is that a type is either an atomic type, which is just a string like integer or string, or a type application, op(t₁, ..., tₙ), where the tᵢ are types. This is completely unambiguous and very easy to parse.

We could also allow curried type applications like pair integer string, but I object to that on the grounds that then you once again need to introduce parentheses if the arguments are complex: pair (list integer) string.

Another disadvantage of this is that it's more difficult to tell where the type ends. You have to work your way along parsing types separated by white space and possibly also involving parentheses, and when you see t a b c you have to parse t and then notice that there's something else after it, and probably end up with a nested application like (((t a) b) c) which you have to disentangle later, checking that t is a type operator and that you have the right number of nested applications.

If everything's fully parenthesised then this doesn't arise: you see t , then you see ( and you know that you've got a list of types coming up, so you parse those and stop when you see the corresponding ). You've got the operator and a list of argument types, and it's easy to check that you've got the right number of arguments.

My comments:

if we make it (con (list integer)), then it's equally easy to parse, it's just space-separated rather than comma-separated and it's the same "you see ( and parse until the corresponding )" logic. I'm not sure if whitespace here is any more part of syntax then in, say, [f a b c] which is an iterated application that we use with regular type applications.

However I wouldn't mind pair(integer, bool).

effectfully avatar Jul 04 '22 15:07 effectfully

My selfish reason for liking pair(integer, bool) is that it's standard algebraic notation. I use it in the specification, where everything else looks like that too, and using something sexp-like for types everywhere instead would really mess things up. Whatever we do, I think we should keep the f(x,y,z) notation in the specification; if we do use some other notation in the concrete syntax it wouldn't be a disaster, but we'd have to specify some translation to the parenthesised notation used everywhere else, which would be a pain.

kwxm avatar Jul 05 '22 02:07 kwxm

Previous discussion from #4414 (closed as duplicate even though it's older. Sorry @jmchapman! I didn't realize there was a ticket already when I created this one).

@jmchapman:

Describe the feature you'd like

The parser and pretty printer do not properly support lists, pairs and data.

I would like to use the syntax

con (list bool) [True, False, True] and con (list (pair integer bool)) [(1, True), (500000, False), (0, True)] for lists and pairs.

@SchmErik proposes the following syntax for data which I think is nice:

(program 0.0.0 (con data {
    Map [ (ByteString #0123,   Integer 12345),
          (ByteString #456789, Integer 789453),
          (ByteString #0ABCDE, Integer 12364689486)]
}))

Note that the concrete syntax is not used on chain, is not guaranteed to be production quality, and is subject to change. For testing production code, the flat representation should be used. Concrete syntax is useful for pedagogical and exploratory purposes.

Describe alternatives you've considered

I suppose one could write something more json like inside the curly brackets for data.

@michaelpj:

I'm fine with the syntax for pairs and lists. Although it occurs to me that if we really wanted to stick to our sexp guns we could have (pair x y) and (list x y z) as term syntax (punning with the type syntax).

I don't have very good suggestions for data...

Proposed syntax needs to deal with nested examples when it comes to polymorphic data. e.g.

(con (list (list (list integer))) [[[1], []], [[], [2,3]]])

effectfully avatar Jan 31 '23 23:01 effectfully

We have reached a decision regarding syntax for constant terms. That is, terms of the form:

M \Coloneqq \dots ~ \mathtt{(con} ~ T ~ c_T\mathtt{)}~ \dots

Type tags are as follows:

T \Coloneqq \mathtt{integer} \mid \mathtt{bool} \mid \mathtt{bytestring} \mid \mathtt{string} \mid \mathtt{unit} \mid \mathtt{data} \mid \mathtt{(list} ~ T\mathtt{)} \mid \mathtt{(pair} ~ T ~ T\mathtt{)}

We define a syntactic class of constants for each type tag.

\begin{align}
c_{\mathtt{integer}}    &  \Coloneqq  \mathtt{-}?[\mathtt{0-9}]* \\
c_{\mathtt{bytestring}}  & \Coloneqq  \mathtt{\#}([\mathtt{0-9A-Fa-f}][\mathtt{0-9A-Fa-f}])* \\
c_{\mathtt{string}}     & \Coloneqq  \mathtt{''} U* \mathtt{''}  \\
c_{\mathtt{bool}}        & \Coloneqq  \mathtt{True} \mid \mathtt{False} \\
c_{\mathtt{unit}}        & \Coloneqq  \mathtt{()} \\
c_{\mathtt{data}}       & \Coloneqq  \mathtt{(Constr} ~ c_{\mathtt{integer}} ~ c_\mathtt{(list~data)} \mathtt{)} \\
               &  \mid \mathtt{(Map} ~ c_\mathtt{(list ~(pair ~data ~data))}\mathtt{)} \\
               &  \mid \mathtt{(List} ~ c_\mathtt{(list~ data)}\mathtt{)} \\
               &  \mid \mathtt{(I} ~ c_\mathtt{integer}\mathtt{)} \\
               &  \mid \mathtt{(B} ~ c_\mathtt{bytestring}\mathtt{)} \\
c_{\mathtt{(pair} ~\alpha~\beta\mathtt{)}} & ::= \mathtt{(} c_\alpha \mathtt{,} c_\beta \mathtt{)} \\
c_{\mathtt{(list} ~ \alpha\mathtt{)}}    & ::= \mathtt{[} (c_\alpha (\mathtt{ ,}  c_\alpha)*)?\mathtt{ ]}
\end{align}

where $U$ is a unicode character. Note that for lists and pairs we are actually defining schemas ($\alpha$ and $\beta$ are meta-variables).

Some example constant terms are:

(con (list integer) [ 0 , 1 , 2 ])
(con (pair bool bytestring) (True, #012345))
(con (pair unit string) (() , "hello universe"))
(con (list (list (list integer))) [[[-1], []], [[], [2,3]]])
(con data (Constr 1 [I 2])
(con data (Map [(I 0, B #00), (I 1, B #0F)]))
(con data (List [I 0, I 1]))
(con data (I 2))
(con data (B #001A))

The syntax for pair and lists is the one supported by the current parser. The syntax for data is not implemented yet.

mjaskelioff avatar May 08 '23 13:05 mjaskelioff

@mjaskelioff fantastic, thanks a lot! Since this ticket is just for settling on the syntax, I'm closing it as done. For the implementation we have #5064.

effectfully avatar May 08 '23 13:05 effectfully

@mjaskelioff

I'm working on porting the same format to Aiken.

Is whitespace left unspecified by your spec? If not, I can't figure out the rule for spaces in lists from your examples, considering:

(con (list integer) [ 0 , 1 , 2 ])

This example wraps each item in a space on both sides, but this example:

(con (list (list (list integer))) [[[-1], []], [[], [2,3]]])

has a space before some items in lists, (, [] and , [2,3]) but not others (,3)

Is this just an artifact of providing examples by hand, or is there some rule governing this?

Quantumplation avatar Jul 02 '23 18:07 Quantumplation

Hi @Quantumplation, whitespace can be ignored, so any of the following should work:

  • (con (list integer) [1,2,3])
  • (con (list integer) [ 1 , 2 , 3])
  • (con (list integer) [1, 2, 3])

mjaskelioff avatar Jul 02 '23 22:07 mjaskelioff

Can the spec be updated with this?

waalge avatar Dec 12 '23 10:12 waalge

Yes, it should be updated. I'll do it, but I do not know when yet. Hopefully beginning of next year.

mjaskelioff avatar Dec 13 '23 17:12 mjaskelioff

@mjaskelioff The spec seems kind of contradicting the examples you provided. The spec would require the following syntax:

(con data (Constr 1 [(I 2)])
(con data (Map [((I 0), (B #00)), ((I 1), (B #0F))]))
(con data (List [(I 0), (I 1)]))
(con data (I 2))
(con data (B #001A))

But your examples show the parentheses omitted in nested data (which is good for readability). However this leaves us with an ambiguous edge case. What about builtin list or pair of data? I interpret it that "data has parentheses around its outermost occurrence" which would result in

(con (list data) [(I 0), (B #)])

But interpretations can also allow

(con (list data) [I 0, B #])

Which is ambiguous - does this denote a list of data or a plutus data list? (if you ignore the type string)

nielstron avatar Dec 13 '23 18:12 nielstron

I'm sorry but I don't understand the ambiguity. This syntax is for plutus core. What do you mean by a "plutus data list"?

mjaskelioff avatar Dec 13 '23 19:12 mjaskelioff

The examples omit parentheses around data constructors. Parenthesis around pairs cannot be omitted so you cannot interpret [I 0, B #] as a singleton list of pairs.

mjaskelioff avatar Dec 13 '23 19:12 mjaskelioff

Yes, it should be updated. I'll do it, but I do not know when yet. Hopefully beginning of next year.

@mjaskelioff This is PLT-378, which I'm planning to do next week. It kind of got forgotten about but I dug it out recently.

kwxm avatar Dec 13 '23 21:12 kwxm

I'm sorry but I don't understand the ambiguity. This syntax is for plutus core. What do you mean by a "plutus data list"?

Sorry I could have been a bit clearer here.

Short rant about the syntax The whole constant is not ambiguous, its just the part of the right. Its not clear if the right part (`[I 0, B #]`) describes a Plutus List (`data`) or a Builtin List of Plutus Data (`list data`). Non-ambiguity is not necessary here either since we have the type description. I see the point that pure data would be `List [I 0, B #]`.

The thing is just that according the syntax the parentheses should not be omittable neither in nested Plutus Data nor in Builtin Lists/Pairs containing Plutus Data. Now tools like @aiken-lang do enforce omitting the parentheses for Plutus Data in Builtin Lists while the @opshin uplc implementation requires the parentheses around the outermost Plutus Data occurrence. Should it be optional in any case or always omitted in nested data structures or ...? The spec could benefit from examples that cover Builtin Pair/List containing Data or details regarding the use and omitting of parentheses.

In summary the question I have is: which of the following notations is valid for nested builtin containing Plutus Data? A, B or both?

A: (con (list data) [I 0, B #])
B: (con (list data) [(I 0), (B #)])

A: (con (pair data data) (I 0, B #))
B: (con (pair data data) ((I 0), (B #)))

nielstron avatar Dec 14 '23 10:12 nielstron

The whole constant is not ambiguous, its just the part of the right. Its not clear if the right part ([I 0, B #]) describes a Plutus List (data) or a Builtin List of Plutus Data (list data).

The part of the right is not ambiguous. A data list has a List in front while list doesn't. So [I 0, B #] is a list data, while List [I 0, B #] is a data list.

mjaskelioff avatar Dec 14 '23 12:12 mjaskelioff

Okay. I also just realized that there is a uplc distributable as part of the releases now. The binary clarifies that A from above is the only valid notation (B is rejected and A is accepted by the uplc binary).

nielstron avatar Dec 14 '23 13:12 nielstron

The whole constant is not ambiguous, its just the part of the right. Its not clear if the right part ([I 0, B #]) describes a Plutus List (data) or a Builtin List of Plutus Data (list data).

The part of the right is not ambiguous. A data list has a List in front while list doesn't. So [I 0, B #] is a list data, while List [I 0, B #] is a data list.

I think we may change this in the parser soon. PR #5692 updates the specification to include the concrete syntax for lists, pairs, and data, and when doing this we realised that there is in fact some inconsistency there. When you write a data object as part of a constant it has to be enclosed in parentheses, like (con data (I 5)). When you write one as part of a built-in list the parser requires you to leave them out, so you have to write (con (list data) [I 5, B #A5F3]). The specification now says that data items always have to be enclosed in parentheses, so you'll have to write (con (list data) [(I 5), (B #A5F3)]), which is a bit annoying. We'll come back to this after the holidays and either tighten up the parser to require the parentheses or change the specification to be a bit more relaxed about where parentheses are required.

Sorry for the confusion! If we'd added the syntax to the specification earlier we might have noticed this and fixed it a while ago.

kwxm avatar Dec 22 '23 13:12 kwxm

Has the "lets make a sexpr compliant spec" ship sailed? :pleading_face:

waalge avatar Dec 22 '23 13:12 waalge