Problematic cost model serialization
The current cost model serialization scheme is problematic: https://github.com/input-output-hk/cardano-ledger/blob/2505b7f103c78ee5a230b3e302e31d9607fffbd5/eras/babbage/test-suite/cddl-files/babbage.cddl#L324-L327
There are two problems:
- We cannot add a `V3` cost model in the same protocol parameter update that initiates a hard fork. We can only count on nodes to be updated after a hard fork (`chainChecks` provides this fantastic guarantee), and so nodes that have not updated would not be able to deserialize a `V3` cost model. This is operationally annoying for us, and also means we have to wait an extra epoch to use new versions of Plutus.
- We cannot add new fields to existing cost models.
We need a more flexible serialization scheme that addresses these two problems. Note that the ledger deserializes the CBOR list of integers into a map by using keys provided to us by the Plutus library (where the list is assumed to be ordered corresponding to the alphabetical sorting of the keys).
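The current list-to-map conversion might be sketched like this; `costModelFromList` and its arguments are hypothetical names for illustration, not the ledger's actual API:

```haskell
import           Data.List (sort)
import qualified Data.Map as Map
import           Data.Text (Text)

-- Hypothetical sketch: pair the flat CBOR list of integers with the
-- alphabetically sorted parameter names supplied by the Plutus library.
costModelFromList :: [Text] -> [Integer] -> Maybe (Map.Map Text Integer)
costModelFromList paramNames ints
  | length ints == length paramNames =
      Just (Map.fromList (zip (sort paramNames) ints))
  | otherwise = Nothing  -- the current rigid scheme rejects any other length
```

The rigidity is in the last guard: any list whose length does not exactly match the known parameter count fails to deserialize, which is what makes adding fields or versions impossible without new software.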
I talked with @Soupstraw, @bezirg, and @michaelpj, and we made this plan:
The Plan
- The Plutus function `mkEvaluationContext` will be changed to take `[Integer]` instead of `Map Text Integer` (matching the wire spec).
  - If `mkEvaluationContext` receives fewer integers than it needs (for a given version of Plutus), it will return an error message.
  - If `mkEvaluationContext` receives at least as many integers as it needs, it will return an evaluation context, possibly with a warning that too many integers were supplied and ignored.
- The `CostModel` serialization will be made more permissive.
  - It will accept any `[int]`.
  - It will store whatever `[int]` is given to it, and also the result of `mkEvaluationContext`.
  - If it is for a version of Plutus that the ledger is unaware of, it will not call `mkEvaluationContext` but instead store an appropriate error.
- Note that we need to figure out a way to become more permissive in the cost model serialization scheme only at a hard fork boundary, in order to not risk splitting the network. Cost models can currently only be changed by way of the governance mechanism, however, so if we trust that all V1 cost model updates prior to a hard fork have exactly 166 integers (and 175 integers for V2), we could just relax the scheme to `[int]` (at the risk of an operational mistake by the governance key holders). In the past, we have handled such a maneuver by replacing a type with a type family, but that tends to be a fair amount of work and complicates the code. Perhaps there are other ideas?
How this works, new language built-ins
Suppose version x of the node supports Plutus V2 cost models with 10 fields. Suppose version x+1 of the node supports Plutus V2 cost models with 11 fields. Let f be the new field, let m be the current major protocol version, and suppose that Plutus V2 will not support f until m+1.
During major protocol m:
- Node `x` sees a cost model with 11 fields; it makes an `EvaluationContext` with the 10 fields it knows about, but also stores a warning and the original 11 fields.
  - It happily evaluates all V2 scripts, and Plutus will gracefully fail (phase 2) if it sees `f` (it will fail to deserialize).
- Node `x+1` sees a cost model with 11 fields; it makes an `EvaluationContext` with all of them.
  - It also happily evaluates all V2 scripts, and Plutus will gracefully fail (phase 2) if it sees `f` (based on `m`).
- Node `x` shuts down, the operator updates the software, and comes back up with a node `x+1`. Upon re-serialization of the ledger state, the operator is now in the case above (a normal `x+1` node).
During major protocol m+1:
- Node `x` can no longer participate in the chain, due to the `ObsoleteNode` error.
- Node `x+1` is ready to process scripts with `f` in V2.
How this works, new Plutus versions
Suppose version x of the node does not know anything about Plutus V3. Suppose version x+1 of the node adds support for Plutus V3. Let m be the current major protocol version, and suppose that Plutus V3 is introduced at m+1.
During major protocol m:
- Node `x` sees a cost model for V3. It stores the cost model integers, but in place of an `EvaluationContext` it stores an error.
  - Any V3 script will be rejected since `x` does not know how to deserialize V3.
- Node `x+1` sees a cost model for V3; it makes an `EvaluationContext` for it.
  - Any V3 script will be rejected by the Plutus evaluator based on `m`.
- Node `x` shuts down, the operator updates the software, and comes back up with a node `x+1`. Upon re-serialization of the ledger state, the operator is now in the case above (a normal `x+1` node).
> It will store whatever `[int]` is given to it, and also the result of `mkEvaluationContext`

... which might be an error. Do we need to store the underlying `[Int]`? I guess it's needed for re-serializing.
> the ledger gives the Plutus evaluator `m`, and Plutus knows `f` isn't okay during `m`

This only applies for node `x+1`. Node `x` has an old Plutus which doesn't know about `f` at all, but that's also fine, it'll be a deserialization failure that happens in the same place.
> Any V3 script will be rejected for not having a cost model.

Won't it be rejected before that for simply not being recognized as a known language by node `x`?
> Any V3 script will be rejected by the Plutus evaluator based on `m`

This is currently not happening: we should change this. We're currently relying on the absence of the cost model to disable a version.

Although in the scenario you mention, node `x+1` will see a cost model for V3 in an update proposal, but it won't actually be applied until the HF that introduces V3, right? So it is still fine... but I'd be more comfortable with an explicit guard on our side also.
> Any V3 script will be rejected for not having a cost model.
>
> Won't it be rejected before that for simply not being recognized as a known language by node `x`?

I agree with @michaelpj on this. The ledger calls into plutus-ledger-api in a static way; a specific node version can call into a fixed number of statically linked Plutus versions, e.g. V1 and V2. If a newer, currently unknown language version comes in, e.g. V3, the node running the old software will not be statically linked with any V3.
> Any V3 script will be rejected by the Plutus evaluator based on `m`
>
> This is currently not happening: we should change this. We're currently relying on the absence of the cost model to disable a version.

I think we can enforce this on plutus-master by having a simple extra check inside `evaluateScriptRestricting`/`Counting`:

```haskell
when (passedProtocolVersion < expectedNextMajorHardForkIntroducingThisVersion) $
  fail "in phase2"
```
> Although in the scenario you mention, node `x+1` will see a cost model for V3 in an update proposal, but it won't actually be applied until the HF that introduces V3, right? So it is still fine... but I'd be more comfortable with an explicit guard on our side also.

Yes, an extra guard on our side can be helpful in case of a mismatch between the Plutus code version and the ledger code version.

@JaredCorduan your logic makes sense.
> Note that we need to figure out a way to become more permissive in the cost model serialization scheme only at a hard fork boundary, in order to not risk splitting the network. Cost models can currently only be changed by way of the governance mechanism, however, so if we trust that all V1 cost model updates prior to a hard fork have exactly 166 integers (and 175 integers for V2), we could just relax the scheme to `[int]` (at the risk of an operational mistake by the governance key holders). In the past, we have handled such a maneuver by replacing a type with a type family, but that tends to be a fair amount of work and complicates the code. Perhaps there are other ideas?

The problem and this thing with the type families I don't get at all, so no ideas from my side.
@JaredCorduan If we add a check on our side to reject language versions that shouldn't be enabled in particular protocol versions, we have a choice about whether we include it in `isScriptWellFormed` or not. If we do, this will be a phase-1 failure; if we don't, it will be a phase-2 failure. I think we should try and make as many things as possible phase-1 failures, so I'm inclined to do it unless you disagree?
@michaelpj

a) If we make it a phase-1 error, we preclude submitting+executing any new-language scripts until the HF.
b) If we make it a phase-2 error, we allow submitting new-language scripts but disallow executing them until the HF.

I don't know if there is a use case for (b).
I guess it maybe prevents you from submitting a reference script with a new language version before the HF? Does the ledger deserialization refuse to deserialize transactions containing newer scripts from newer languages before the HF?
> Do we need to store the underlying `[Int]`? I guess it's needed for re-serializing.

Exactly.
> the ledger gives the Plutus evaluator `m`, and Plutus knows `f` isn't okay during `m`
>
> This only applies for node `x+1`. Node `x` has an old Plutus which doesn't know about `f` at all, but that's also fine, it'll be a deserialization failure that happens in the same place.

Good catch, thank you! I will edit The Plan.
> Won't it be rejected before that for simply not being recognized as a known language by node `x`?

Yes, another good catch! I'll edit that as well.
> This is currently not happening: we should change this. We're currently relying on the absence of the cost model to disable a version. Although in the scenario you mention, node `x+1` will see a cost model for V3 in an update proposal, but it won't actually be applied until the HF that introduces V3, right? So it is still fine... but I'd be more comfortable with an explicit guard on our side also.

I would also much prefer it if Plutus could guard this for us as well; otherwise it depends on the timing of the updates (i.e. the governance holders have to remember that they must never update the new cost model an epoch before the HF vote, etc.).
> The problem and this thing with the type families I don't get at all, so no ideas from my side.

Yea, no worries, this is a problem on the ledger side.
> Does the ledger deserialization refuse to deserialize transactions containing newer scripts from newer languages before the HF?

That would be great. If we want to allow folks to create reference scripts in advance for unreleased languages, we would need a scheme for transaction outputs similar to the one we've just made for the cost models. It's the same problem: the current serialization scheme for transaction outputs does not allow for new languages in reference scripts, and we can only guarantee that folks have the new software after a hard fork. Having `isScriptWellFormed` check the Plutus version against the protocol version provides us a clean transition.
Let me leave future Jared some proof that the current serialization scheme is not flexible:

- Babbage uses the same script type as Alonzo: https://github.com/input-output-hk/cardano-ledger/blob/c32acb6e90ed89c58ef411baead4af554518ff0b/eras/babbage/impl/src/Cardano/Ledger/Babbage.hs#L204
- Reference scripts are just deserialized as CBOR-in-CBOR scripts: https://github.com/input-output-hk/cardano-ledger/blob/c32acb6e90ed89c58ef411baead4af554518ff0b/eras/babbage/impl/src/Cardano/Ledger/Babbage/TxBody.hs#L693
- The Alonzo script type deserialization is not flexible with respect to languages: https://github.com/input-output-hk/cardano-ledger/blob/c32acb6e90ed89c58ef411baead4af554518ff0b/eras/alonzo/impl/src/Cardano/Ledger/Alonzo/Scripts.hs#L391-L401
> I think we should try and make as many things as possible phase 1 failures

Absolutely. We can add this to our not-yet-existent list of guiding principles. (It's not dogma; everything is still open for discussion, though.)
> If we want to allow folks to create reference scripts in advance for unreleased languages

I think I would actively prefer it if they can't :D
> > If we want to allow folks to create reference scripts in advance for unreleased languages
>
> I think I would actively prefer it if they can't :D

That's definitely my preference as well, I was just trying to tease out the implications (a weak form of a proof by contradiction :) ).
the last piece of the puzzle here will be solved by #3014 :raised_hands: (namely how to gracefully change the serialization at a hardfork boundary)
:rofl: this was accidentally and automatically resolved by me saying:

> In order to resolve #2902, we still need to
@JaredCorduan AI at its finest. Joking, it was a regex "AI"
:tada:
@lehins and I had a discussion on this, and the conclusion is that there are a few problems with the "How this works, new language built-ins" part, and it doesn't currently work:
- A transaction proposing to update the number of fields to 11 does not deserialize, due to the use of `decodeCostModelFailHard`.
- "Node x sees a cost model with 11 fields, it makes an EvaluationContext with the 10 fields it knows about, but also stores a warning and the original 11 fields" does not appear to be implemented correctly.
- Most importantly, `mkEvaluationContext` fails upon receiving fewer parameters than expected. At the beginning of protocol version `m`, there are only 10 parameters, and it only becomes 11 later. But node `x+1` always expects at least 11. How can node `x+1` validate the chain then?
cc @michaelpj @bezirg
> Most importantly, `mkEvaluationContext` fails upon receiving fewer parameters than expected. At the beginning of protocol version `m`, there are only 10 parameters, and it only becomes 11 later.

So, what we have discussed in this context is, I think:

- We end up in the situation where the major protocol version says that e.g. PlutusV3 is allowed, but we do not yet have a cost model for it. We need this to be fine.
- The only sensible behaviour in this situation is for PlutusV3 scripts to fail at evaluation time. So in the interval between the protocol version change and the cost model being installed, you just can't run any PlutusV3 scripts, which seems okay.
- `mkEvaluationContext` therefore needs to do the following when given too few parameters:
  - Not fail at construction time
  - Fail when evaluating any script

Does that sound about right?
Also it would be super if we could somehow set up a test for this. I don't know if the ledger has cross-hard-fork tests?
> So, what we have discussed in this context is, I think:

I think your first two points are related to the case where we add a new Plutus version, but the problem here is with adding new builtins to an existing Plutus version.
> `mkEvaluationContext` therefore needs to do the following when given too few parameters: Not fail at construction time

The solution outlined above says: "If `mkEvaluationContext` receives fewer integers than it needs (for a given version of Plutus), it will return an error message". And this has always been how it works.

If we change `mkEvaluationContext` to not fail when given too few parameters, then a problem (that @lehins brought up in the discussion) is that this would allow someone to submit a proposal reducing the number of cost model parameters, which cannot be allowed. But perhaps we can rely on the committee/DReps to reject such proposals? Either way, there seems to be a lot of work to be done to make this actually work.
> Fail when evaluating any script

We'll also need to make sure there's no significant performance overhead if we need to check the condition at script evaluation time.
> I think your first two points are related to the case where we add a new Plutus version, but the problem here is with adding new builtins to an existing Plutus version.

Right, so it is more subtle. Perhaps the tightest solution is to fail iff we are missing parameters that are needed for a builtin in the script at hand. But that's tricky. One way we could do it would be to set any missing parameters to MAX_VALUE, so that any attempt to use them will just blow out the budget?
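The MAX_VALUE idea could be sketched like this (`padParams` is a hypothetical helper, not an agreed design):

```haskell
import Data.Int (Int64)

-- Hypothetical sketch: pad a too-short cost model parameter list with a
-- huge cost, so any builtin whose parameters are missing immediately
-- exhausts the execution budget instead of failing at construction time.
padParams :: Int -> [Integer] -> [Integer]
padParams expected ints =
  take expected (ints ++ repeat (toInteger (maxBound :: Int64)))
```

This keeps `mkEvaluationContext` total on short lists while preserving the invariant that scripts touching the missing builtins cannot succeed.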
OK, we might have a solution for this, which would require a minimal amount of work.

The most important part of the solution is for `mkEvaluationContext` to never fail if it receives 233 or more parameters, regardless of the protocol version that the node is running.

This would mean that Plutus script execution could receive an `EvaluationContext` that was built from 233 parameters at any point, regardless of the presence of a new primitive or the protocol version that the node is running. That is because anyone can propose and potentially enact a `CostModel` update with 233 parameters for PlutusV3, even after the intra-era hard fork that will introduce new primitives.

I also suggested we implement an optional primitive like that right now, which would work if we had 234 parameters in PV9. This would actually allow us to write some tests for this behavior before we hard fork into Conway.
By the way, The Plan above says:

> It happily evaluates all V2 scripts, and Plutus will gracefully fail (phase 2) if it sees `f` (it will fail to deserialize).

> It also happily evaluates all V2 scripts, and Plutus will gracefully fail (phase 2) if it sees `f` (based on `m`).

I believe both are phase-1 failures, not phase-2.
> I believe both are phase-1 failures, not phase-2.

Yes, that will be either a `MalformedScriptWitnesses` or a `MalformedReferenceScripts` phase-1 validation failure, depending on where in the transaction the script is. In both cases it will have to be a Plutus script deserialization failure.