lotus icon indicating copy to clipboard operation
lotus copied to clipboard

`eth_call` incorrectly reconciling deferred (Filecoin) and immediate (Ethereum) execution

Open raulk opened this issue 2 years ago • 13 comments

Context: https://github.com/cerc-io/stack-orchestrator/pull/509#issuecomment-1691735496

Ethereum clients and libraries expect that calling eth_call with a given epoch (e.g. 1234) will apply the call on the output state of the supplied epoch. In Ethereum, the output is computed immediately and the root hash is part of the block header of height 1234. In Filecoin, deferred execution causes the output of epoch 1234 to be calculated on the next non-null round, normally 1235 (but not necessarily).

The logic here is flawed.

  • The correct behaviour is to apply the call on the parent state root of the next tipset (execution tipset).
    • If this supplied epoch is the "head" epoch, we should error out, thus preserving consistency with the rest of the Eth module (ie. the tipset has still not been executed and is not "exposed" to Eth endpoint clients).
  • We should not be doing any trickery with individual messages (e.g. finding prior messages, filtering, etc.) down the line, at all (see callees of this function).
  • The LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS env var introduced in #10546 must go.
    • This causes inconsistent behaviour across RPC endpoints in the ecosystem depending on how this env var is configured.
    • When enabled, it deliberately fails to meet the required functionality / expectations of eth_call.
    • Due to a chain of circumstances, all new RPC endpoints being set up in the Filecoin ecosystem have this flag enabled.
    • Therefore, all new endpoints are failing to meet expectations.

raulk avatar Aug 24 '23 14:08 raulk

Yeah, this logic is definitely wrong.

Stebalien avatar Aug 24 '23 14:08 Stebalien

Ah, ok, I remember why we did this.

The old logic was "as correct as we could make it", at least without significant work. The issue is that:

  1. If we execute on the "final" parent state tree, we need to execute at epoch+1. Otherwise, we'll execute at an epoch where cron has already run (which means all the miner actors will be in an unexpected state which could cause issues estimating window post gas).
  2. To execute at epoch+1, we'd need a tipset at epoch+1. Otherwise, we'll have issues fetching tipset CIDs and randomness (this was causing Eth transactions to fail).

So, we went with the option of applying all messages in the specified epoch, then applying the user's message at the specified epoch.

The alternative would be to make some form of "fake" tipset and/or hack the lookback logic to make everything work even when we're executing at the wrong tipset. Which.. it sounds like we need to do.

One remaining question is: does eth_call invoke the passed message at the specified epoch, or at epoch+1?

Stebalien avatar Aug 24 '23 15:08 Stebalien

Ok, this shouldn't be difficult to fix in the ethereum case.

Stebalien avatar Aug 24 '23 15:08 Stebalien

Thanks for the extra context!

If we execute on the "final" parent state tree, we need to execute at epoch+1. Otherwise, we'll execute at an epoch where cron has already run (which means all the miner actors will be in an unexpected state which could cause issues estimating window post gas).

I see. Yes, this could lead to inconsistency, e.g. if the eth_call ends up accessing some miner or market state that has changed with the execution of cron. But we might have to bite the bullet and put a warning label on this API operation.

The alternative would be to make some form of "fake" tipset and/or hack the lookback logic to make everything work even when we're executing at the wrong tipset. Which.. it sounds like we need to do.

Are we able to achieve something like this without increasing the computational and IO cost of eth_call, though? This call is commonplace, supposed to be light, and the expected latency is low.

One remaining question is: does eth_call invoke the passed message at the specified epoch, or at epoch+1?

In Ethereum, eth_call requests are executed on top of the resulting state root of the block at the specified height, and the visible height from within EVM bytecode is the same height.

raulk avatar Aug 24 '23 16:08 raulk

So, an issue here is:

If this supplied epoch is the "head" epoch, we should error out, thus preserving consistency with the rest of the Eth module (ie. the tipset has still not been executed and is not "exposed" to Eth endpoint clients).

We currently "support" pending so we'll need to be careful not to break users.

One remaining question is: does eth_call invoke the passed message at the specified epoch, or at epoch+1?

In Ethereum, eth_call requests are executed on top of the resulting state root of the block at the specified height, and the visible height from within EVM bytecode is the same height.

Are you sure because my research indicates otherwise (although I haven't actually tested it).

Stebalien avatar Aug 24 '23 16:08 Stebalien

Here's what I currently have: https://github.com/filecoin-project/lotus/pull/10462

But see the question at the end.

Stebalien avatar Aug 24 '23 16:08 Stebalien

Are you sure because my research indicates otherwise (although I haven't actually tested it).

Yep. Here's a reference: https://github.com/ethereum/go-ethereum/blob/1a2135044cd498039be46e1b611627665ff6c4bc/internal/ethapi/api.go#L1109. This loads the state with the root specified in the block header at the requested height, which in Ethereum is the post root state. And the height fed into the EVM is the supplied height.

raulk avatar Aug 24 '23 17:08 raulk

Great. That's what I get for trusting stack overflow...

So.... we have two options:

  1. Tell people not to set that env var. That does, in fact, do the right thing from the perspective of Ethereum by applying all messages in the tipset, followed by the specified message. But it's slow.
  2. Just.... apply the message after cron but "fake" the block number? But that may mess some stuff up in miner actors with respect to deadlines. Maybe that doesn't matter for ethereum?
  3. Do 1, but cache it.

3 is, disgustingly, the most correct option at least as far as ethereum is concerned, but I really don't want to do that as it'll be an ETH JSON-RPC API specific feature.

Stebalien avatar Aug 25 '23 04:08 Stebalien

We should discuss further in slack, but we are currently working on a filecoin grant to add the indexing required to support this sort of usage. https://github.com/vulcanize/filecoin-indexing/issues/4

AFDudley avatar Aug 25 '23 13:08 AFDudley

So here's what I think the problem is:

  • According to the specifications for the Ethereum RPC API, the eth_call API is designed to execute the user's transaction against the state as it exists following the completion of the user-specified block. This implies that the state used to apply the users transaction must be the state that comes after executing all transactions contained in the block

  • Now, of course, Filecoin has deferred execution, but we still need to respect the eth_call RPC behaviour. So, here's what we do in Filecoin:

    • In the current implementation, Filecoin's eth_call API takes the parent state root of the user specified block
    • It then applies all the transactions in the user specified block on top of the parent state root to get the new state at the end of the block and then applies the user's ethcall transaction on top of it -> This gets us ETH like behaviour
    • Note 1: The Cron Actor in Filecoin is executed at the end of every epoch and the state is updated accordingly. The subsequent ParentStateRoot will reflect the result of applying the cron actor actions. But in the case of eth_call, the transaction will be applied to the end state without applying any Cron actor action. But this is fine.
    • Note2: that the LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS flag here can play a spoil sport for obvious reasons but this is not hard to fix. We can just ignore this flag for the eth_call API
    • All of the above works fine and as expected (please correct me if I am wrong). The real problem here is that this is slow as we're re-applying each transaction in the block
  • A better way of doing this would have been to take the parent state root of the tip set following the eth_call epoch. That state would contain the result of all the transactions in the user specified epoch and so we wouldn't have to re-execute any transactions

    • Note: There would be a semantic change here as this state will ALSO contain the result of running the cron Actor (is this bad ? Are we okay running the eth call on top of the state that comes after the cron actor is run as long as we can guarantee that we're running it at the immediate next block after the user specified block ?)
  • But now the problem here is that if there are single/multiple null tipsets after the user specified epoch, then the state on which the eth_call transaction will be applied will also contain the result of the execution of the cron actor on all those null tipsets because cron actor is executed every epoch and even for null tipsets. This means that the state on which the eth_call transaction is applied is not exactly the state at the end of the user specified epoch. That is, the final state on which the eth_call transactions is applied could contain the (state from executing the transactions in the user specified block + changes from multiple cron actor invocations at multiple null tipsets in multiple epochs) -> This is bad because it breaks ETH behaviour and could make the state hard for the user to reason about.

Here's what the solution can be

  • First of all -> ignore the LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS flag in the eth_call API
  • Do nothing else -> eth_call behaviour remains correct BUT slow
  • Apply the eth_callon top of the ParentStateRoot of the first non-null tipset following the user specified block -> Ship the API with a warning about cron actor state shenanigans but this should be fine as I am not sure how much the ETH API/ETH transactions care about cron actor state correctness

aarshkshah1992 avatar Apr 16 '24 05:04 aarshkshah1992

Hey @aarshkshah1992 -- this is definitely a rough edge, and it's currently coming back to haunt us at IPC 👻 (cc @aakoshh @snissn @cryptoatwill).

We're not necessarily concerned about the cron actor state per-se. In fact, the cron actor state only contains system-defined cron entries, and therefore can be deemed static from a practical viewpoint.

The problem is with all the remaining built-in actors state objects that are modified as a result of cron ticks either directly or indirectly (e.g. miner actor, storage power actor, etc.)

Earlier in this discussion I suggested to call it a day and slap a warning on eth_call, but since then I have acquired more understanding on how certain Filecoin apps like lending pools are designed, and I now suspect this can be dangerous.

To illustrate the scenario, imagine you are lending pool operator with a backend that monitors the chain for accounting or decision-making via a combination of events and eth_calls (e.g. something very common when using The Graph). This case is pretty extreme, but it exposes the correctness flaw:

  1. You are happily polling some smart contracts by calling eth_call on a contract function on every height. That contract function reports the risk level at a particular height. It does so by monitoring a bunch of facts about the miner, e.g. including its RBP on the power actor.
  2. The chain suffers a sequence of null rounds at heights 1000-1050, but you are not aware of them because you're a poor Eth client and the Eth chain has no such concept.
  3. During this blackout of null rounds, the SP had failed to PoSt and some sectors got terminated, in the cron tick at height 1025. This is enough to flip the risk level as perceived by the lending pool from "none" to "high".
  4. If we fix this by performing the eth_call on the ParentStateRoot of the first non-null round (height 1051), we will mislead the application into thinking that the miner reached the "high" risk level at epoch 1000, when it reality it reached it at epoch 1025.

Instead, I propose the following solution:

  1. Acknowledge that null rounds are a rare event, at least in mainnet. In networks where they're more frequent, they also have lower transaction volumes (e.g. Calibrationnet).
  2. When a null round is detected, fall back to the old "apply ts messages" technique. Since null rounds are rare, the fallback will be applied sparingly and we don't expect it to be problematic. We could potentially cache these roots, but probably not worth it.
  3. For everything else, use the successor tipset ParentStateRoot.

raulk avatar Apr 16 '24 17:04 raulk

So... I think there's a much bigger problem: invisible reorgs. I can call eth_call on the same tipset multiple times but get different results depending on whether or not the tipset is followed by a null round. I.e.:

  1. I have synced: A <- B.
  2. I call eth_call, and see the result of applying (msgs(a)..., cron(a), my_message).
  3. I reorg to: A <- Null <- B.
  4. I call eth_call, and see the result of applying (msgs(a)..., my_message).

This is an issue because:

  1. The tipset (and eth block hash) hasn't changed.
  2. Balances can change because cron may have fined some Miner actors.

No solution involving post-cron state will fix this, only "apply all the messages" is correct. I think we may need a lot of caching (and ideally an on-disk cache).

Stebalien avatar Apr 17 '24 19:04 Stebalien

Ah, nvm. We're all making some incorrect assumptions about cron. The ParentStateRoot does not include cron catch up. When we execute a tipset:

  1. We execute cron catch up for null rounds.
  2. Execute the tipset (current epoch)
  3. Execute one cron round (current epoch).

So:

So... I think there's a much bigger problem: invisible reorgs. I can call eth_call on the same tipset multiple times but get different results depending on whether or not the tipset is followed by a null round. I.e.:

This isn't an issue because ParentStateRoot does not depend on the number of trailing null tipsets.

If we fix this by performing the eth_call on the ParentStateRoot of the first non-null round (height 1051), we will mislead the application into thinking that the miner reached the "high" risk level at epoch 1000, when it reality it reached it at epoch 1025.

We'll blame epoch 1051 which is, IMO, correct.


I think the answer is to always use the next block's ParentStateRoot and just accept that eth_call will apply messages after cron. This won't be entirely accurate because the call might have produced different results if we had executed it as a part of the block but... given how users actually use eth_call, we should be fine.

Stebalien avatar Apr 18 '24 17:04 Stebalien