ref-fvm

support chain lookbacks in test vectors

Open raulk opened this issue 3 years ago • 10 comments

This doesn't really belong in this repo, but capturing here to track somewhere. We can move it elsewhere, or break into various issues in various repos for implementation.

Problem

Message-class test vectors don't support ReportConsensusFault messages. Verifying consensus faults relies on the "lookback getter" mechanism in Lotus. The implementation requires a base tipset, and it walks the chain backwards until the desired epoch. It then loads the state tree for that epoch.

If we use that logic as-is at message extraction, we would end up with a very heavy test vector, because in order to get from the head to the desired tipset, we'd have to walk every intermediate tipset.

We also don't have a place in the test vector schema to specify the base tipset. That data is also out of scope for message-class test vectors.

Potential solution

At extraction time

  1. When extracting the test vector, use a special LookbackGetter that records the input and the root CID of the state tree that was returned. Do this in a way similar to how we record Randomness (with a special randomness recorder).
  2. As the state tree is accessed, the blockstore tracer already records all blocks into the local blockstore.
  3. Enhance the test vector schema to store recorded lookback responses, similar to how we store recorded randomness. Potentially as a map of epoch => state tree root.
  4. When forming the CAR, make sure to include all blocks that were recorded.

At replay time

  1. When replaying the test vector, use a special LookbackGetter that returns a loaded state tree with the root CID that was found in the vector.
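A minimal sketch of the recording/replaying getter pair described above. All type and function names here (`LookbackGetter`, `RecordingGetter`, `ReplayGetter`) are illustrative assumptions, not the real Lotus APIs:

```go
package main

import "fmt"

// Cid and StateTree are stand-ins for the real Lotus types.
type Cid string

type StateTree struct{ Root Cid }

// LookbackGetter resolves the state tree for a given epoch.
type LookbackGetter func(epoch int64) (StateTree, error)

// RecordingGetter wraps the real getter at extraction time and memoizes
// epoch => state tree root, mirroring how randomness is recorded.
type RecordingGetter struct {
	Inner    LookbackGetter
	Recorded map[int64]Cid
}

func (r *RecordingGetter) Get(epoch int64) (StateTree, error) {
	st, err := r.Inner(epoch)
	if err != nil {
		return StateTree{}, err
	}
	r.Recorded[epoch] = st.Root
	return st, nil
}

// ReplayGetter serves the roots recorded in the vector at replay time,
// loading state trees directly instead of walking the chain.
func ReplayGetter(recorded map[int64]Cid) LookbackGetter {
	return func(epoch int64) (StateTree, error) {
		root, ok := recorded[epoch]
		if !ok {
			return StateTree{}, fmt.Errorf("no recorded lookback for epoch %d", epoch)
		}
		return StateTree{Root: root}, nil
	}
}

func main() {
	// Extraction: record what the "real" getter returns.
	real := func(epoch int64) (StateTree, error) {
		return StateTree{Root: Cid(fmt.Sprintf("root-%d", epoch))}, nil
	}
	rec := &RecordingGetter{Inner: real, Recorded: map[int64]Cid{}}
	rec.Get(100)

	// Replay: serve the recorded root without any chain walk.
	replay := ReplayGetter(rec.Recorded)
	st, _ := replay(100)
	fmt.Println(st.Root)
}
```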

raulk avatar Mar 10 '22 23:03 raulk

Could we memoize the expected result from the extern? I.e.:

  1. How much gas the operation takes internally.
  2. Whether there was or was not a fault.
  3. Who was at fault.
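The three memoized outputs could be captured in a small record like this; field names are hypothetical, not part of any existing schema:

```go
package main

import "fmt"

// ConsensusFaultResult is a hypothetical memoized result of the
// VerifyConsensusFault extern, capturing the three outputs listed above.
type ConsensusFaultResult struct {
	GasUsed int64  // gas the operation took internally
	Faulted bool   // whether there was a fault
	Actor   string // who was at fault (empty if no fault)
}

func main() {
	r := ConsensusFaultResult{GasUsed: 1234, Faulted: true, Actor: "f01000"}
	fmt.Printf("%+v\n", r)
}
```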

Stebalien avatar Mar 11 '22 00:03 Stebalien

There's no gain in that:

  1. It would make us lose test coverage of VerifyConsensusFault, which is undesirable.
  2. We still have to modify the test vector schema.

raulk avatar Mar 11 '22 00:03 raulk

It would make us lose test coverage of VerifyConsensusFault, which is undesirable.

We don't currently have a verify consensus fault implementation inside the FVM.

Stebalien avatar Mar 11 '22 01:03 Stebalien

But we're changing how it's wired on the client side: https://github.com/filecoin-project/lotus/pull/8293/files#diff-9c51c6db9b9f6bea9717d162a7f80db6e0e144032849d83281750e3226d17ccaR45

So I'd rather exercise that code in the fullest form possible than stub it out.

raulk avatar Mar 11 '22 01:03 raulk

Ah, wait, you're planning on executing these test vectors from lotus? Then yes, I agree. I was focused on test vectors for the FVM itself.

Stebalien avatar Mar 11 '22 02:03 Stebalien

I think this proposal is reasonable. Really the answer is "do whatever we do for randomness". Randomness has a much further lookback than this particular chain lookback, so this is a strictly easier problem.

@raulk If I can interest you in a much hackier suggestion to immediately unblock this: Return the vector's base state. This getter we're missing is only used by the VM to determine what a faulted miner's worker key was at the lookback tipset. This lookback is also gas-less (it uses the state manager directly).

That means that unless any of the messages you're running into involve a miner that changed their worker key between the fault and the report, just returning the base state is totally fine. The likelihood of a change happening is also very low because our fault detectors work basically instantaneously (so the "confirm worker key" message would have like 1-2 epochs to land between the fault occurrence and the report).

It's a hack, but it's not worse than what we have right now.
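The workaround amounts to a getter that ignores the requested epoch entirely. A sketch, with illustrative names, assuming the only caveat is the worker-key-change window noted above:

```go
package main

import "fmt"

// StateTree and LookbackGetter are stand-ins for the real Lotus types.
type StateTree struct{ Root string }

type LookbackGetter func(epoch int64) (StateTree, error)

// BaseStateGetter always returns the vector's base state, regardless of
// the requested lookback epoch. Correct only if the faulted miner's
// worker key did not change between the fault and the report.
func BaseStateGetter(base StateTree) LookbackGetter {
	return func(epoch int64) (StateTree, error) {
		// Deliberately ignore `epoch`.
		return base, nil
	}
}

func main() {
	get := BaseStateGetter(StateTree{Root: "base-root"})
	st, _ := get(500)
	fmt.Println(st.Root)
}
```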

arajasek avatar Mar 11 '22 15:03 arajasek

That's an interesting idea, might be good enough for now, but rather dangerous for a system that's intended to verify correctness unequivocally.

Re: similarities to randomness, we do stub those calls with recorded final outputs, but it's a bit different to the verify consensus fault syscall:

  • in the case of beacon randomness, it is entirely externally sourced so there's little logic to test inside the client
  • in the case of chain randomness, even though it is internally sourced (i.e. the Filecoin chain), there's also little logic to test inside the client

I am inclined to record two things inside the vector:

  1. The final output from verify consensus fault (as @Stebalien suggests) -- for testers that do not have a full client at their disposal, e.g. the conformance tests in the FVM
  2. The result from the lookback (the root CID of the state tree) -- for testers that do have a full client at their disposal, e.g. Lotus.

raulk avatar Mar 11 '22 15:03 raulk

Re: similarities to randomness, we do stub those calls with recorded final outputs, but it's a bit different to the verify consensus fault syscall:

I'm not sure I understand / agree. They're all fundamentally testing the same aspect of a chain -- "did you walk up the chain correctly and stop at the right point?" This is the case for the lookback tipset, chain randomness, and beacon randomness (which, within a Filecoin client, is sourced from the chain too). From my perspective, it's:

  • Beacon randomness: Walk up the chain, stop at the right point, load the block headers, and draw the right beacon out of them
  • Chain randomness: Walk up the chain, stop at the right point, load the block headers, and draw the right ticket out of them
  • Lookback state: Walk up the chain, stop at the right point, load the block headers, and return the right Parent State Root

And all of this is gas-less, so they seem completely analogous to me.
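The shared pattern in the three bullets above can be sketched as one chain walk followed by reading a different field; the tipset shape here is illustrative, not the real Lotus type:

```go
package main

import "fmt"

// TipSet is a toy stand-in: each of the three lookbacks walks back to
// the target epoch, then draws a different field from what it finds.
type TipSet struct {
	Epoch           int64
	Parent          *TipSet
	ParentStateRoot string // lookback state
	Ticket          string // chain randomness
	BeaconEntry     string // beacon randomness
}

// walkBack walks from head toward genesis and stops at the first tipset
// at or below the target epoch (null rounds can skip epochs).
func walkBack(head *TipSet, target int64) *TipSet {
	ts := head
	for ts.Parent != nil && ts.Epoch > target {
		ts = ts.Parent
	}
	return ts
}

func main() {
	genesis := &TipSet{Epoch: 0, ParentStateRoot: "root-0"}
	mid := &TipSet{Epoch: 5, Parent: genesis, ParentStateRoot: "root-5", Ticket: "t5", BeaconEntry: "b5"}
	head := &TipSet{Epoch: 10, Parent: mid, ParentStateRoot: "root-10"}

	ts := walkBack(head, 5)
	// Same walk; three different reads.
	fmt.Println(ts.ParentStateRoot, ts.Ticket, ts.BeaconEntry)
}
```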

arajasek avatar Mar 11 '22 16:03 arajasek

Re: similarities to randomness, we do stub those calls with recorded final outputs, but it's a bit different to the verify consensus fault syscall:

I was comparing to stubbing the verify consensus fault syscall (what Steb was suggesting). By recording and replaying that call, we lose coverage of the logic of that call on testers that support it.

raulk avatar Mar 11 '22 16:03 raulk

@arajasek The workaround worked. Let's rely on that for now.

raulk avatar Mar 11 '22 16:03 raulk