Fix `Filecoin.StateReplay` test (or implementation)

Open LesnyRumcajs opened this issue 1 year ago • 9 comments

Issue summary

The method was disabled in https://github.com/ChainSafe/forest/pull/4429. Once fixed, it should be removed from the filter lists.

Other information and links

LesnyRumcajs avatar Jun 20 '24 12:06 LesnyRumcajs

Payload to reproduce the failure:

{"jsonrpc":"2.0","id":0,"method":"Filecoin.StateReplay","params":
[[{"/":"bafy2bzacebjderqonlhtcp42rxldzndqm7tkfhuq62qauqseig2kur4tq6qz4"},{"/":"bafy2bzaceaaro3br2ytfoxhjp5fh26b6u6w4uzy6hvx5g3vzkfb6v3twjjtmk"},{"/":"bafy2bzacebqp4axkcpphgt36hhufcjqni6q62ggjfrn2iafjzsjdjdbwyzdby"}],{"/":"bafy2bzaceaekui2v37hfn54wgijzquegmwfwthug4cp3fvvgidpzbdvepthje"}]
}
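
A minimal sketch of sending this payload to both nodes, in case it helps with reproducing the failure. The helper names are hypothetical, and the `/rpc/v1` path is an assumption about how the nodes expose JSON-RPC over HTTP; the ports match the `forest-tool api compare` invocation later in this thread. Adjust both to your setup.

```python
import json
import urllib.request


def build_state_replay_request(tipset_key, message_cid, request_id=0):
    """Build a Filecoin.StateReplay JSON-RPC 2.0 request body.

    tipset_key is a list of block CID strings, message_cid is a CID string.
    Each CID is wrapped as {"/": cid}, matching the payload above.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "Filecoin.StateReplay",
        "params": [
            [{"/": cid} for cid in tipset_key],
            {"/": message_cid},
        ],
    }


def post_json_rpc(url, payload, timeout=60):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (assumed endpoints; fill in the CIDs from the payload above):
# payload = build_state_replay_request([...], "...")
# forest = post_json_rpc("http://127.0.0.1:2345/rpc/v1", payload)
# lotus = post_json_rpc("http://127.0.0.1:1234/rpc/v1", payload)
```

The two responses can then be saved to disk or compared directly.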

The only sensible diff:

   "MsgRct": {
-    "ExitCode": 0,  \\ forest
+    "ExitCode": 10, \\ lotus
     "Return": null,
     "ReturnCodec": 0
   },

hanabi1224 avatar Jun 20 '24 12:06 hanabi1224

The only sensible diff (part of the FVM execution trace) comes from FVM. The root cause might be that Forest is using [email protected] while lotus-v1.17 is using [email protected] via filecoin-ffi. The issue might be resolved automatically once both Forest and Lotus switch to [email protected].

@LesnyRumcajs, do you think it would be a proper mitigation if we just ignore the ExitCode field for now? There's very little we can do about the only diff in the responses for this failure.

Note on exit code 10:

/// An internal VM assertion failed.
pub const SYS_ASSERTION_FAILED: ExitCode = ExitCode::new(10);

Note 2: the exit code is 0 (success) for Forest on both [email protected] and [email protected]. It might be a regression in [email protected], but it takes non-trivial effort to verify this in Forest with [email protected].

hanabi1224 avatar Jun 20 '24 12:06 hanabi1224

@hanabi1224 I am not convinced this is a proper fix. The goal of those tests is to ensure response parity between Forest and Lotus at specific versions, and not partial parity. I'd rather the test either fails or not, be it from a single byte diff or the entire payload. If there is no parity between Forest and Lotus at one point (like now), we must know it.

LesnyRumcajs avatar Jun 20 '24 13:06 LesnyRumcajs

@LesnyRumcajs For this particular RPC method, the response is so large (168 KB) and complicated that I have to attach the files. Everything matches except this exit_code field. We could also wait for a Lotus release with [email protected] and check again.

forest.json lotus.json
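
For responses this large, a small structural diff can confirm that a single field is the only mismatch. The sketch below is a generic recursive JSON comparison, not the logic `forest-tool api compare` uses; the function name and the `$`-rooted path format are made up for illustration.

```python
import json

# Sentinel marking a key or list index that exists on only one side.
MISSING = object()


def diff_json(left, right, path="$"):
    """Yield (path, left_value, right_value) for every position that differs."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Walk the union of keys so one-sided keys are reported too.
        for key in sorted(left.keys() | right.keys()):
            yield from diff_json(
                left.get(key, MISSING), right.get(key, MISSING), f"{path}.{key}"
            )
    elif isinstance(left, list) and isinstance(right, list):
        # Compare element by element; extra elements show up as MISSING.
        for index in range(max(len(left), len(right))):
            left_item = left[index] if index < len(left) else MISSING
            right_item = right[index] if index < len(right) else MISSING
            yield from diff_json(left_item, right_item, f"{path}[{index}]")
    elif type(left) is not type(right) or left != right:
        yield path, left, right


# Example usage with the attached files:
# with open("forest.json") as f, open("lotus.json") as g:
#     for path, forest_value, lotus_value in diff_json(json.load(f), json.load(g)):
#         print(path, forest_value, lotus_value)
```

If the responses really differ only in the exit code, the loop prints paths ending in `ExitCode` and nothing else.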

hanabi1224 avatar Jun 20 '24 14:06 hanabi1224

I'd rather we wait for a Lotus FVM bump and if it still doesn't match, we investigate it further.

LesnyRumcajs avatar Jun 20 '24 14:06 LesnyRumcajs

On a side note, did you try it with a Lotus from ~1 month ago, before they bumped the FVM?

LesnyRumcajs avatar Jun 20 '24 14:06 LesnyRumcajs

On a side note, did you try it with a Lotus from ~1 month ago, before they bumped the FVM?

@LesnyRumcajs good point, I will do that tomorrow

hanabi1224 avatar Jun 20 '24 14:06 hanabi1224

Interestingly, I tried with a local build of Lotus 1.27.0 ([email protected]), and the exit code is also 10, which differs from Forest.

hanabi1224 avatar Jun 20 '24 15:06 hanabi1224

Some feature flags, FVM options?

LesnyRumcajs avatar Jun 20 '24 15:06 LesnyRumcajs

Can't reproduce it on a recent snapshot. Do we have one available for testing?

forest-tool api compare --lotus /ip4/127.0.0.1/tcp/1234/http --forest /ip4/127.0.0.1/tcp/2345/http forest_snapshot_calibnet_2025-01-29_height_2361215.forest.car.zst  --filter StateReplay -n 1800
| RPC Method                 | Forest | Lotus |
|----------------------------|--------|-------|
| Filecoin.StateReplay (316) | Valid  | Valid |

elmattic avatar Jan 30 '25 15:01 elmattic

@elmattic, let's mark Filecoin.StateReplay as fixed and re-open this issue with an offending snapshot uploaded if it happens again.

LesnyRumcajs avatar Jan 30 '25 15:01 LesnyRumcajs

Note that we've updated our shim here to use V4 for exit codes, so that might have fixed something.

https://github.com/ChainSafe/forest/pull/4991/

elmattic avatar Jan 30 '25 15:01 elmattic