snarkOS icon indicating copy to clipboard operation
snarkOS copied to clipboard

fix: limit outstanding certificate requests

Open joske opened this issue 1 year ago • 5 comments

This attempts to fix https://github.com/AleoHQ/snarkOS/issues/2935.

I was able to reproduce using the following malicious code (credit @niklaslong): https://github.com/niklaslong/snarkOS/tree/tmp/2935-replication

Run one node with this code, and the rest with normal testnet3. After 20 rounds, the malicious node will send fake certificate requests.

With this fix, the nodes reject the bad certs after the max outstanding requests of 20 is reached, and then carry on, before all nodes would crash with stack overflow.

The MAX_OUTSTANDING_CERTIFICATE_REQUESTS is chosen arbitrarily at 20, please change as you see fit.

joske avatar Dec 27 '23 12:12 joske

I am the reporter of #2935. The root cause of the issue is that the validator didn't verify the certificate before entering recursion. I think we can resolve the issue more directly:

  1. Upon receiving a BatchPropose, verify the direct parent certificates are all valid.
  2. Upon receiving a Certificate through broadcast, immediately verify if it is valid.

Also, we may support Batch Certificate Request and avoid using recursion.

randomsleep avatar Dec 28 '23 03:12 randomsleep

I am the reporter of #2935. The root cause of the issue is that the validator didn't verify the certificate before entering recursion. I think we can resolve the issue more directly:

  1. Upon receiving a BatchPropose, verify the direct parent certificates are all valid.
  2. Upon receiving a Certificate through broadcast, immediately verify if it is valid.

Also, we may support Batch Certificate Request and avoid using recursion.

Verify the BatchPropose or BatchCertificate rely on the previous certificates. And we don't know the committee has changed or not when receive a BatchPropose or BatchCertificate.

ghostant-1017 avatar Dec 28 '23 07:12 ghostant-1017

I am the reporter of #2935. The root cause of the issue is that the validator didn't verify the certificate before entering recursion. I think we can resolve the issue more directly:

  1. Upon receiving a BatchPropose, verify the direct parent certificates are all valid.
  2. Upon receiving a Certificate through broadcast, immediately verify if it is valid.

Also, we may support Batch Certificate Request and avoid using recursion.

Verify the BatchPropose or BatchCertificate rely on the previous certificates. And we don't know the committee has changed or not when receive a BatchPropose or BatchCertificate.

Yes, this is a problem for the current implementation.

I submitted another bug report before, which is about the committee update. The report shows that updating the committee at every block will cause some consensus issues. Maybe we can fix that issue first to make it easier to validate the certificate.

randomsleep avatar Dec 28 '23 09:12 randomsleep

@randomsleep @ghostant-1017 Is the issue still present in latest commit of mainnet-staging? If there is a better design, we can close this PR and work on a new issue.

howardwu avatar Mar 08 '24 23:03 howardwu

@howardwu I believe this issue still exists in mainnet-staging. Now the committee is fixed for every round. We can address this issue by:

  1. Upon receiving a Certificate, immediately check if it has reached quorum before recursion. If the Certificate is the parent of an already checked Certificate, we can skip the check.
  2. Upon receiving a BatchPropose, immediately perform some checks before recursion. Make sure the direct parent Certificate has reached quorum and parent_round == batch_propose_round + 1

randomsleep avatar Mar 09 '24 09:03 randomsleep