Resource hardening

Open lemmih opened this issue 11 months ago • 1 comments

Summary

Executing Filecoin transactions is computationally expensive. Caches are vital for performance but consume memory. We rely on the p2p swarm, but any individual peer may be unreliable. At the time of writing, Forest has no clear bounds on computation, memory, or network usage.

Motivation

Forest operators would benefit from clear requirements. Additionally, having bounded requirements is essential for security and resisting attacks.

Tasks

[x] Enumerate all caches, justify their size, and minimize if possible.
[x] Decouple the swarm from the execution layer.
[ ] Make sure only validated data is committed to the database.
[ ] List all places where data is written to our database.
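
One way to make "only validated data is committed" checkable by the compiler is to have the write API accept a wrapper type that only the validation function can construct. The following is a minimal sketch of that idea, not Forest's actual code: `Block`, `Validated`, `validate`, `Store`, and `MAX_PAYLOAD_BYTES` are all hypothetical names, and the checks are placeholders for real proof and signature validation.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified block; Forest's real block types differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Block {
    pub cid: String,
    pub payload: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    EmptyPayload,
    TooLarge { len: usize, max: usize },
}

/// Placeholder limit chosen for illustration only.
pub const MAX_PAYLOAD_BYTES: usize = 1024;

/// Marker wrapper. The field is private, so once this lives in its own
/// module, other code can only obtain a `Validated<Block>` via `validate`.
pub struct Validated<T>(T);

/// Stand-in checks; real validation would verify proofs, signatures, etc.
pub fn validate(block: Block) -> Result<Validated<Block>, ValidationError> {
    if block.payload.is_empty() {
        return Err(ValidationError::EmptyPayload);
    }
    if block.payload.len() > MAX_PAYLOAD_BYTES {
        return Err(ValidationError::TooLarge {
            len: block.payload.len(),
            max: MAX_PAYLOAD_BYTES,
        });
    }
    Ok(Validated(block))
}

/// In-memory stand-in for the database.
#[derive(Default)]
pub struct Store {
    blocks: HashMap<String, Vec<u8>>,
}

impl Store {
    /// The only write path: it cannot be called with an unvalidated block.
    pub fn put(&mut self, block: Validated<Block>) {
        let Validated(Block { cid, payload }) = block;
        self.blocks.insert(cid, payload);
    }

    pub fn contains(&self, cid: &str) -> bool {
        self.blocks.contains_key(cid)
    }

    pub fn count(&self) -> usize {
        self.blocks.len()
    }
}
```

With a single `put` entry point like this, listing all places where data is written reduces to listing the callers of `validate`.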

Create adversarial p2p nodes:

[ ] Spam a specific peer with gossip block CIDs. Right now, we synchronously download a block when a p2p peer gives us its CID. This can cause Forest to fall out of sync (and has in the past) because we are stuck waiting on block downloads.
[ ] Spam fork blocks. Blocks from forks have to be kept around for 900 epochs. A spammer might fill our database with junk by sending seemingly valid blocks from forks.
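
A possible mitigation for the gossip-spam case is to put a bounded queue between the gossip handler and the block downloader, so that a flood of CIDs is dropped instead of stalling the node. The sketch below uses only the standard library. `BlockCid`, `GossipIntake`, and `Enqueue` are hypothetical names, and dropping when full is an assumed policy, not something Forest is known to do.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for a block CID; Forest's real type differs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockCid(pub String);

/// Outcome of offering a gossiped CID to the download queue.
#[derive(Debug, PartialEq, Eq)]
pub enum Enqueue {
    Accepted,
    DroppedQueueFull,
    WorkerGone,
}

/// Gossip-side handle. `offer` never blocks the caller.
pub struct GossipIntake {
    tx: SyncSender<BlockCid>,
}

impl GossipIntake {
    /// Creates the intake and the receiving end for a download worker.
    /// `capacity` bounds how many CIDs can be pending at once.
    pub fn new(capacity: usize) -> (Self, Receiver<BlockCid>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Tries to enqueue a CID; drops it if the queue is already full.
    pub fn offer(&self, cid: BlockCid) -> Enqueue {
        match self.tx.try_send(cid) {
            Ok(()) => Enqueue::Accepted,
            Err(TrySendError::Full(_)) => Enqueue::DroppedQueueFull,
            Err(TrySendError::Disconnected(_)) => Enqueue::WorkerGone,
        }
    }
}
```

Because the queue capacity is fixed, the memory held by pending gossip is bounded no matter how many CIDs a peer sends. An adversarial test node could then assert that the drop count rises while sync progress is unaffected.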

Milestones

[ ] Quantify the fixed memory usage of Forest in stateless mode. Forest uses a fixed amount of memory when running in stateless mode. We keep track of a fixed number of peers, cache a fixed number of tipsets, and track a fixed number of forks. From this, we can give tight bounds on the memory Forest is expected to use in a real-world deployment.
[ ] Quantify the memory bounds of Forest in validator mode.
[ ] List all data that Forest retains (blocks, tipsets, messages, etc.), and show that an attacker cannot cheaply spam it (for example, by showing that proofs are validated before blocks are stored in the database). We currently allow database access at nearly every point in the Forest code base. This leaves a large attack surface for adversarial nodes to populate our database with junk. Bloating our database is a form of denial-of-service attack, and we should consider database writes to be security critical.
[ ] Remove all IO from the main event loop and show that the event handler will never block indefinitely. On December 4th, the Infra team reported that Forest would slowly fall out of sync with the network. Investigation showed that a network call in the main event loop caused this: network latency prevented Forest from processing new events and validating tipsets. We worked around the issue by switching to a different network call with a lower average latency. However, Forest may still fall behind if network conditions degrade. A proper fix would be to remove all network requests from the main event loop.
[ ] Create an adversarial node to spam Forest with irrelevant data.
[ ] Create an adversarial node for denial-of-service attacks.
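
The "remove all IO from the main event loop" milestone can be illustrated with a loop that only does non-blocking hand-offs: slow fetches run on a worker thread and report back as ordinary events. This is a minimal standard-library sketch under stated assumptions, not Forest's architecture. `Event`, `LoopStats`, `slow_fetch`, and `run_event_loop` are hypothetical, and the sleep stands in for network latency.

```rust
use std::sync::mpsc::{sync_channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

pub enum Event {
    /// Something arrived that requires a (slow) network fetch.
    FetchRequested(u64),
    /// The worker finished a fetch and reports the result.
    FetchCompleted { id: u64, bytes: usize },
    /// Stop once all accepted fetches have completed.
    Shutdown,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct LoopStats {
    pub completed: Vec<(u64, usize)>,
    pub dropped: usize,
}

/// Hypothetical slow network call; the sleep simulates latency.
fn slow_fetch(id: u64) -> usize {
    thread::sleep(Duration::from_millis(5));
    (id as usize) * 10
}

/// Runs until `Shutdown` has been seen and no fetch is in flight.
/// `loopback` must send into the same channel that `events` reads from.
pub fn run_event_loop(
    events: Receiver<Event>,
    loopback: Sender<Event>,
    max_pending: usize,
) -> LoopStats {
    let (job_tx, job_rx) = sync_channel::<u64>(max_pending);
    // All blocking work happens on this thread, never in the loop below.
    let worker = thread::spawn(move || {
        for id in job_rx {
            let bytes = slow_fetch(id);
            if loopback.send(Event::FetchCompleted { id, bytes }).is_err() {
                break;
            }
        }
    });

    let mut stats = LoopStats::default();
    let mut in_flight = 0usize;
    let mut shutting_down = false;
    for event in events {
        match event {
            // `try_send` returns immediately; excess requests are dropped.
            Event::FetchRequested(id) => match job_tx.try_send(id) {
                Ok(()) => in_flight += 1,
                Err(_) => stats.dropped += 1,
            },
            Event::FetchCompleted { id, bytes } => {
                in_flight -= 1;
                stats.completed.push((id, bytes));
            }
            Event::Shutdown => shutting_down = true,
        }
        if shutting_down && in_flight == 0 {
            break;
        }
    }
    drop(job_tx); // closes the job queue so the worker exits
    worker.join().expect("worker thread panicked");
    stats
}
```

Each event handler arm here does a constant amount of non-blocking work, so fetch latency delays only the fetch results, not the processing of other events. The bounded job queue also caps how much pending work a spammer can create.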

Risks & Dependencies

Additional Links & Resources

Related: #2469

Jan 20 '25 11:01 lemmih

This might also be somewhat related: https://github.com/ChainSafe/forest/issues/2469.

Jan 20 '25 14:01 elmattic