
Split up content databases (kvstores) per network

Open · kdeme opened this issue 2 years ago · 6 comments

Currently the quickest, simplest approach is taken and everything is stored in one table / kvstore. However, this will not scale once we are dealing with lots of data. This issue is about how to split up this storage, basically this comment: https://github.com/status-im/nimbus-eth1/blob/master/fluffy/content_db.nim#L24

I think approach 1 mentioned there is probably the most straightforward path to take, but some investigation to better understand the implications of the other approaches is allowed ;-).
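
For illustration, a minimal, self-contained Nim sketch of the two layouts under discussion, using an in-memory table as a stand-in for the real sqlite-backed kvstore. All names (NetworkId, SharedDB, PerNetworkDB) are invented for this sketch and are not fluffy's actual types.

```nim
import std/[hashes, tables]

type
  NetworkId = enum
    history, state, beacon

  # Status quo: one kvstore shared by all networks.
  SharedDB = object
    kv: Table[seq[byte], seq[byte]]

  # Split per network: each network gets its own store, whether that is
  # a separate table in one database or a separate database file.
  PerNetworkDB = object
    stores: Table[NetworkId, Table[seq[byte], seq[byte]]]

proc put(db: var SharedDB, key, value: seq[byte]) =
  db.kv[key] = value

proc put(db: var PerNetworkDB, network: NetworkId, key, value: seq[byte]) =
  db.stores.mgetOrPut(network, initTable[seq[byte], seq[byte]]())[key] = value
```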

kdeme avatar May 12 '22 15:05 kdeme

I thought about it a bit and I like option 3 the most tbh. My reasons for it:

  • having a db per network makes it possible to delegate the handling of each network's radius to its database code, which makes the code a bit nicer (see the sketch below).
  • having a db per network discourages sharing data through the database, and encourages sharing data between networks through well-defined interfaces.
  • having one database makes the pruning code a bit more tricky, as then we need to know how much data each table is taking, which would probably force us to keep more metadata in the db.

Disadvantages are:

  • we have more files on disk (for 5 networks it will be 5 files, so imo it is not a big deal).
  • to know how much data we store we need to call db.size a few times and sum the values, although that is not really a problem, as db limits should be defined on a per-network basis.
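
A rough sketch of the first reason, assuming a simple "shrink radius when over quota" rule (fluffy's real pruning logic is more involved): the per-network store tracks its own size and owns its own radius, so no cross-network metadata is needed.

```nim
import std/[hashes, tables]

type
  ContentStore = object   # hypothetical per-network store
    kv: Table[seq[byte], seq[byte]]
    usedBytes: int        # tracked locally, no shared metadata table needed
    quotaBytes: int
    radius: uint64        # the radius lives with the store it governs

proc put(s: var ContentStore, key, value: seq[byte]) =
  s.kv[key] = value
  s.usedBytes += key.len + value.len
  # Toy rule: once over quota, halve the radius. A real implementation
  # would also delete the content now falling outside the new radius.
  if s.usedBytes > s.quotaBytes:
    s.radius = s.radius div 2
```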

KonradStaniec avatar May 13 '22 12:05 KonradStaniec

> I thought about it a bit and I like option 3 the most tbh. My reasons for it:

I don't think any of those reasons apply only to having a separate database, do they? At first sight, it looks like they could also be abstracted away with a separate object for each network (call it ContentStore, or so).
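
To illustrate the point, the same ContentStore API could sit in front of one shared database, e.g. via a per-network key prefix. A hedged sketch, again with an in-memory stand-in and a prefix scheme made up for illustration:

```nim
import std/[hashes, tables]

type
  SharedKv = ref Table[seq[byte], seq[byte]]   # the single underlying database

  ContentStore = object   # per-network facade over the shared kvstore
    db: SharedKv
    prefix: byte          # e.g. one byte per network / table

proc put(s: ContentStore, key, value: seq[byte]) =
  s.db[][@[s.prefix] & key] = value

proc get(s: ContentStore, key: seq[byte]): seq[byte] =
  s.db[].getOrDefault(@[s.prefix] & key)
```

With this shape, whether the stores share one database or each own a file becomes an implementation detail behind the same interface.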

Good point on how pruning does in general get more complicated across different networks, or when a specific network gets added. Not sure if I'd opt for different radius handling per network (at least not if it makes things much more complex), as I'm not sure there is a need for this?

Anyway, these are exactly the things that need to be sketched out better I think before we make too many changes.

kdeme avatar May 16 '22 15:05 kdeme

FYI, similar (albeit not the same) technical question: https://github.com/status-im/nimbus-eth2/blob/039bece9175104b5c87a8c2ff6b1eafae731b05e/beacon_chain/validators/slashing_protection_v2.nim#L119

kdeme avatar May 23 '22 12:05 kdeme

> Not sure if I'd opt for different radius handling per network (at least not if it makes things much more complex), as I'm not sure there is a need for this?

It may be more complex to have one global radius: different networks have different sizes, so to adjust a global radius we would need to take that into account somehow, to avoid one type of data monopolizing the node's storage. With a radius per network, we keep the same size-proportional logic everywhere.
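
A toy calculation of the monopolization concern (the volume figures are assumed, not measurements): with one global radius, the data stored per network is proportional to that network's total content volume.

```nim
const
  totalHistoryGb = 100.0        # hypothetical network-wide content volume
  totalStateGb = 1000.0         # hypothetical: 10x more state than history
  globalRadiusFraction = 0.01   # node covers 1% of each network's id space

echo "history stored: ", totalHistoryGb * globalRadiusFraction, " GB"  # 1 GB
echo "state stored:   ", totalStateGb * globalRadiusFraction, " GB"    # 10 GB
# -> with a single radius, state data dominates the node's storage.
#    Per-network radii give each network its own quota instead.
```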

> Anyway, these are exactly the things that need to be sketched out better I think before we make too many changes.

Having a db per network probably incurs the fewest changes for now (as we have one working network); it is just a question of initializing the db in the history network constructor instead of in fluffy's main. With multiple kvstores we would also need to update the queries and the calculation of db sizes. Configs would need to be updated in both approaches, as in both of them the user should configure different sizes for different networks (at least I think so).
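
For concreteness, a sketch of the constructor change this would amount to; the types and procs here are illustrative stand-ins, not fluffy's real API:

```nim
import std/[hashes, tables]

type
  ContentDB = ref Table[seq[byte], seq[byte]]
  HistoryNetwork = object
    db: ContentDB

proc newContentDB(path: string): ContentDB =
  # the real store would open a sqlite file at `path`
  new(result)

proc newHistoryNetwork(dataDir: string): HistoryNetwork =
  # the db is initialized here, instead of in fluffy's main and passed in
  HistoryNetwork(db: newContentDB(dataDir & "/history"))
```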

I wonder, maybe we should delay making the decision until we have another network, and some endpoint which gets data from both of them, then implement a proof of concept for both approaches and see which one we like more?

KonradStaniec avatar Jun 03 '22 07:06 KonradStaniec

> It may be more complex to have one global radius: different networks have different sizes, so to adjust a global radius we would need to take that into account somehow, to avoid one type of data monopolizing the node's storage.

Different networks will indeed have different sizes, but I think that is fine. A node's storage ratios per network would ideally represent the networks' total storage ratios. This shouldn't be an issue for a global radius as long as the content of each network is evenly distributed over the id space (which it should be). A use case for setting different network sizes would perhaps be a user who wants all of / a lot of the data of one specific network stored, because they need that specific data continuously for some reason (and with low latency).

> With multiple kvstores we would also need to update the queries and the calculation of db sizes.

Sure, but with different databases you also add some complexity, unless the idea is to just split the total storage evenly across the number of networks, which would probably not be correct; see the comment above.

> I wonder, maybe we should delay making the decision until we have another network,

Sure, we can wait with this. I actually want to add a small second database for the accumulator data, as we can't access this data over the network yet, and I don't want it to be pruned along with the other data. (This will be behind an optional flag at runtime.)

kdeme avatar Jun 03 '22 08:06 kdeme

Related discussion in the Portal Discord raised some interesting points:

  • Usage of the same kvstore / table for networks with different distance functions will give issues in storage & pruning (see the sketch after this list).
  • A shared radius over subprotocols might give issues at lower storage / radius settings: could content from subprotocols with lots of data dominate?
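
A sketch of the first point, using uint64 as a stand-in for the 256-bit content id space; the XOR metric matches Kademlia-style routing and the ring metric is of the kind proposed for other subprotocols, but both are illustrative stand-ins, not the exact spec functions.

```nim
proc xorDistance(a, b: uint64): uint64 =
  a xor b                     # Kademlia-style metric

proc ringDistance(a, b: uint64): uint64 =
  let d = a - b               # unsigned arithmetic wraps, i.e. mod 2^64
  min(d, 0'u64 - d)           # shortest way around the ring

let nodeId = 0x8000_0000_0000_0000'u64
let contentId = 0x7FFF_FFFF_FFFF_FFFF'u64
echo xorDistance(nodeId, contentId)   # enormous: the top bit differs
echo ringDistance(nodeId, contentId)  # 1: immediate neighbours on the ring
# Pruning a shared table by "largest distance first" would therefore evict
# the wrong content for one of the two metrics.
```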

kdeme avatar Jun 08 '22 18:06 kdeme