
s3 datastore

coryschwartz opened this issue 3 years ago · 5 comments

coryschwartz · Mar 02 '21 21:03

We’ve been storing blocks in S3 in a few different projects and have learned quite a lot.

The biggest one I’ll mention here, just so anyone considering implementing this is aware: we should NOT use the bare string of the CID or multihash as a key. We should use the hash-based address (CID or multihash) as a key prefix and append /data to it.
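A minimal sketch of that key scheme (the function name and the example address are mine, not from any existing datastore implementation):

```go
package main

import "fmt"

// blockKey builds an S3 object key from a hash-based address
// (a CID or multihash string). The address itself becomes the key
// prefix and "/data" is appended, so each block lives under its
// own prefix, which is the unit S3 uses to scale request rates.
func blockKey(addr string) string {
	return addr + "/data"
}

func main() {
	// Hypothetical CID, purely for illustration.
	fmt.Println(blockKey("bafybeihashofablock"))
	// → bafybeihashofablock/data
}
```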

Under the hood, S3’s distribution scheme uses key prefixes for data locality. If you dig into the S3 docs you’ll notice that all their read/write limits are specified as being “per prefix”, which means that hash-based prefixes give you incredible read/write parallelism. Once you’ve got a billion blocks in a bucket it performs incredibly well. We hammered S3 with thousands of concurrent Lambda functions in the dumbo drop project; we did see some perf issues while the bucket was scaling up to handle the load, but once we hit a certain point there was no amount of data we couldn’t throw at it.

mikeal · Mar 02 '21 22:03

I had a short foray into a very similar design in November, and ruled out S3 (or any other block-addressable remote store) entirely.

The block-size distribution of the filecoin chain is very heavily skewed towards "very small blocks". And every state (especially recent ones) is composed of a huge number of these very small blocks, upward of 20 million. Without a mechanism to very efficiently bulk-load a large list of blocks in one go into a more appropriate local store, lotus will not perform anywhere close to an acceptable level.
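The bulk-loading requirement could be expressed as an extension of a plain Get/Put blockstore interface. A hedged sketch (interface and method names are mine, not lotus's), with a toy in-memory implementation for illustration:

```go
package main

import "fmt"

// BulkBlockstore is a hypothetical extension of a simple blockstore:
// GetMany fetches a pre-compiled list of keys in one call, which is
// what a remote store like S3 would need in order to warm a local
// cache holding tens of millions of small blocks efficiently.
type BulkBlockstore interface {
	Get(key string) ([]byte, bool)
	GetMany(keys []string) map[string][]byte
}

// memStore is a trivial in-memory implementation for illustration.
type memStore struct{ m map[string][]byte }

func (s *memStore) Get(key string) ([]byte, bool) {
	v, ok := s.m[key]
	return v, ok
}

func (s *memStore) GetMany(keys []string) map[string][]byte {
	out := make(map[string][]byte, len(keys))
	for _, k := range keys {
		if v, ok := s.m[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	s := &memStore{m: map[string][]byte{"a": []byte("x"), "b": []byte("y")}}
	got := s.GetMany([]string{"a", "b", "missing"})
	fmt.Println(len(got)) // prints 2: only the keys actually present
}
```

For a real S3 backend, GetMany is where a batched or parallelized fetch strategy would live, instead of one round trip per block.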

For comparison, my experimental Postgres blockstore uses pre-recorded lists for memory-cache warmup, and loads 35 million objects totaling about 25GiB in about 2 minutes, at about 200MiB/s. From that point on lotus performs somewhat acceptably. Without this ability, simply requesting "a block at a time" would keep me at validation times of ~150s/tipset.

TLDR: the design/backend needs to account for a bulk-list compilation and loading mechanism for this to be viable.

ribasushi · Mar 02 '21 22:03

@ribasushi At least for uploads, you can use multipart uploads. You might be right about retrieval. In my rough try at this, I saw that it could keep up with the network, but I did no real performance analysis.
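For context, S3 multipart uploads require every part except the last to be at least 5 MiB, so the client has to batch data into sizable chunks before sending. A sketch of that client-side chunking, independent of any particular SDK:

```go
package main

import "fmt"

// partSize is S3's minimum size for every multipart part except the last.
const partSize = 5 * 1024 * 1024 // 5 MiB

// splitParts slices a payload into multipart-upload parts: each part
// is exactly partSize bytes, except the final part, which may be shorter.
func splitParts(data []byte) [][]byte {
	var parts [][]byte
	for len(data) > partSize {
		parts = append(parts, data[:partSize])
		data = data[partSize:]
	}
	parts = append(parts, data) // final (possibly short) part
	return parts
}

func main() {
	payload := make([]byte, 12*1024*1024) // a 12 MiB payload
	parts := splitParts(payload)
	fmt.Println(len(parts)) // prints 3: 5 MiB + 5 MiB + 2 MiB
}
```

In practice an SDK upload manager handles this for you; the point is that many small blocks would need to be aggregated into large parts to benefit.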

I suspect the performance is not as good as PostgreSQL's, and the PostgreSQL datastore solves a whole lot of other problems; I think it makes the job of sentinel much easier as well.

To me, the key draw of S3, or a network blob store of some other kind, isn't necessarily performance (although with many clients reading, the performance is decent) but the other benefits: massive storage, and the ease with which machines can be managed and maintained. My thinking when I wrote this was that S3 was probably fast enough for the daemon and would provide a lot of benefits for miners, but perhaps I was wrong about the performance assumption.

It would make sense to me to have an S3 backend for the miner (where data centers are actually running S3-compatible Ceph or something similar) and Postgres for the daemons, where query latency is really important. Running a configuration like this would allow daemons and miners to be restarted on other nodes quickly, with little to no state local to the node.
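For concreteness, the existing go-ds-s3 plugin for go-ipfs mounts S3 via a datastore spec along these lines (the bucket/region values are placeholders, the field set is from memory of the plugin's README, and lotus would need its own wiring; check the go-ds-s3 repo for the exact schema):

```json
{
  "type": "mount",
  "mounts": [
    {
      "mountpoint": "/blocks",
      "type": "measure",
      "prefix": "s3.datastore",
      "child": {
        "type": "s3ds",
        "region": "us-east-1",
        "bucket": "example-block-bucket",
        "rootDirectory": "",
        "accessKey": "",
        "secretKey": ""
      }
    }
  ]
}
```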

coryschwartz · Mar 05 '21 18:03

> In my rough try at this, I saw that it could keep up with the network

@coryschwartz my entire screed is predicated on the above not working. If you got it to run and keep up with mainnet: I am walking a large portion of my comment back. It is very possible that I hit some sort of throttling issue or that in-lotus caching has improved since I first tried.

Given your preliminary findings, investigating this further is definitely warranted!

ribasushi · Mar 05 '21 18:03

Moving this to grants, as we don't expect it to get picked up by w3dt program teams in the short term, but we do see the general utility of this functionality.

BigLep · May 26 '21 20:05