forky: Fixed Chunk Data Size Store
This PR changes the chunk data disk persistence structure. It is based on the previous experiment https://github.com/janos/forky, where multiple approaches were tried out; the most performant one was chosen and added to swarm as the storage/fcds package.
This PR includes the additional changes required for the new storage to be used and, optionally, manually migrated.
FCDS is integrated into storage/localstore
without breaking changes to the package API. That involves:
- removing usage of retrievalDataIndex, except to provide export functionality for the older db schema
- replacing retrievalAccessIndex with metaIndex, which contains the storage timestamp, access timestamp and bin ID
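The meta index entry described above could be encoded roughly like this. The field layout (three big-endian 8-byte values) and the function names are assumptions for illustration, not the PR's actual encoding:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// metaEntry is a hypothetical sketch of a metaIndex value: the three
// pieces of per-chunk metadata named in the PR description.
type metaEntry struct {
	StoreTimestamp  int64  // when the chunk was stored
	AccessTimestamp int64  // when the chunk was last accessed
	BinID           uint64 // per-bin sequence number
}

// encodeMeta serializes an entry as three big-endian 8-byte values.
func encodeMeta(m metaEntry) []byte {
	b := make([]byte, 24)
	binary.BigEndian.PutUint64(b[0:8], uint64(m.StoreTimestamp))
	binary.BigEndian.PutUint64(b[8:16], uint64(m.AccessTimestamp))
	binary.BigEndian.PutUint64(b[16:24], m.BinID)
	return b
}

// decodeMeta reverses encodeMeta.
func decodeMeta(b []byte) metaEntry {
	return metaEntry{
		StoreTimestamp:  int64(binary.BigEndian.Uint64(b[0:8])),
		AccessTimestamp: int64(binary.BigEndian.Uint64(b[8:16])),
		BinID:           binary.BigEndian.Uint64(b[16:24]),
	}
}

func main() {
	e := metaEntry{StoreTimestamp: 1, AccessTimestamp: 2, BinID: 3}
	fmt.Println(decodeMeta(encodeMeta(e)) == e) // prints true
}
```

Merging the three values into one index value means a single lookup serves both garbage-collection accounting and sync ordering, instead of the two indexes used before.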
At the roundtable discussion, it was decided to make the data migration optional and manual. To achieve this, step-by-step instructions are printed when the new swarm version is started against data in the older format. LocalStore migrations are adjusted to support this behaviour. Some issues in the getMigrations function were discovered; they are fixed in this PR and covered by additional tests. Schema name variables are now unexported, and the names from the legacy ldbstore are removed from migrations, as they do not need to be there.
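The manual-migration gate described above can be sketched as follows. This is an illustrative sketch only: the schema names, the instruction wording, and the function are hypothetical, standing in for the PR's actual startup check that prints migration steps instead of migrating automatically:

```go
package main

import "fmt"

// migrationMessage is a hypothetical sketch of the startup gate: when
// the on-disk schema differs from the one the binary expects, return
// manual migration instructions and refuse to start, instead of
// migrating the data automatically.
func migrationMessage(diskSchema, currentSchema string) (msg string, ok bool) {
	if diskSchema == currentSchema {
		return "", true
	}
	// The steps below are illustrative placeholders, not the PR's
	// exact instructions; paths and commands are elided.
	return fmt.Sprintf(
		"your data is in an older format (%q, current %q); to migrate manually:\n"+
			"  1. export your data with the old binary\n"+
			"  2. remove the old data directory\n"+
			"  3. import the exported data with the new binary",
		diskSchema, currentSchema), false
}

func main() {
	if msg, ok := migrationMessage("old-schema", "new-schema"); !ok {
		fmt.Println(msg)
	}
}
```

Keeping the check as a pure function of the two schema names makes the behaviour easy to cover with the kind of additional tests this PR adds for getMigrations.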
For the migration to be complete, the localstore import and export functions needed to handle pinning information. This functionality is added.
Measurements
Measurements were performed locally on a 4-core MacBook Pro (mid 2014). Every run starts from a clean swarm data directory and uploads files with random data.
Local test1 - 1GB size - 5x speedup
```
time ./swarm.master up test1
c4e105675ab10acc907ac4e966aa0359eb0accc5fe5bd03dc15d8161e5fa1dda
./swarm.master up test1 0.98s user 1.85s system 1% cpu 3:52.39 total

time ./swarm.fcds up test1
c4e105675ab10acc907ac4e966aa0359eb0accc5fe5bd03dc15d8161e5fa1dda
./swarm.forky up test1 1.01s user 1.94s system 6% cpu 46.629 total
```
Local test4 - 4GB size - 6.5x speedup
```
time ./swarm.master up test4
b2c1bae070933e5c46d1f839340e2ea33c77469e1c8210691cbe0ed79b211506
./swarm.master up test4 3.91s user 7.37s system 0% cpu 26:27.17 total

time ./swarm.fcds up test4
b2c1bae070933e5c46d1f839340e2ea33c77469e1c8210691cbe0ed79b211506
./swarm.forky up test4 4.30s user 8.50s system 5% cpu 4:06.79 total
```
Smoke tests on cluster
Smoke tests were run multiple times for validation and to measure performance. However, the performance gain depends on the number of CPUs the ec2 node has and on how many swarm nodes run on the same ec2 node.
These are the results from running one swarm node on a 2-core c5.large: https://snapshot.raintank.io/dashboard/snapshot/dD0JQruCpHpOjvKqwR6YN3ay3GmONLyr. If two swarm processes run on the same node, upload speed is about half. It is noticeable that garbage collection influences performance, and this is an area where further adjustments can be made.
I am confirming that smoke tests pass with the current state of this PR: https://snapshot.raintank.io/dashboard/snapshot/BDyvSKICIk5a0AbAjXxBeJfgvEvvpiqP. These results are with a deployment of 3 swarm pods per c5.large ec2 node. The first two failures occurred because the smoke test job started before all swarm nodes in the cluster were up.
Thanks, @santicomp2014. I have updated this branch with the current master and pushed a docker image to janos/swarm:fcds, for testing.
also please do not merge until the next release is out
@acud could you reject this PR until the next release is out, to block merging that way? :)
Based on recent discussions and the decision to go with a more reliable and in general better approach, using Badger for chunk data storage, I am closing this PR. Thank you all for investing time to improve, test and review this PR, and I am sorry for a long attempt that was identified so late as not the best approach to improving storage performance. At least the high-level design, the localstore integration and the migration part could be reused.
Please don't delete the branch. This should be used to test performance with the badger.
@jmozah Of course. I deliberately did not delete the branch.
Hi, could you please summarize why the approach was abandoned? Not everyone is able to join these discussions, and it would still be nice to have it documented here.