
Research/quantify performance envelopes of multiple CDC algorithms

Open ribasushi opened this issue 4 years ago • 12 comments

  • [ ] oI 95% Assemble corpora of data from various prior performance research initiatives ( both within and outside of PL )
    • [x] 💯 Enumerate/obtain test datasets
    • [ ] 90% Document rationales for the test datasets
    • [ ] 95% Publish all of the above as plain HTTP + IPFS pinned download
  • [ ] oI 85% Document prior art, motivation, and the precise scope and types of metrics sought
    • [x] 💯 Solicit/assemble feedback from various stakeholders
    • [x] 💯 Collect/determine relevance of existing academic research into chunking ( 14 distinct papers selected for evaluation )
    • [x] 💯 Convert the pre-PL chunk-tester to proper multi-streaming, to dramatically lower the cost of experiments ~~( aiming at about 500 megabyte/s stream processing )~~ with the correct implementation and hardware: about 3.5 GiB/s standard ingestion 🎉
    • [ ] 80% Generate a few preliminary data points to aid in understanding the goal/scope
    • [ ] 90% In depth study/evaluation/application of findings from above works
    • [x] 💯 Understand and reuse existing go-ipfs implementations of CDCs ( Rabin + Buzzhash ) in a simpler go-ipfs independent utility, allowing rapid retries of different parameters
    • [x] 💯 Same as above but pertaining to linking strategies ( trickle-dag etc ), as ignoring the link-layer of streams skews the results disproportionately
    • [ ] 98% ( subsumes a large portion of points below v0.1 ETA: DEMO AT TEAM-WEEK ) Fully implement a standalone CLI utility re-implementing/converging with go-ipfs on all above algorithms. The distinguishing feature of said tool is the exposure of each chunker/linker as an atomic, composable primitive. The UX is similar to that of ffmpeg, whereby an input stream is processed via multiple "filters", with the result being a stream of blocks with statistics on their counts/sizes plus a valid IPFS CID. Current remaining tasks:
      • [x] 💯 Profile/optimize baseline stream ingestion, ensure there is no penalty from applying a "null-filter", which allows one to benchmark a particular hardware setup's theoretical maximum throughput
      • [x] 💯 Finalize the "stackable chunkers" UI/UX, allowing effortless demonstration of impact of such chunker chains on the
      • [x] 💯 Adjust statistics compilation/output for the above ( it currently looks like this, ignoring various "filter-levels" )
      • [x] 💯 Make final pass on memory allocation profile and fixup obvious low hanging fruit before v0.1
      • [ ] 80% README / godoc / stuffz
    • [ ] 80% Rewrite previously utilized plotly.js-based visualiser to aid with the above point
  • [ ] oI Open the document for a short discussion, soliciting feedback from workgroups
  • [ ] oII Perform a number of "brute force" tests aiming at reproducible results ( utilizing https://github.com/ipfs/testground ) ~~for the purposes of what we are trying to quantify iptb will be sufficient~~
  • [ ] oII ( half-covered by initial writeup ) Convert raw results into multi-dimensional scatter-plot visualizations ( plotly.js )
  • [ ] oIII Combine all available results into a "compromise chunking settings" RFC document
  • [ ] oIV Publish the results for discussion and decision of the level of incorporation into IPFS implementations ( default parameters, use of selected algorithm by default, etc )

ribasushi · Dec 04 '19 19:12