Application: Data Processing Workflows on IPFS

Open flyingzumwalt opened this issue 7 years ago • 18 comments

Work in progress - please contribute. See ipfs/apps#40.

Essential Use Case: when running data processing/analysis workflows, use IPFS as the storage layer. This allows your workflows to be agnostic about where the data are stored -- pulling all the source data onto a local node before running a workflow is an optimisation choice that can be done on the fly with zero impact on the code. Likewise, the results of the workflows can be written to IPFS and moved around as needed without impacting the referential integrity of your data.
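A minimal sketch of what "agnostic about where the data are stored" could look like in practice. The helper names (`ipfs_cat`, `ipfs_add`, `MemoryStore`, `run_step`) are hypothetical and not part of any IPFS library; only the `ipfs cat` and `ipfs add -Q` CLI commands are real, and the first two helpers assume the `ipfs` binary is installed with a reachable node. The in-memory store is a stand-in that uses SHA-256 hex digests instead of real CIDs, so the step can be tried without a node.

```python
import hashlib
import subprocess
from typing import Callable, Dict


def ipfs_cat(cid: str) -> bytes:
    """Fetch content by CID through the `ipfs` CLI (assumes it is installed)."""
    return subprocess.run(
        ["ipfs", "cat", cid], check=True, capture_output=True
    ).stdout


def ipfs_add(data: bytes) -> str:
    """Add bytes via stdin; `-Q` makes the CLI print only the final hash."""
    out = subprocess.run(
        ["ipfs", "add", "-Q"], input=data, check=True, capture_output=True
    )
    return out.stdout.decode().strip()


class MemoryStore:
    """In-memory stand-in for IPFS. Keys are SHA-256 hex digests, NOT real CIDs."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key

    def cat(self, key: str) -> bytes:
        return self.blocks[key]


def run_step(
    cat: Callable[[str], bytes],
    add: Callable[[bytes], str],
    input_id: str,
    transform: Callable[[bytes], bytes],
) -> str:
    """Read input by content id, transform it, write the result, return its id.

    The step never sees a host name or file path, only content identifiers.
    """
    return add(transform(cat(input_id)))
```

With a running node, the same step would be called as `run_step(ipfs_cat, ipfs_add, some_cid, transform)`; whether the input was already pinned locally or is fetched from peers on demand does not change the workflow code.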

flyingzumwalt avatar Mar 17 '17 21:03 flyingzumwalt

Data Processing Workflows on IPFS

Essential Use Case

  • use ipfs as a source or sink in data processing pipelines

Use Cases

Concrete

  • Alice wants to render all tiles in a world map, using hundreds of worker computers.
  • Bob wants to run calculations over a stream of geo-tagged events, generated by millions of users.
  • Charlie wants to make available a down-sampled, filtered, and versioned archive of astrophysics photography.
  • Dana wants to train a deep-learning object classifier for use in self-driving cars.
  • Eve wants to run speech-to-text transcription and NLP tagging over large volumes of audio data.
  • Faye wants to run tests (and collect results) for her distributed file system on thousands of machines around the world.

Groupings:

  • MapReduce / Hadoop use cases (IPFS takes the role of GFS or HDFS)
  • Spark use cases (IPFS takes the role of data source & sink)
  • Real-time data stream processing (ingest data into IPFS, process it as it appears)
  • Versioning of datasets processed
  • Storing and distributing datasets processed
  • P2P backbone for a distributed computing cluster
  • Scientists and Data Scientists running data experiments
    • in particular Machine Learning, Bioinformatics, Astronomy, etc.

Other:

  • IPFS can be source and sink for data
  • IPFS can store intermediate results
  • IPFS can deduplicate data, and possibly work (repeated computation)
  • IPFS can version all results
  • libp2p pubsub can be used to announce events
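One way to read "deduplicate work" is memoization keyed on content: if a step's output is recorded under the pair (step name, input hash), a worker that sees the same input again can reuse the stored result instead of recomputing it. The sketch below is only an illustration of that idea; `MemoizedStep` and `content_id` are hypothetical names, the tables are plain in-process dictionaries rather than anything stored on IPFS, and SHA-256 hex digests stand in for real CIDs.

```python
import hashlib
from typing import Callable, Dict, Tuple


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


class MemoizedStep:
    """Runs `fn` at most once per distinct input content (hypothetical helper)."""

    def __init__(self, name: str, fn: Callable[[bytes], bytes]) -> None:
        self.name = name  # should change whenever the step's code changes
        self.fn = fn
        # (step name, input id) -> output id
        self.results: Dict[Tuple[str, str], str] = {}
        # output id -> output bytes
        self.blobs: Dict[str, bytes] = {}
        self.executions = 0

    def run(self, data: bytes) -> str:
        key = (self.name, content_id(data))
        if key not in self.results:
            # Cache miss: do the work once and record the output's id.
            self.executions += 1
            output = self.fn(data)
            out_id = content_id(output)
            self.blobs[out_id] = output
            self.results[key] = out_id
        return self.results[key]
```

This only works for deterministic steps, and in a distributed setting the results table would have to be shared between workers somehow; both points are assumptions of the sketch, not something the thread specifies.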

Foundational Features + Functionality

  • ipfs basics (add, cat)
  • ipfs data importers (for better perf and dedup)
  • ipfs versioning
  • very high throughput & perf
  • libp2p pubsub (to announce events)
  • libp2p pluggable routing to have fast content-routing
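To illustrate why importers matter for dedup, here is a simplified sketch of fixed-size chunking into a content-addressed block store. It is not the actual IPFS importer: real importers use much larger chunks and link them into a Merkle DAG, while this version uses a flat list of block ids, a tiny chunk size, and SHA-256 hex digests in place of CIDs. `import_bytes` and `export_bytes` are hypothetical names.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


def import_bytes(
    data: bytes, blocks: Dict[str, bytes], chunk_size: int = CHUNK_SIZE
) -> List[str]:
    """Split data into fixed-size chunks and store each under its hash.

    Returns the ordered list of block ids that make up the data.
    Chunks that are already in `blocks` are not stored a second time.
    """
    ids: List[str] = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        block_id = hashlib.sha256(chunk).hexdigest()
        blocks.setdefault(block_id, chunk)
        ids.append(block_id)
    return ids


def export_bytes(ids: List[str], blocks: Dict[str, bytes]) -> bytes:
    """Reassemble the original bytes from an ordered list of block ids."""
    return b"".join(blocks[block_id] for block_id in ids)
```

Importing `b"AAAABBBBCCCC"` and then `b"AAAABBBBDDDD"` into the same store leaves four blocks rather than six, because the first two chunks are shared. How well this works on real datasets depends on the chunking strategy lining up with where the data actually repeats.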

Existing Projects + Organizations Working in this Area

A number of people have expressed a desire to:

  • do dataset processing on datasets hosted on IPFS (e.g. @jonnycrunch)
  • build a distributed computing system on top of IPFS (e.g. @diasdavid, @jbenet, @mitar, Golem Project)
  1. Our own test lab: https://github.com/ipfs/test-lab
  2. things like BrowserCloud, Pando, etc.
  3. things like Golem Project

jbenet avatar Apr 05 '17 15:04 jbenet

Read this very interesting architecture: https://www.cse.unsw.edu.au/~hpaik/thesis/showcases/16s2/scott_brisbane.pdf

Planning on building the full-scale services.

0zAND1z avatar May 26 '18 19:05 0zAND1z

@kggp1995 https://s3-ap-southeast-2.amazonaws.com/scott-brisbane-thesis/decentralising-big-data-processing.pdf

saurabhdhupar avatar May 27 '18 22:05 saurabhdhupar

Thanks for adding the link to the full report

0zAND1z avatar May 28 '18 04:05 0zAND1z

Great, we're trying

akevy avatar Jun 04 '18 03:06 akevy

@scottybrisbane do you have a public repo with your ipfs/hdfs integration?

echarles avatar Jun 17 '18 19:06 echarles

@echarles not just yet, although I am planning to post my work. It's very much a POC, but could be a good starting point for anyone wanting to get something going.

I'll update this thread when I post the code.

scottybrisbane avatar Jun 18 '18 10:06 scottybrisbane

Thx @scottybrisbane - a POC is perfectly fine. Once it is published, I expect contributors (like me) to try it and help the code evolve. Without pushing you, any idea of the timeline? (Are we talking days, weeks, months... before something is public?) Btw, if you are worried about incomplete features, imperfect code, or missing documentation... just push what you have and others will help, that's how open source works.

echarles avatar Jun 18 '18 10:06 echarles

@echarles I'm hoping to have it up within a few weeks.

scottybrisbane avatar Jun 18 '18 20:06 scottybrisbane

Hey Folks, Any update on this topic? @scottybrisbane @echarles

bo-liu avatar Sep 30 '18 12:09 bo-liu

Hi, can I ask what sort of data processing people want to use IPFS for?

ajbouh avatar Sep 30 '18 15:09 ajbouh

Hi, can I ask what sort of data processing people want to use IPFS for?

It really depends on what kind of data (or which kind of data you are interested in) is stored in IPFS.

bo-liu avatar Oct 09 '18 07:10 bo-liu

Right, I'm looking to help make sure IPFS is a good fit for the kind of data processing people want to do.

Getting specific examples helps me ensure we're putting effort in the right places.

ajbouh avatar Oct 09 '18 07:10 ajbouh

@scottybrisbane Your work is really interesting! The use case is fascinating. Do you know when you will be able to post your work?

bertrandfalguiere avatar Nov 16 '18 15:11 bertrandfalguiere

The KIP team (@kipfoundation) is working on a reference implementation that may align with Scotty's work. More about our implementation of big data persistence will be added under section 7 here: https://kipfoundation.github.io/techprimer/7-Realm-Storage.html

Look forward to adding HDFS support together!

0zAND1z avatar Nov 17 '18 12:11 0zAND1z

@scottybrisbane Great work! Are you going to publish your code? I would really like to make use of it in my own thesis, so if you need some help, just hit me up :)

Duske avatar Apr 05 '19 15:04 Duske

Note: Discussions on applications of IPFS are happening over in the IPFS Forums now ... please continue the discussion there!

This issue is being moved over to the archived repo https://github.com/ipfs/apps/ for reference.

jessicaschilling avatar Mar 26 '20 18:03 jessicaschilling

https://github.com/filecoin-project/bacalhau/wiki

lookfirst avatar Jan 14 '23 23:01 lookfirst