apps Application: Data Processing Workflows on IPFS

Work in progress - please contribute. See ipfs/apps#40.

Essential Use Case: when running data processing/analysis workflows, use IPFS as the storage layer. This allows your workflows to be agnostic about where the data are stored -- pulling all the source data onto a local node before running a workflow is an optimisation choice that can be done on the fly with zero impact on the code. Likewise, the results of the workflows can be written to IPFS and moved around as needed without impacting the referential integrity of your data.
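
A minimal sketch of why content addressing keeps workflow code independent of where the data lives. `MemoryStore` and `word_count_step` are hypothetical names for illustration, and a bare SHA-256 hex digest stands in for a real IPFS CID; nothing here talks to an actual IPFS node.

```python
import hashlib


class MemoryStore:
    """Toy content-addressed store; a stand-in for an IPFS node (assumption)."""

    def __init__(self):
        self._blocks = {}

    def add(self, data: bytes) -> str:
        # The key is derived from the content alone. Real IPFS uses CIDs,
        # not bare SHA-256 hex digests; this is only an illustration.
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def cat(self, key: str) -> bytes:
        return self._blocks[key]


def word_count_step(store, input_key: str) -> str:
    """Read input by key, write the result back, return the result's key."""
    text = store.cat(input_key).decode("utf-8")
    result = str(len(text.split())).encode("utf-8")
    return store.add(result)


remote = MemoryStore()
local = MemoryStore()
source_key = remote.add(b"tiles for a world map")

# "Pulling data onto a local node" is just a copy; the key does not change.
local_key = local.add(remote.cat(source_key))

# The same step runs unchanged against either store.
result_remote = word_count_step(remote, source_key)
result_local = word_count_step(local, local_key)
```

Because keys depend only on content, `source_key` and `local_key` are identical, and so are the two result keys: references to inputs and outputs stay valid wherever the bytes end up.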

Mar 17 '17 21:03 flyingzumwalt

Data Processing Workflows on IPFS

Essential Use Case

Use IPFS as a source or sink in data processing pipelines.

Use Cases

Concrete

Alice wants to render all tiles in a world map, using hundreds of worker computers.
Bob wants to run calculations over a stream of geo-tagged events, generated by millions of users.
Charlie wants to make available a down-sampled, filtered, and versioned archive of astrophysics photography.
Dana wants to train a deep-learning object classifier for use in self-driving cars.
Eve wants to run speech-to-text transcription and NLP tagging over large volumes of audio data.
Faye wants to run tests (and collect results) for her distributed file system on thousands of machines around the world.

Groupings:

MapReduce / Hadoop use cases (IPFS takes the role of GFS or HDFS)
Spark use cases (IPFS takes the role of data source & sink)
Real-time data stream processing (ingest data into IPFS, process it as it appears)
Versioning of datasets processed
Storing and distributing datasets processed
P2P backbone for a distributed computing cluster
Scientists and Data Scientists running data experiments
- in particular Machine Learning, Bioinformatics, Astronomy, etc.
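
One way the "versioning of datasets" grouping could look, sketched under assumptions: each version is a small manifest that names its files by content key and points at the previous version. The manifest layout (`files`, `parent`) and the helper names are hypothetical, and a SHA-256 hex digest stands in for a CID. This is not how IPFS itself records versions.

```python
import hashlib
import json
from typing import Dict, Iterator, Optional

store: Dict[str, bytes] = {}  # toy content-addressed store, not a real IPFS node


def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()  # stand-in for a CID
    store[key] = data
    return key


def commit(files: Dict[str, str], parent: Optional[str]) -> str:
    """Store a manifest naming each file's key and the previous version's key."""
    manifest = {"files": files, "parent": parent}
    # sort_keys makes the serialisation, and therefore the key, deterministic
    return put(json.dumps(manifest, sort_keys=True).encode("utf-8"))


def history(version: Optional[str]) -> Iterator[str]:
    """Walk parent links from a version back to the first one."""
    while version is not None:
        yield version
        version = json.loads(store[version])["parent"]


raw = put(b"full-resolution image")
v1 = commit({"image.raw": raw}, parent=None)
small = put(b"down-sampled image")
v2 = commit({"image.raw": raw, "image.small": small}, parent=v1)
```

Unchanged files keep the same key across versions, so a new version only adds the blocks that actually differ.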

Other:

IPFS can be source and sink for data
IPFS can store intermediate results
IPFS can deduplicate data, and possibly work (repeated computation) as well
IPFS can version all results
libp2p pubsub can be used to announce events
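
A rough sketch of what "deduplicating work" might mean in practice: if intermediate results are stored by content key, a step that has already run on identical input can return the stored result instead of recomputing. The names (`run_step`, `transcribe`) are hypothetical, the store is an in-memory dict, and SHA-256 hex digests stand in for CIDs.

```python
import hashlib
from typing import Callable, Dict

blocks: Dict[str, bytes] = {}   # toy content-addressed store (assumption)
results: Dict[str, str] = {}    # "step name:input key" -> output key
calls = {"count": 0}            # how many times real work was done


def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()  # stand-in for a CID
    blocks[key] = data
    return key


def run_step(name: str, fn: Callable[[bytes], bytes], input_key: str) -> str:
    """Run fn on the input unless the same step already ran on the same content."""
    memo_key = f"{name}:{input_key}"
    if memo_key not in results:
        calls["count"] += 1
        results[memo_key] = put(fn(blocks[input_key]))
    return results[memo_key]


def transcribe(audio: bytes) -> bytes:
    # hypothetical placeholder for an expensive step
    return audio.upper()


key_a = put(b"hello")
key_b = put(b"hello")  # identical content, so the same key: data is deduplicated
out_1 = run_step("transcribe", transcribe, key_a)
out_2 = run_step("transcribe", transcribe, key_b)  # cache hit: work is deduplicated
```

This only works for deterministic steps; anything that depends on time, randomness, or external state would need that dependency included in the memo key.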

Foundational Features + Functionality

ipfs basics (add, cat)
ipfs data importers (for better perf and dedup)
ipfs versioning
very high throughput & perf
libp2p pubsub (to announce events)
libp2p pluggable routing to have fast content-routing
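
For the "ipfs basics (add, cat)" item, a thin wrapper around the `ipfs` command-line tool is probably the simplest way to use a node as source and sink from a script. This sketch assumes the `ipfs` binary is on `PATH` with an initialised repo; the function names and the injectable `run` parameter are illustrative choices, not part of any IPFS library.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], bytes]


def run_cli(argv: List[str]) -> bytes:
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(argv, check=True, capture_output=True).stdout


def ipfs_add(path: str, run: Runner = run_cli) -> str:
    """Sink: add a file and return its hash."""
    # -Q (--quieter) makes `ipfs add` print only the final hash
    return run(["ipfs", "add", "-Q", path]).decode("utf-8").strip()


def ipfs_cat(cid: str, run: Runner = run_cli) -> bytes:
    """Source: fetch the content behind a hash."""
    return run(["ipfs", "cat", cid])
```

A pipeline step would call `ipfs_cat(input_cid)` to read its input, write its output to a file, and pass that file to `ipfs_add` to get the hash to hand to the next step. The `run` parameter exists so the wrapper can be exercised without a node.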

Existing Projects + Organizations Working in this Area

A number of people have expressed a desire to:

do dataset processing on datasets hosted on IPFS (eg @jonnycrunch)
build a distributed computing system on top of IPFS. (eg @diasdavid, @jbenet, @mitar, Golem Project)

Our own test lab: https://github.com/ipfs/test-lab
things like BrowserCloud, Pando, etc.
things like Golem Project

Apr 05 '17 15:04 jbenet

Read this very interesting architecture: https://www.cse.unsw.edu.au/~hpaik/thesis/showcases/16s2/scott_brisbane.pdf

Planning on building the full-scale services.

May 26 '18 19:05 0zAND1z

@kggp1995 https://s3-ap-southeast-2.amazonaws.com/scott-brisbane-thesis/decentralising-big-data-processing.pdf

May 27 '18 22:05 saurabhdhupar

Thanks for adding the link to the full report

May 28 '18 04:05 0zAND1z

Great, we're trying

Jun 04 '18 03:06 akevy

@scottybrisbane do you have a public repo with your ipfs/hdfs integration?

Jun 17 '18 19:06 echarles

@echarles not just yet, although I am planning to post my work. It's very much a POC, but could be a good starting point for anyone wanting to get something going.

I'll update this thread when I post the code.

Jun 18 '18 10:06 scottybrisbane

Thx @scottybrisbane - a POC is fine. Once it's published, I expect contributors (like me) to try it and help the code evolve. Without pushing you, any idea of the timeline? (Are we talking days, weeks, months... before something is public?) Btw, if you're worried about incomplete features, imperfect code, or missing documentation... just push what you have and others will help; that's how open source works.

Jun 18 '18 10:06 echarles

@echarles I'm hoping to have it up within a few weeks.

Jun 18 '18 20:06 scottybrisbane

Hey Folks, Any update on this topic? @scottybrisbane @echarles

Sep 30 '18 12:09 bo-liu

Hi, can I ask what sort of data processing people want to use IPFS for?

Sep 30 '18 15:09 ajbouh

> Hi, can I ask what sort of data processing people want to use IPFS for?

It really depends on what kind of data is stored in IPFS (or which kind of data you are interested in).

Oct 09 '18 07:10 bo-liu

Right, I'm looking to help make sure IPFS is a good fit for the kind of data processing people want to do.

Getting specific examples helps me ensure we're putting effort in the right places.

Oct 09 '18 07:10 ajbouh

@scottybrisbane Your work is really interesting! The use case is fascinating. Do you know when you will be able to post your work?

Nov 16 '18 15:11 bertrandfalguiere

The KIP team (@kipfoundation) is working on a reference implementation that may align with Scotty's work. More about our implementation of big data persistence will be posted under section 7 here: https://kipfoundation.github.io/techprimer/7-Realm-Storage.html

Look forward to adding HDFS support together!

Nov 17 '18 12:11 0zAND1z

@scottybrisbane Great work! Are you going to publish your code? I would really like to make use of it in my own thesis, so if you need some help, just hit me up :)