Application: Data Processing Workflows on IPFS
Work in progress - please contribute. See ipfs/apps#40.
Essential Use Case: when running data processing/analysis workflows, use IPFS as the storage layer. This allows your workflows to be agnostic about where the data are stored -- pulling all the source data onto a local node before running a workflow is an optimisation choice that can be done on the fly with zero impact on the code. Likewise, the results of the workflows can be written to IPFS and moved around as needed without impacting the referential integrity of your data.
Data Processing Workflows on IPFS
Essential Use Case
- Use IPFS as a source or sink in data processing pipelines
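A minimal sketch of a pipeline step that treats IPFS as both source and sink. It assumes a local `ipfs` (Kubo) binary on the PATH and shells out to `ipfs cat` and `ipfs add -Q`. The helper names (`ipfs_cat`, `ipfs_add`, `line_count_step`) are hypothetical and not part of any IPFS library. The reader and writer are passed in as parameters, so the step itself does not need to know where the bytes live.

```python
import subprocess
from typing import Callable


def ipfs_cat(cid: str) -> bytes:
    """Fetch the bytes behind a CID by shelling out to `ipfs cat`."""
    result = subprocess.run(["ipfs", "cat", cid], capture_output=True, check=True)
    return result.stdout


def ipfs_add(data: bytes) -> str:
    """Store bytes with `ipfs add -Q` (data on stdin) and return the printed CID."""
    result = subprocess.run(
        ["ipfs", "add", "-Q"], input=data, capture_output=True, check=True
    )
    return result.stdout.decode("utf-8").strip()


def line_count_step(
    source_cid: str,
    read: Callable[[str], bytes] = ipfs_cat,
    write: Callable[[bytes], str] = ipfs_add,
) -> str:
    """Read input by CID, count its lines, store the report, return its CID."""
    text = read(source_cid).decode("utf-8")
    report = f"lines={len(text.splitlines())}\n"
    return write(report.encode("utf-8"))
```

Because `read` and `write` are injected, the same step can be exercised against an in-memory stand-in without a running node, which is one way to keep workflow code independent of where the data is stored.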
Use Cases
Concrete
- Alice wants to render all tiles in a world map, using hundreds of worker computers.
- Bob wants to run calculations over a stream of geo-tagged events, generated by millions of users.
- Charlie wants to make available a down-sampled, filtered, and versioned archive of astrophysics photography.
- Dana wants to train a deep-learning object classifier for use in self-driving cars.
- Eve wants to run speech-to-text transcription and NLP tagging over large volumes of audio data.
- Faye wants to run tests (and collect results) for her distributed file system on thousands of machines around the world.
Groupings:
- MapReduce / Hadoop use cases (IPFS takes the role of GFS or HDFS)
- Spark use cases (IPFS takes the role of data source & sink)
- Real-time data stream processing (ingest data into IPFS, process it as it appears)
- Versioning of datasets processed
- Storing and distributing datasets processed
- P2P backbone for a distributed computing cluster
- Scientists and Data Scientists running data experiments
- in particular: Machine Learning, Bioinformatics, Astronomy, etc.
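To make the MapReduce grouping concrete, here is a rough sketch in which a content-addressed store plays the role that GFS or HDFS would normally play. `ContentStore` is an in-memory stand-in invented for this example: its keys are plain SHA-256 hex digests, not real IPFS CIDs, and the function names are hypothetical. Input chunks, intermediate results, and the final result are all referenced by hash.

```python
import hashlib
import json
from collections import Counter
from typing import List


class ContentStore:
    """In-memory stand-in for IPFS; keys are SHA-256 digests, not real CIDs."""

    def __init__(self) -> None:
        self._blocks = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def cat(self, key: str) -> bytes:
        return self._blocks[key]


def map_word_counts(store: ContentStore, chunk_key: str) -> str:
    """Map: count words in one chunk and store the partial result."""
    words = store.cat(chunk_key).decode("utf-8").split()
    return store.add(json.dumps(Counter(words), sort_keys=True).encode("utf-8"))


def reduce_word_counts(store: ContentStore, partial_keys: List[str]) -> str:
    """Reduce: merge the partial counts and store the total."""
    total = Counter()
    for key in partial_keys:
        total.update(json.loads(store.cat(key)))
    return store.add(json.dumps(total, sort_keys=True).encode("utf-8"))


def run_job(store: ContentStore, chunk_keys: List[str]) -> str:
    """Run the map phase over every chunk, then reduce; return the result key."""
    partials = [map_word_counts(store, key) for key in chunk_keys]
    return reduce_word_counts(store, partials)
```

In a real cluster the map calls would run on different workers; the sketch runs them sequentially to keep the data flow visible.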
Other:
- IPFS can be source and sink for data
- IPFS can store intermediate results
- IPFS can deduplicate data, and possibly work (repeated computations over the same input)
- IPFS can version all results
- libp2p pubsub can be used to announce events
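One way to read "deduplicate data and maybe work" is sketched below: identical bytes map to the same key, so a task result can be cached under the pair (task name, input key) and reused. This is an illustration under assumptions, not an IPFS feature: `ContentStore` is an in-memory stand-in keyed by SHA-256 digests rather than real CIDs, and `MemoizingRunner` is a hypothetical helper.

```python
import hashlib
from typing import Callable, Dict, Tuple


class ContentStore:
    """In-memory stand-in for IPFS; keys are SHA-256 digests, not real CIDs."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data  # identical data lands on the same key
        return key

    def cat(self, key: str) -> bytes:
        return self._blocks[key]


class MemoizingRunner:
    """Runs tasks over stored inputs and reuses results for repeated inputs."""

    def __init__(self, store: ContentStore) -> None:
        self.store = store
        self.executions = 0
        self._results: Dict[Tuple[str, str], str] = {}

    def run(self, task_name: str, task: Callable[[bytes], bytes], input_key: str) -> str:
        cache_key = (task_name, input_key)
        if cache_key not in self._results:
            output = task(self.store.cat(input_key))
            self.executions += 1
            self._results[cache_key] = self.store.add(output)
        return self._results[cache_key]
```

The cache is only sound if the task is deterministic and `task_name` uniquely identifies its code; both are assumptions of the sketch.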
Foundational Features + Functionality
- ipfs basics (add, cat)
- ipfs data importers (for better performance and deduplication)
- ipfs versioning
- very high throughput and performance
- libp2p pubsub (to announce events)
- libp2p pluggable routing to have fast content-routing
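As an illustration of how versioning can sit on top of the basic add and cat operations, the sketch below stores each dataset version as a small manifest that points at the data and at the previous manifest by hash. This is not how IPFS itself implements versioning; `ContentStore` is an in-memory stand-in keyed by SHA-256 digests rather than real CIDs, and `commit` and `history` are hypothetical helpers.

```python
import hashlib
import json
from typing import Dict, List, Optional


class ContentStore:
    """In-memory stand-in for IPFS; keys are SHA-256 digests, not real CIDs."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def cat(self, key: str) -> bytes:
        return self._blocks[key]


def commit(store: ContentStore, data_key: str, parent: Optional[str], message: str) -> str:
    """Store a manifest linking this version's data to its parent version."""
    manifest = {"data": data_key, "parent": parent, "message": message}
    return store.add(json.dumps(manifest, sort_keys=True).encode("utf-8"))


def history(store: ContentStore, head: str) -> List[str]:
    """Walk parent links from the newest version back to the first one."""
    messages = []
    current: Optional[str] = head
    while current is not None:
        manifest = json.loads(store.cat(current))
        messages.append(manifest["message"])
        current = manifest["parent"]
    return messages
```

Because each manifest embeds its parent's hash, changing any earlier version would change every later key, which is what makes such a chain useful for referential integrity.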
Existing Projects + Organizations Working in this Area
A number of people have expressed a desire to:
- do dataset processing on datasets hosted on IPFS (e.g. @jonnycrunch)
- build a distributed computing system on top of IPFS (e.g. @diasdavid, @jbenet, @mitar, Golem Project)
Related projects:
- Our own test lab: https://github.com/ipfs/test-lab
- BrowserCloud, Pando, etc.
- Golem Project
Read this very interesting architecture: https://www.cse.unsw.edu.au/~hpaik/thesis/showcases/16s2/scott_brisbane.pdf
Planning on building the full-scale services.
@kggp1995 https://s3-ap-southeast-2.amazonaws.com/scott-brisbane-thesis/decentralising-big-data-processing.pdf
Thanks for adding the link to the full report
Great, we're trying
@scottybrisbane do you have a public repo with your ipfs/hdfs integration?
@echarles not just yet, although I am planning to post my work. It's very much a POC, but could be a good starting point for anyone wanting to get something going.
I'll update this thread when I post the code.
Thx @scottybrisbane - a POC is perfectly fine. Once it is published, I expect contributors (like me) to try it and help the code evolve. Without wanting to push you, any idea of the timeline? (Are we talking about days, weeks, months... before having something public?) Btw, if you are worried about incomplete features, imperfect code, missing documentation... just push what you have and others will help, that's how open source works.
@echarles I'm hoping to have it up within a few weeks.
Hey Folks, Any update on this topic? @scottybrisbane @echarles
Hi, can I ask what sort of data processing people want to use IPFS for?
Hi, can I ask what sort of data processing people want to use IPFS for?
It really depends on what kind of data is stored in IPFS (or which kind of data you are interested in).
Right, I'm looking to help make sure IPFS is a good fit for the kind of data processing people want to do.
Getting specific examples helps me ensure we're putting effort in the right places.
@scottybrisbane Your work is really interesting! The use case is fascinating. Do you know when you will be able to post your work?
The KIP team (@kipfoundation) is working on a reference implementation that may align with Scotty's work. More about our implementation of big data persistence will be added under section 7 here: https://kipfoundation.github.io/techprimer/7-Realm-Storage.html
Look forward to adding HDFS support together!
@scottybrisbane Great work! Are you going to publish your code? I would really like to make use of it in my own thesis, so if you need some help, just hit me up :)
Note: Discussion on applications of IPFS is happening over in the IPFS Forums now ... please continue the discussion there!
This issue is being moved over to the archived repo https://github.com/ipfs/apps/ for reference.
https://github.com/filecoin-project/bacalhau/wiki