datapackage-pipelines
Spawn processors in a pipeline for parallelism?
Description
@danfowler has recently been using DPP for some very large source files (an 8GB CSV). With the way pipelines currently work, processing this data through a single conceptual stream is too slow.
There are various ways to deal with this:
- Don't use streaming/DPP for such large files. Copy the data into a DB and use that.
- Try to use more efficient file backends for such files, such as HDF5, Feather, or mmapped files.
- Allow DPP to spawn multiple processing (sub-)pipelines, which end in a sink processor.
- ?
I'd like to explore option 3. @akariv what are your thoughts?
From my very high-level view of the framework, there are no inherent design barriers to this. The only concern, I guess, is that the descriptor object is mutable global state, and maybe the caching mechanisms are too. This might mean that spawned processors should only work on the data sources, and not on the metadata.
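To make option 3 a bit more concrete, here is a minimal sketch in plain Python (not the actual DPP processor API): the parent process keeps the descriptor, spawned workers only ever see chunks of row data, and a single sink collects the results in order. The column names and the per-row transform are illustrative placeholders.

```python
import csv
from itertools import islice
from multiprocessing import Pool

CHUNK_SIZE = 100_000  # rows handed to each spawned worker


def transform_rows(rows):
    # Placeholder for the per-row logic a sub-pipeline would run.
    # Workers touch only row data -- never the (mutable) descriptor.
    return [{**row, 'value': float(row['value']) * 2} for row in rows]


def chunks(reader, size):
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk


def run_parallel(source_path, sink_path, workers=4):
    with open(source_path, newline='') as src, \
         open(sink_path, 'w', newline='') as dst, \
         Pool(workers) as pool:
        reader = csv.DictReader(src)
        writer = None
        # imap preserves chunk order, so the sink sees rows in source order
        for processed in pool.imap(transform_rows, chunks(reader, CHUNK_SIZE)):
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=processed[0].keys())
                writer.writeheader()
            writer.writerows(processed)


if __name__ == '__main__':
    run_parallel('big-source.csv', 'processed.csv')
```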
This is in fact describing a map/reduce framework, no? I think it might be better to leave the orchestration of the map/reduce tasks to a dedicated framework (e.g. Hadoop). The tasks themselves, however, could of course be implemented with dpp.
Yes, it is map/reduce. I guess we can explore how we could use other frameworks like Hadoop for orchestration.
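For reference, if orchestration were handed to something like Hadoop Streaming, the division of labour might look like the following sketch: plain stdin/stdout mapper and reducer scripts, each of which could internally invoke a dpp step. The 'category'/'value' columns and the sum aggregation are made-up examples, not anything defined in this repo.

```python
#!/usr/bin/env python
# mapper.py -- reads raw CSV lines from stdin, emits key<TAB>value pairs.
# Column positions are hard-coded because Hadoop hands each mapper an
# arbitrary slice of the file, so a header row cannot be relied upon.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split(',')
    if fields[0] == 'category':  # skip the header line if it appears
        continue
    category, value = fields[0], fields[1]
    sys.stdout.write(f"{category}\t{value}\n")
```

```python
#!/usr/bin/env python
# reducer.py -- sums values per key (input arrives sorted by key).
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current_key:
        if current_key is not None:
            sys.stdout.write(f"{current_key}\t{total}\n")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    sys.stdout.write(f"{current_key}\t{total}\n")
```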