Parallel Processing
Flow is amazing, but its true power would come with parallel processing of datasets. The absolute minimum we need to prepare for that is:
- parallel data extraction
- dataframe serialization
- data shuffling (to move data between workers)
We also need to define the communication protocol between workers/nodes (something like Thrift or Protobuf).
For the last few weeks I have been mostly enjoying my vacation (that's why there was a gap in releases). Before that I started working on the playground project, but during my break I have been thinking quite a lot about the project.
Long story short, I think I'm going to shift the priorities slightly, delay the playground a bit (probably until next year) and focus on parallel processing, which seems to be the biggest missing piece.
What is parallel processing?
Reading datasets in chunks (splitting them into X chunks and processing them on Y processors), but also writing them in chunks. For now the idea is to use child processes with a communication protocol based on Apache Thrift.
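To make the "X chunks on Y processors" idea more concrete, here is a minimal, language-agnostic sketch written in Python. The function names `split_into_chunks` and `assign_to_workers` are hypothetical and not part of Flow. It only plans the work: it splits a dataset of known size into contiguous row ranges and spreads them across workers round-robin. Spawning the child processes themselves is left out.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int]  # (start, end) row offsets, end is exclusive


def split_into_chunks(total_rows: int, chunk_count: int) -> List[Chunk]:
    """Split [0, total_rows) into chunk_count contiguous ranges.

    Chunk sizes differ by at most one row.
    """
    if chunk_count <= 0:
        raise ValueError("chunk_count must be positive")
    base, remainder = divmod(total_rows, chunk_count)
    chunks: List[Chunk] = []
    start = 0
    for index in range(chunk_count):
        # The first `remainder` chunks take one extra row each.
        size = base + (1 if index < remainder else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def assign_to_workers(chunks: List[Chunk], worker_count: int) -> Dict[int, List[Chunk]]:
    """Distribute chunks over workers round-robin."""
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    assignment: Dict[int, List[Chunk]] = {worker: [] for worker in range(worker_count)}
    for index, chunk in enumerate(chunks):
        assignment[index % worker_count].append(chunk)
    return assignment
```

With 10 rows, 4 chunks and 2 workers, each worker ends up with two row ranges to extract and process independently.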
The Thrift protocol would require definitions of the following building blocks (so they can be serialized and sent across processes):
- Rows / Row
- Loaders
- Transformers
- Scalar Functions
- Pipelines (not yet 100% sure about this one)
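As a rough illustration of what serializing one of these building blocks means, the sketch below round-trips a batch of rows through a simple binary encoding. This is not Thrift: it is a hand-rolled stand-in (a 4-byte row count, then length-prefixed JSON payloads), and the `Row`, `Rows`, `encode_rows` and `decode_rows` names are assumptions made for this example only. A real Thrift definition would also carry typed entries.

```python
import json
import struct
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Row:
    entries: Dict[str, Any]


@dataclass
class Rows:
    rows: List[Row]


def encode_rows(batch: Rows) -> bytes:
    """Encode a batch as: row count, then (length, JSON payload) per row."""
    parts = [struct.pack(">I", len(batch.rows))]
    for row in batch.rows:
        payload = json.dumps(row.entries, sort_keys=True).encode("utf-8")
        parts.append(struct.pack(">I", len(payload)))
        parts.append(payload)
    return b"".join(parts)


def decode_rows(data: bytes) -> Rows:
    """Inverse of encode_rows."""
    (count,) = struct.unpack_from(">I", data, 0)
    offset = 4
    rows: List[Row] = []
    for _ in range(count):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        payload = data[offset:offset + length]
        rows.append(Row(json.loads(payload.decode("utf-8"))))
        offset += length
    return Rows(rows)
```

The point is only that each building block needs a stable wire representation on both sides of the process boundary. Loaders, transformers and scalar functions would need the same treatment.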
In the first phase only the following data sources are going to be supported:
- CSV
- Parquet
- Doctrine DBAL
In the second phase we should add:
- JSON
- XML
- Elasticsearch
- MeiliSearch
- HTTP
The trickiest part is going to be data shuffling between processors for operations like sorting, grouping, aggregating and pivoting. Those algorithms usually require access to the entire dataset (at least at some point). Other than that, most of the pieces are already in place.
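One common way to approach shuffling for grouping and aggregating is hash partitioning: every row is routed to a worker chosen by hashing its group key, so all rows with the same key land on the same worker and can be aggregated locally. The sketch below assumes this approach and is not a settled design for Flow. All names in it are hypothetical, and the "workers" are plain lists instead of processes.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List

RowDict = Dict[str, Any]


def partition_for(key: Any, worker_count: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(repr(key).encode("utf-8")) % worker_count


def shuffle(partitions: List[List[RowDict]], key: str, worker_count: int) -> List[List[RowDict]]:
    """Redistribute rows so that rows with equal keys share a worker."""
    shuffled: List[List[RowDict]] = [[] for _ in range(worker_count)]
    for partition in partitions:
        for row in partition:
            shuffled[partition_for(row[key], worker_count)].append(row)
    return shuffled


def sum_by_key(rows: List[RowDict], key: str, value: str) -> Dict[Any, int]:
    """Local aggregation that each worker can run after the shuffle."""
    totals: Dict[Any, int] = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)
```

After the shuffle, the per-worker results have disjoint keys, so merging them is a plain union. Sorting usually needs a different scheme (for example range partitioning), which this sketch does not cover.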
So in general there are going to be 2 independent protocols used here:
- Messages Protocol - this is going to be based on Thrift (binary serialization)
- Communication Protocol - how to exchange messages between processes
There is something called MPI (https://pl.wikipedia.org/wiki/Message_Passing_Interface), which is exactly what I'm looking for, but unfortunately there are no implementations in PHP. Apache Spark, for example, uses https://netty.io/ (a library for client/server communication), which on the conceptual level is very similar to MPI.
So it seems I have no other choice than to design something very similar: a client/server communication library that would act as an abstraction layer. I'm thinking about creating adapters for HTTP/2 and TCP, and as for the implementation:
- ReactPHP
- AMPHP
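The abstraction layer with swappable adapters could be shaped roughly like the sketch below. It is written in Python only to keep the example short and runnable. The `Transport` interface and `SocketTransport` adapter are hypothetical names, and the 4-byte length prefix is an assumed framing scheme, not a decision made for Flow. An HTTP/2 adapter would implement the same two methods on top of a different transport.

```python
import socket
import struct
from abc import ABC, abstractmethod


class Transport(ABC):
    """What the rest of the system sees: whole messages in, whole messages out."""

    @abstractmethod
    def send(self, message: bytes) -> None: ...

    @abstractmethod
    def receive(self) -> bytes: ...


class SocketTransport(Transport):
    """Adapter over a connected stream socket (for example TCP).

    Each message is framed as a 4-byte big-endian length followed by the payload.
    """

    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock

    def send(self, message: bytes) -> None:
        self._sock.sendall(struct.pack(">I", len(message)) + message)

    def receive(self) -> bytes:
        (length,) = struct.unpack(">I", self._read_exactly(4))
        return self._read_exactly(length)

    def _read_exactly(self, size: int) -> bytes:
        # recv() may return fewer bytes than requested, so loop until done.
        buffer = bytearray()
        while len(buffer) < size:
            chunk = self._sock.recv(size - len(buffer))
            if not chunk:
                raise ConnectionError("peer closed the connection mid-frame")
            buffer.extend(chunk)
        return bytes(buffer)

    def close(self) -> None:
        self._sock.close()
```

A quick way to try it without a network is `socket.socketpair()`, which returns two connected sockets: wrap each end in a `SocketTransport`, call `send(b"hello")` on one and `receive()` on the other. The message payload itself (the Thrift-encoded part) stays opaque to this layer.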