Move sort & join operations to SQLite when possible
This is an experimental feature that might not work at all.
So the idea is, whenever possible (or when configured), to offload those heavy tasks to a better-optimized engine.
We should start with sorting since it's much easier. What needs to happen is the following (a minimal sketch is included after the list):
- create a sqlite db with columns reflecting the sort-by columns and their types
- serialize the entire row and save it as a serialized string in a blob column
- read from the sorted results of the SQL query and yield out unserialized rows to make sure nothing changed
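A minimal sketch of those three steps, assuming the sqlite3 extension, rows as plain arrays, a single integer sort column, and PHP's native serialize()/unserialize() standing in for the real Flow row serializer (the function, table, and column names here are made up):

```php
<?php

/**
 * Minimal sketch: sort rows through SQLite instead of in PHP memory.
 */
function sortThroughSqlite(iterable $rows, string $sortBy) : \Generator
{
    $path = \sys_get_temp_dir() . '/flow_sort_' . \uniqid() . '.sqlite';
    $db = new \SQLite3($path);

    // 1. a column reflecting the sort-by column + a blob column for the whole row
    $db->exec('CREATE TABLE sort_buffer (sort_key INTEGER, row BLOB)');

    // 2. serialize every row and store it together with its sort key
    $insert = $db->prepare('INSERT INTO sort_buffer (sort_key, row) VALUES (:key, :row)');

    foreach ($rows as $row) {
        $insert->bindValue(':key', $row[$sortBy], SQLITE3_INTEGER);
        $insert->bindValue(':row', \serialize($row), SQLITE3_BLOB);
        $insert->execute();
        $insert->reset();
    }

    // 3. read from the sorted results and yield unserialized rows
    $result = $db->query('SELECT row FROM sort_buffer ORDER BY sort_key ASC');

    while ($sorted = $result->fetchArray(SQLITE3_ASSOC)) {
        yield \unserialize($sorted['row']);
    }

    $db->close();
    \unlink($path);
}
```

Usage would then be along the lines of `foreach (sortThroughSqlite($rows, 'id') as $row) { ... }`, so the caller never sees the temporary db.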
One important thing here is that the sqlite db should be created and configured in a way that allows for the following (see the sketch after this list):
- setting the maximum allowed memory consumption
- passing a custom serializer that could encrypt the serialized row, in order to avoid any data leaks
- removing the db once reading is done
- defining where that temporary db is created (in memory should also be an option)
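A rough idea of how those requirements could map to SQLite settings; the path, the cache limit, and the `$inMemory` flag are all hypothetical, and an encrypting serializer would plug in wherever rows get serialized before insert:

```php
<?php

$inMemory = false;
$path = \sys_get_temp_dir() . '/flow_sort.sqlite';

// location: caller-provided path, or ':memory:' for an in-memory db
$db = new \SQLite3($inMemory ? ':memory:' : $path);

// memory: cap the page cache (negative value = size in KiB)
// and keep temporary structures in RAM instead of extra files
$db->exec('PRAGMA cache_size = -16000'); // ~16 MB
$db->exec('PRAGMA temp_store = MEMORY');

// ... create table / insert / read sorted results here ...

// cleanup: once reading is done, the db is removed
$db->close();

if (!$inMemory) {
    \unlink($path);
}
```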
Once a proof of concept is created, we should measure its performance and compare it with the native sort. The main bottleneck might be writing to / reading from sqlite.
My recommendation would be to start with the sqlite3 extension and, if it works, also provide a pdo_sqlite alternative.
I would make it part of the core/etl and detect whether sqlite is even available.
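The detection could be as simple as checking the loaded extensions, something like:

```php
<?php

// possible availability check: prefer the sqlite3 extension, fall back to
// pdo_sqlite, and otherwise keep the current native sort implementation
if (\extension_loaded('sqlite3')) {
    // use \SQLite3
} elseif (\extension_loaded('pdo_sqlite')) {
    // use \PDO with the sqlite: DSN
} else {
    // fall back to the native sort
}
```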
@norberttech I think we should investigate whether we should offload to sqlite only the fields that are used for sorting, instead of inserting the whole row. Then, based on the returned ids (which could be the hash of the row or its index in the array), we return the real data.
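A rough sketch of what I mean, using the array index as the id (names are made up, and note it assumes the rows stay addressable in memory for the final lookup):

```php
<?php

function sortByKeysOnly(array $rows) : \Generator
{
    $db = new \SQLite3(':memory:');

    // store only the sort field plus the row's array index
    $db->exec('CREATE TABLE sort_keys (idx INTEGER, id INTEGER)');

    $insert = $db->prepare('INSERT INTO sort_keys (idx, id) VALUES (:idx, :id)');

    foreach ($rows as $idx => $row) {
        $insert->bindValue(':idx', $idx, SQLITE3_INTEGER);
        $insert->bindValue(':id', $row['id'], SQLITE3_INTEGER);
        $insert->execute();
        $insert->reset();
    }

    // only the indexes come back; the real data is returned from $rows
    $result = $db->query('SELECT idx FROM sort_keys ORDER BY id ASC');

    while ($key = $result->fetchArray(SQLITE3_ASSOC)) {
        yield $rows[$key['idx']];
    }

    $db->close();
}
```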
So let's say we want to sort by `id, date`.
Then in the SQLite db we need to create an `id` column (type based on the Flow column type, probably `INTEGER`) and a `date` column (probably also `INTEGER`, as there is no native date type).
Once those columns are created, one more is needed: `_serialized_dataframe_row_`, or something similar, with the entire row serialized.
So when we execute the query and iterate over the results, we only read that `_serialized_dataframe_row_` and yield it as is, without any type conversions (the `id` and `date` will stay exactly as they were serialized).
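Here is a sketch of that exact example, assuming dates arrive as `\DateTimeImmutable` and, again, serialize()/unserialize() standing in for the real Flow row serializer:

```php
<?php

function sortByIdAndDate(iterable $rows) : \Generator
{
    $db = new \SQLite3(':memory:');

    // two typed sort columns plus the serialized row
    $db->exec(
        'CREATE TABLE sort_buffer (
            id INTEGER,
            date INTEGER,
            _serialized_dataframe_row_ BLOB
        )'
    );

    $insert = $db->prepare(
        'INSERT INTO sort_buffer (id, date, _serialized_dataframe_row_)
         VALUES (:id, :date, :row)'
    );

    foreach ($rows as $row) {
        $insert->bindValue(':id', $row['id'], SQLITE3_INTEGER);
        $insert->bindValue(':date', $row['date']->getTimestamp(), SQLITE3_INTEGER);
        $insert->bindValue(':row', \serialize($row), SQLITE3_BLOB);
        $insert->execute();
        $insert->reset();
    }

    // only the serialized row is read back; id and date stay exactly as serialized
    $result = $db->query(
        'SELECT _serialized_dataframe_row_ AS row FROM sort_buffer ORDER BY id, date'
    );

    while ($sorted = $result->fetchArray(SQLITE3_ASSOC)) {
        yield \unserialize($sorted['row']);
    }

    $db->close();
}
```

The `ORDER BY id, date` is the whole point here: the ordering work happens inside SQLite, and PHP only streams the blobs back out.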
Why do we need to store those serialized rows if they are not even used in the sort operation? Because otherwise we would need to go back to the source dataset and read the data in exactly the same order as the output of the sort operation, which for most datasets would not be scalable at all.
We also can't keep the whole dataset in memory while the SQLite sort operation is executed, as datasets might be bigger than the available memory.
That's why I think the only reasonable option is to store serialized rows in SQLite and yield them back directly from it, but maybe I'm missing something here.
I hope it makes more sense now!
The main reason why I was trying not to store a serialized row in the database is that serialization is also a really expensive operation. But your arguments seem valid. I will leave the theory for now, since creating a PoC will help us test any assumptions.
if we could figure out a way to avoid serialization that would be fantastic, but I’m not sure if that's even possible :/