[Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes
Background / motivation
Approach 0: each app writes to the lake, in its own process. Predictoor bot updates the data lake, then uses the lake, in time for making predictions. Same for other apps in pdr-backend.
Approach 1: separate lake, started separately
- User starts a separate process: `pdr lake`. It's constantly writing to the data lake.
- Predictoor bot reads from the lake, but does not write (for safety). Same for other apps.
- This was the idea when we conceived of lake.
But we can do better yet, leveraging the database concept of "locking" which enables >1 writers without hurting DB safety. Writers must handle contention due to locks, eg by waiting.
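To make "writers handle contention by waiting" concrete, here's a minimal sketch. It assumes a file-based lock via the `filelock` package; `LAKE_LOCK_PATH` and `write_batch` are hypothetical names, not existing pdr-backend APIs.

```python
# Sketch: a lake writer that waits (with backoff) when another writer
# holds the lock. LAKE_LOCK_PATH and write_batch() are hypothetical.
import time
from filelock import FileLock, Timeout

LAKE_LOCK_PATH = "lake_data/.lake.lock"  # hypothetical lock-file location

def write_with_lock(write_batch, max_tries=5):
    """Try to write a batch to the lake; on contention, wait and retry."""
    lock = FileLock(LAKE_LOCK_PATH)
    for attempt in range(max_tries):
        try:
            with lock.acquire(timeout=10):  # block up to 10s for the lock
                write_batch()               # do the actual lake write
                return
        except Timeout:
            # another writer holds the lock; back off, then try again
            time.sleep(2 ** attempt)
    raise RuntimeError("could not acquire lake lock; giving up")
```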
Approach 2: allow >1 writers.
- We could have >1 lake processes / threads, predictoor bots, or other apps.
- Support three flows:
  - Flow 1: quickstart: start `pdr lake` inside the app. Eg user starts one `pdr predictoor` process (and nothing else). The predictoor bot will detect whether a lake process is running, and start one if needed. (A minimal sketch of this check is below.)
  - Flow 2: power-predictoor usage: start `pdr lake` separately. Eg a user starts `pdr lake`, then 20 `pdr predictoor` processes, one for each feed to predict.
  - Flow 3: power-lake usage: >1 lake processes / threads filling complementary parts of lake (different pairs, different subgraph queries). Eg user starts 1 `pdr lake` process, and it starts >1 threads. Eg user starts 1 process, then later another one with different goals. Eg >1 users start different processes.
- Benefits: (a) more convenient: users don't need to kick off the lake process themselves. (b) faster: because of parallel fill. (c) more flexible: users (or predictoor bots) can start more lake processes without worry.
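For Flow 1, here's a minimal sketch of the detect-and-start check, assuming the lake process holds a long-lived "liveness" lock file while it runs; the lock path and the subprocess invocation of `pdr lake` are illustrative assumptions, not the implemented behavior.

```python
# Sketch: bot-side check for a running lake process. Assumes a lake
# process holds lake_data/.lake.alive for as long as it is running.
import subprocess
from filelock import FileLock, Timeout

def ensure_lake_running(alive_path="lake_data/.lake.alive"):
    """If no lake process holds the liveness lock, spawn `pdr lake`."""
    probe = FileLock(alive_path)
    try:
        probe.acquire(timeout=0)  # succeeds only if no lake writer is alive
        probe.release()
        subprocess.Popen(["pdr", "lake"])  # start a background lake process
    except Timeout:
        pass  # a lake process already holds the lock; nothing to do
```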
Approach 2 is the endgame. The benefits compared to Approach 1 are immense, let alone compared to Approach 0.
Q: Should we go 0 -> 1 -> 2, or 0 -> 2 directly?
- I (Trent) recommend going 0 -> 2 directly, because of the big benefits.
- Whereas doing 1 in between would force the user to change behavior. (And extra effort for us overall: much of the code we'd write for 1 would be thrown away for 2.)
TODOs
- [ ] Locking core: Update lake to support "locking" concept. Such that I could run >1 different pdr lake processes against the same feed, and they wouldn't fight with each other
- [ ] Parallel fill: Update lake to run >1 threads within a single `pdr lake` process, 1 thread per ohlcv pair or subgraph feed (sketched after this list)
- [ ] Update predictoor bot: detect whether a lake process is running, and start one if needed.
- [ ] Similarly, update xpmt_engine (née sim_engine) flow
- [ ] Similarly, update analytics apps flows
- [ ] Ensure READMEs are all updated accordingly. predictoor.md and trader.md should teach the user how to run `pdr lake` separately (at the end of the README)
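A minimal sketch of the parallel-fill TODO, assuming one thread per OHLCV pair; `fetch_ohlcv` and `append_to_lake` are hypothetical stubs standing in for the real fetch/write logic, and real writes would additionally go through the locking layer sketched above.

```python
# Sketch: fill the lake in parallel, one thread per pair.
from concurrent.futures import ThreadPoolExecutor

PAIRS = ["BTC/USDT", "ETH/USDT", "ADA/USDT"]  # example feeds

def fetch_ohlcv(pair: str) -> list:
    return []  # hypothetical: fetch candles for one pair from the exchange

def append_to_lake(pair: str, rows: list) -> None:
    pass       # hypothetical: append rows to that pair's lake table

def fill_pair(pair: str) -> None:
    append_to_lake(pair, fetch_ohlcv(pair))

with ThreadPoolExecutor(max_workers=len(PAIRS)) as pool:
    pool.map(fill_pair, PAIRS)  # each pair fills on its own thread
```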
I agree in general that all services (including different agents/bots) would benefit from having a lake that's just up-to-date. And having a process that's solely-responsible for doing this is the way forward.
I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.
What I originally considered was building a base table.py object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in table_pdr_predictions and table_pdr_subscriptions. Anything that reads from the lake would do so through the Table() interface, not DataFactory(). This way, DataFactories operate on their own, updating the lake, while components/users access it via the interface.
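A minimal sketch of that Table() idea, assuming parquet-backed tables read via polars; the class shape is illustrative, not the actual table_pdr_predictions code.

```python
# Sketch: readers go through a thin Table interface that knows its
# schema and file location, never through a DataFactory.
import polars as pl

class Table:
    def __init__(self, name: str, schema: dict, lake_dir: str = "lake_data"):
        self.name = name
        self.schema = schema                      # column name -> dtype
        self.path = f"{lake_dir}/{name}.parquet"  # points to the backing file

    def df(self) -> pl.DataFrame:
        """Return the current contents; read-only from the reader's view."""
        return pl.read_parquet(self.path)

# eg, mirroring table_pdr_predictions (columns are illustrative):
predictions = Table("pdr_predictions", {"timestamp": "int64", "payout": "float64"})
```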
DuckDB only lets 1 process at a time hold the db writer connection. Within that process, you can then have multiple threads operating on it. So we should make the duckdb "container/process/vm" as big as possible.
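A minimal sketch of that model: one process owns the writer connection, and each thread works through its own cursor off that connection. The table and inserts are illustrative only.

```python
# Sketch: single DuckDB writer connection, per-thread cursors.
import threading
import duckdb

con = duckdb.connect("lake.duckdb")  # the one writer connection per process
con.execute("CREATE TABLE IF NOT EXISTS ohlcv (pair TEXT, ts BIGINT, close DOUBLE)")

def worker(pair):
    cur = con.cursor()  # per-thread cursor sharing the same database
    cur.execute("INSERT INTO ohlcv VALUES (?, ?, ?)", [pair, 0, 0.0])

threads = [threading.Thread(target=worker, args=(p,)) for p in ["BTC/USDT", "ETH/USDT"]]
for t in threads: t.start()
for t in threads: t.join()
```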
There is now a task for making sure that Lake/ETL has an "update process" that sits there indefinitely looping and updating the lake #1107
Forward Looking:
- We can have requests/queries/etc pushed to a "duckdb writer/service" via a simple API (a minimal sketch is below)
- Fully solving "Approach 2" / multiple writers would entail a clustered db (i.e. ClickHouse)
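A minimal in-process sketch of that writer/service idea: components push (sql, params) requests onto a queue, and a single writer thread applies them to DuckDB. A real service would expose this over an actual API; all names here are hypothetical.

```python
# Sketch: a single-writer service fed by a request queue.
import queue, threading
import duckdb

write_q: "queue.Queue[tuple[str, list]]" = queue.Queue()

def writer_service(db_path="lake.duckdb"):
    con = duckdb.connect(db_path)  # the one-and-only writer connection
    while True:
        sql, params = write_q.get()  # block until a write request arrives
        con.execute(sql, params)

threading.Thread(target=writer_service, daemon=True).start()

# any component can now "write" without holding a db connection:
write_q.put(("INSERT INTO ohlcv VALUES (?, ?, ?)", ["BTC/USDT", 0, 0.0]))
```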
Priorities have mostly shifted away from pdr-backend. So closing less-critical issues.