[Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes
Background / motivation
Approach 0: each app writes to the lake, in its own process. Predictoor bot updates the data lake, then uses the lake, in time for making predictions. Same for other apps in pdr-backend.
Approach 1: separate lake, started separately
- User starts a separate process: `pdr lake`. It's constantly writing to the data lake.
- Predictoor bot reads from the lake, but does not write (for safety). Same for other apps.
- This was the idea when we conceived of lake.
But we can do better yet, leveraging the database concept of "locking" which enables >1 writers without hurting DB safety. Writers must handle contention due to locks, eg by waiting.
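To make "writers handle contention by waiting" concrete, here's a minimal sketch. It assumes a file-based lock via the `filelock` package; `LAKE_LOCK_PATH` and `write_batch` are hypothetical names, not existing pdr-backend APIs.

```python
# Sketch: a lake writer that waits (with backoff) when another writer
# holds the lock. LAKE_LOCK_PATH and write_batch() are hypothetical.
import time
from filelock import FileLock, Timeout

LAKE_LOCK_PATH = "lake_data/.lake.lock"  # hypothetical lock-file location

def write_with_lock(write_batch, max_tries=5):
    """Try to write a batch to the lake; on contention, wait and retry."""
    lock = FileLock(LAKE_LOCK_PATH)
    for attempt in range(max_tries):
        try:
            with lock.acquire(timeout=10):  # block up to 10s for the lock
                write_batch()               # do the actual lake write
                return
        except Timeout:
            # another writer holds the lock; back off, then try again
            time.sleep(2 ** attempt)
    raise RuntimeError("could not acquire lake lock; giving up")
```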
Approach 2: allow >1 writers.
- We could have >1 lake processes / threads, predictoor bots, or other apps.
- Support three flows:
  - Flow 1: quickstart: start `pdr lake` inside the app. Eg user starts one `pdr predictoor` process (and nothing else). The predictoor bot will detect whether a lake process is running, and start one if needed. (A minimal sketch of this check is below.)
  - Flow 2: power-predictoor usage: start `pdr lake` separately. Eg a user starts `pdr lake`, then 20 `pdr predictoor` processes, one for each feed to predict.
  - Flow 3: power-lake usage: >1 lake processes / threads filling complementary parts of lake (different pairs, different subgraph queries). Eg user starts 1 `pdr lake` process, and it starts >1 threads. Eg user starts 1 process, then later another one with different goals. Eg >1 users start different processes.
- Benefits: (a) more convenient: users don't need to kick off the lake process themselves. (b) faster: because of parallel fill. (c) more flexible: users (or predictoor bots) can start more lake processes without worry.
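For Flow 1, here's a minimal sketch of the detect-and-start check, assuming the lake process holds a long-lived "liveness" lock file while it runs; the lock path and the subprocess invocation of `pdr lake` are illustrative assumptions, not the implemented behavior.

```python
# Sketch: bot-side check for a running lake process. Assumes a lake
# process holds lake_data/.lake.alive for as long as it is running.
import subprocess
from filelock import FileLock, Timeout

def ensure_lake_running(alive_path="lake_data/.lake.alive"):
    """If no lake process holds the liveness lock, spawn `pdr lake`."""
    probe = FileLock(alive_path)
    try:
        probe.acquire(timeout=0)  # succeeds only if no lake writer is alive
        probe.release()
        subprocess.Popen(["pdr", "lake"])  # start a background lake process
    except Timeout:
        pass  # a lake process already holds the lock; nothing to do
```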
Approach 2 is the endgame. The benefits compared to Approach 1 are immense, let alone compared to Approach 0.
Q: Should we go 0 -> 1 -> 2, or 0 -> 2 directly?
- I (Trent) recommend going 0 -> 2 directly, because of the big benefits.
- Whereas doing 1 in between would force the user to change behavior. (And extra effort for us overall: much of the code we'd write for 1 would be thrown away for 2.)
TODOs
- [ ] Locking core: Update lake to support "locking" concept. Such that I could run >1 different pdr lake processes against the same feed, and they wouldn't fight with each other
- [ ] Parallel fill: Update lake to run >1 threads within a single `pdr lake` process, 1 thread per ohlcv pair or subgraph feed (sketched after this list)
- [ ] Update predictoor bot: detect whether a lake process is running, and start one if needed.
- [ ] Similarly, update xpmt_engine (née sim_engine) flow
- [ ] Similarly, update analytics apps flows
- [ ] Ensure READMEs are all updated accordingly. predictoor.md and trader.md should teach the user how to run `pdr lake` separately (at the end of the README)
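A minimal sketch of the parallel-fill TODO, assuming one thread per OHLCV pair; `fetch_ohlcv` and `append_to_lake` are hypothetical stubs standing in for the real fetch/write logic, and real writes would additionally go through the locking layer sketched above.

```python
# Sketch: fill the lake in parallel, one thread per pair.
from concurrent.futures import ThreadPoolExecutor

PAIRS = ["BTC/USDT", "ETH/USDT", "ADA/USDT"]  # example feeds

def fetch_ohlcv(pair: str) -> list:
    return []  # hypothetical: fetch candles for one pair from the exchange

def append_to_lake(pair: str, rows: list) -> None:
    pass       # hypothetical: append rows to that pair's lake table

def fill_pair(pair: str) -> None:
    append_to_lake(pair, fetch_ohlcv(pair))

with ThreadPoolExecutor(max_workers=len(PAIRS)) as pool:
    pool.map(fill_pair, PAIRS)  # each pair fills on its own thread
```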
I agree in general that all services (including different agents/bots) would benefit from having a lake that's just up-to-date. And having a process that's solely-responsible for doing this is the way forward.
I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.
What I originally considered was building a base table.py object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in table_pdr_predictions and table_pdr_subscriptions. Anything that reads from the lake would do so through the Table() interface, not DataFactory(). This way, DataFactories operate on their own, updating the lake, while components/users access it via the interface.
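A minimal sketch of that Table() idea, assuming parquet-backed tables read via polars; the class shape is illustrative, not the actual table_pdr_predictions code.

```python
# Sketch: readers go through a thin Table interface that knows its
# schema and file location, never through a DataFactory.
import polars as pl

class Table:
    def __init__(self, name: str, schema: dict, lake_dir: str = "lake_data"):
        self.name = name
        self.schema = schema                      # column name -> dtype
        self.path = f"{lake_dir}/{name}.parquet"  # points to the backing file

    def df(self) -> pl.DataFrame:
        """Return the current contents; read-only from the reader's view."""
        return pl.read_parquet(self.path)

# eg, mirroring table_pdr_predictions (columns are illustrative):
predictions = Table("pdr_predictions", {"timestamp": "int64", "payout": "float64"})
```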
DuckDB only lets 1 process at a time hold the db writer connection. Within that process, you can then have multiple threads operating on it. So we should make the duckdb "container/process/vm" as big as possible.
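A minimal sketch of that model: one process owns the writer connection, and each thread works through its own cursor off that connection. The table and inserts are illustrative only.

```python
# Sketch: single DuckDB writer connection, per-thread cursors.
import threading
import duckdb

con = duckdb.connect("lake.duckdb")  # the one writer connection per process
con.execute("CREATE TABLE IF NOT EXISTS ohlcv (pair TEXT, ts BIGINT, close DOUBLE)")

def worker(pair):
    cur = con.cursor()  # per-thread cursor sharing the same database
    cur.execute("INSERT INTO ohlcv VALUES (?, ?, ?)", [pair, 0, 0.0])

threads = [threading.Thread(target=worker, args=(p,)) for p in ["BTC/USDT", "ETH/USDT"]]
for t in threads: t.start()
for t in threads: t.join()
```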
There is now a task for making sure that Lake/ETL has an "update process" that sits there indefinitely looping and updating the lake #1107
Forward Looking:
- We can have requests/queries/etc pushed to a "duckdb writer/service" via a simple API (a minimal sketch is below)
- Fully solving "Approach 2" / multiple writers would entail a clustered db (i.e. ClickHouse)
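A minimal in-process sketch of that writer/service idea: components push (sql, params) requests onto a queue, and a single writer thread applies them to DuckDB. A real service would expose this over an actual API; all names here are hypothetical.

```python
# Sketch: a single-writer service fed by a request queue.
import queue, threading
import duckdb

write_q: "queue.Queue[tuple[str, list]]" = queue.Queue()

def writer_service(db_path="lake.duckdb"):
    con = duckdb.connect(db_path)  # the one-and-only writer connection
    while True:
        sql, params = write_q.get()  # block until a write request arrives
        con.execute(sql, params)

threading.Thread(target=writer_service, daemon=True).start()

# any component can now "write" without holding a db connection:
write_q.put(("INSERT INTO ohlcv VALUES (?, ?, ?)", ["BTC/USDT", 0, 0.0]))
```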
Priorities have mostly shifted away from pdr-backend. So closing less-critical issues.