Start `piker.storage` subsys: cross-(ts)db middlewares
Launch pad for work towards the task list in #485 🏄🏼
As a start this introduces a new `piker.storage` subsystem to provide database related middleware(s) as well as a new storage backend using `polars` and apache parquet files to implement a built-in, local-filesystem managed "time series database": `nativedb`.
After some extensive tinkering and brief performance measures I'm tempted to go all in on this home grown solution for a variety of reasons (see details in 27932e44) but re-summarizing some of them here:
- wayy faster load times with no "datums to load limit" required.
- smaller space footprint and we haven't even touched compression settings yet!
- wayyy more compatible with other systems which can lever the apache ecosystem.
- gives us finer grained control over the filesystem usage so we can choose to swap out stuff like the replication system or networking access.
- `polars` already has a multi-db compat layer with multi-engine support we can leverage to completely sidestep integration work with multiple standard tsdbs:
  - https://pola-rs.github.io/polars-book/user-guide/io/database/
Core dev discussion
- [ ] we've put some work into `marketstore` support machinery including:
  - docker supervisor subsys, spawning, monitoring (which is still super useful going forward FWIW!).
  - `anyio-marketstore`: an async client written and maintained by our devs.
  - originally using the rt ingest and re-publish over websocket features.

  AND the question is whether we're ok abandoning some of this and/or reimplementing it on our own around the new apache data file/model ecosystems?
- [ ] we can definitely accomplish ingest, pub-sub and replication on our own (without really much effort) with the following existing subsystems and frameworks:
  - ingest: a `tractor` actor which writes to apache arrow (IPC) files and flushes to parquet on size constraints.
  - pub-sub: again with a `tractor` actor and `trio-websocket`.
  - replication: use a modern filesystem (btrfs or zfs) and/or something like `borg` (with its unofficial API client) to accomplish file syncing across many user-hosts.
    - `borg` has a community API: https://github.com/spslater/borgapi
    - https://borgbackup.readthedocs.io/en/stable/quickstart.html#remote-repositories
    - https://github.com/borgbackup/borg
    - https://borgbackup.readthedocs.io/en/stable/faq.html#usage-limitations
    - https://github.com/borgbackup/community
  - other file systems:
    - https://nixos.wiki/wiki/ZFS
- [ ] should we drop all the existing `marketstore` code?
  - it's quite a bit of noise and not going to work anyway given the new implementation changes in the `.data.history` layer.
  - the issues we've reported are not getting resolved and are more or less deal breakers, like https://github.com/pikers/piker/issues/443
  - likely the new `arcticdb` is a better solution in the longer run than mkts was anyway, given its large insti usage..?
ToDo:
- [x] CHERRY from #519:
  - 641726b
  - b5d66b3
  - cc7fe6f3
  - d6f3fd9a
  - b6b53f71
  - b71c508b
  - a295312a
  - 4ae6c367
  - f5c9659a
  - fcd45766
- [ ] CHERRY from #528
  - 38b10fa3
  - 281cfcc6
  - 7802febd: backfill gaps with pre-gap close
- [ ] outstanding obvious regression due to this patch set :joy:
  - [ ] on ib stocks the slow-to-fast chart projection region is now way off:
    => pretty sure this is fixed now after reworking the gap filling logic inside `.data.history.start_backfill()`
- [ ] drop `marketstore` code in general depending on outcome of above discussion.
  - [ ] drop `.storage.marketstore` and the `anyio-marketstore` dep?
  - [ ] wipe the supervisor code using the `.service._ahab` layer?
  - [ ] clean up remaining unused and now-commented code from `.data.history`!
- from https://github.com/pikers/piker/issues/485
  - [ ] make `.storage` with subpkgs for backends and an API / mgmt layer
- outstanding tsdb bugs:
  - #436
  - #323
- docs on new filesystem layout and config options:
  - [ ] `nativedb/` dir
  - [ ] add a `[storage]` section to `conf.toml`:

    ```toml
    [storage]
    datadir = 'nativedb'
    fspdir = 'fsp'
    ohlcvdir = 'ohlcv'
    shm_buffer_size = '80Mb'
    parquet_compression = 'snappy'
    parquet_lib = 'fastparquet'
    replication_backend = 'borg'
    # these hosts would be looked up in the network section and
    # contacted appropriately based on IPC info from there?
    replication_dsts = ['hostname1', 'hostname2']
    ```
- from #312 we need chart-UI integration for a buncha stuff:
  - [ ] main thing to get done would be a context-menu `reload history` for a highlighted section or gap B)
- [ ] `.storage.cli` refinement:
  - [ ] (#313) document that `--tsdb` is no longer needed, since with the `nativedb` backend we don't need to offer optional docker activation!
  - [ ] tidy up and formalize the set of `piker store` cmds
  - [ ] make the `anal` subcmd do gap detection and discrepancy reporting (at the least) against market-venue known operating hours.
- [ ] new `nativedb` backend implemented with `polars` + apache parquet files B)
  - [x] since we're already moving to use `typer` in #489, let's also add confirmation support for the new `pikerd storage -d` flag:
    - added and used in the new `.storage.cli`!
    - [ ] do confirms for deletes? https://typer.tiangolo.com/tutorial/prompt/#confirm
- [ ] gap backfilling (as detailed in https://github.com/pikers/piker/pull/486/commits/f45b76ed77eafdf44871d3e3305f7dc18e9de938) still requires some work for full functionality including:
  - [ ] UI needs a cross-actor event in the history chart's update loop to ensure we do a forced graphics data formatter update when gap-backfilling is complete.
- [x] rt ingest and fast parquet update deferred to #536
- [ ] currently we aren't storing rt data (received during a data session but not previously written to storage) on teardown..
  - consider writing the arrow IPC files and then flushing to dfs and then parquet at some frequency / on teardown?
- [ ] related to the above, what about FSP ingest and storage?
  - [ ] https://github.com/pikers/piker/issues/314 probably should be re-created but for `nativedb`, with a new writeup around arrow IPC and feather formats?
- [ ] (likely as follow up) use the lazy `polars` API to do larger-than-mem processing both for charting and remote (host) processing:
  - from the guide:
    - https://pola-rs.github.io/polars-book/user-guide/lazy/using/
    - https://pola-rs.github.io/polars-book/user-guide/concepts/lazy-vs-eager/
    - https://pola-rs.github.io/polars-book/user-guide/io/parquet/#scan
  - from API docs:
    - https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html
    - https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html
    - https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_ipc.html
- [ ] use `polars` to do price series anomaly repairs, such as those caused by stock splits, or for handling bugs in data providers where a ticker name was repurposed for a new asset and the price history has a mega gap:
- [ ] deciding on file organization, naming schema, subdirs for piker subsystems, etc.
  - [ ] should we store multiple files segmented by some time period and then simply use the multiple files reader support: https://pola-rs.github.io/polars-book/user-guide/io/multiple/
  - [ ] current file naming scheme is `mnq.cme.20230616.ib.ohlcv1s.parquet` but we can probably change the meta-data token part `ohlcv1s` to be more parse-able and readable?
    - put a `.` in: `ohlcv.1s.<otherinfo>`?
    - what do we do for fsp stuff, at least a `.config/piker/nativedb/fsp/` subdir?
  - [ ] what is writing deltas and can we use it?
    - https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_delta.html#polars.DataFrame.write_delta
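Either way the current naming scheme stays trivially parse-able with pure stdlib (helper name and key names below are made up for illustration):

```python
def parse_ts_fname(fname: str) -> dict[str, str]:
    # current scheme: <symbol>.<venue>.<date>.<broker>.<kind>.parquet
    # (splitting <kind> into 'ohlcv.1s.<otherinfo>' would just add
    # tokens to this unpack)
    sym, venue, date, broker, kind, ext = fname.split(".")
    return {
        "symbol": sym,
        "venue": venue,
        "date": date,
        "broker": broker,
        "kind": kind,
        "ext": ext,
    }
```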
- We can convert this to a draft if necessary if/when #483 lands
I'm in favor of doing our own solution and I would rather stop maintaining any marketstore related code; in the end we were gonna spend almost as much work maintaining marketstore as just doing our own thing, right.
yup totally agree!
ok then i'll be putting up some finishing functionality touches, hopefully tests, and then dropping all that junk 🏄🏼
To give an idea of what the parquet subdir looks like now: much like how marketstore laid out its own internal per-table binary format files, except using less space and actually being a file type that data people can use 😂