Quote ingest using Apache stack: Arrow / Parquet
As a follow-up to #486, it'd sure be nice to be able to move away
from our current `multiprocessing.shared_memory` approach for
real-time quote/tick ingest and possibly leverage an Apache
standard format such as Arrow or Parquet.
As part of improving the `.parquet` file based tsdb IO from #486,
it'd obviously be ideal to support df appends instead of only full
overwrites :joy:.
ToDo content from #486
These items pertain to the `StorageClient.write_ohlcv()` write path
on backfills and rt ingest. Right now the write is masked out, mostly
because there are some details to work out on when/how frequently the
writes to parquet files should happen, particularly whether to
"append" to parquet files. It turns out there are options for
appending (faster than overwriting, presumably) to parquet,
particularly using `fastparquet`; see the resources below:
- [ ] for python we can likely use: https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
  - also note the `times` option with the `int96` format which embeds nanoseconds B)
  - the `custom_metadata: dict` can only be used on overwrite :eyes:
    - can use https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.update_file_custom_metadata to update metadata if needed?
- [ ] https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file
- [ ] https://stackoverflow.com/questions/47191675/pandas-write-dataframe-to-parquet-format-with-append/74209756#74209756
- [ ] other langs and spark related:
  - https://issues.apache.org/jira/browse/PARQUET-1022
  - https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html
  - https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file/42140475#42140475