Quote ingest using Apache stack: Arrow / Parquet
As a follow-up to #486, it'd sure be nice to be able to move away
from our current `multiprocessing.shared_memory` approach for
real-time quote/tick ingest and possibly leverage an Apache
standard format such as Arrow or Parquet.
As part of improving the `.parquet` file based tsdb IO from #486,
it'd obviously be ideal to support df appends instead of only full
overwrites :joy:.
ToDo content from #486
These items pertain to the `StorageClient.write_ohlcv()` write path
on backfills and rt ingest. Right now the write is masked out, mostly
because there are some details to work out on when/how frequently the
writes to parquet files should happen, particularly whether to
"append" to parquet files. It turns out there are options for
appending (faster than overwriting, presumably) to parquet,
particularly using `fastparquet`; see the resources below:
- [ ] for python we can likely use: https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
  - also note the `times` option with the `int96` format which embeds nanoseconds B)
  - the `custom_metadata: dict` can only be used on overwrite :eyes:
    - can use https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.update_file_custom_metadata to update metadata if needed?
- [ ] https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file
- [ ] https://stackoverflow.com/questions/47191675/pandas-write-dataframe-to-parquet-format-with-append/74209756#74209756
- [ ] other langs and spark related:
  - https://issues.apache.org/jira/browse/PARQUET-1022
  - https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html
  - https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file/42140475#42140475