POC of Version Store with pluggable backend
First implementation of a Version Store with pluggable backends. POC in S3 with read, write, soft delete & snapshots using the existing VersionStore chunking and serialisation mechanisms. Append and hard deletes are not implemented.
This implementation stands alone and has no effect on existing functionality. It duplicates a lot of code from the existing functionality and has limited error checking and cleanup. This PR is mostly for discussion purposes at this point.
@pablojim awesome - I'll take a look this week!
General implementation notes:
- Uses forward pointers everywhere. Versions point to segments, Snapshots point to Versions
- For version documents, native S3 versioning is used. Snapshotting is just asking S3 for the latest version ID of every version doc (sketched below).
- The VersionStore has knowledge of the backing store, while the serialisation classes remain stateless and are handed a backing store for every operation
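To make the snapshot idea concrete, here is a minimal sketch of capturing "the latest version of every version doc" with boto3 and native S3 object versioning. The bucket name and the `_versions/` key layout are illustrative assumptions, not the layout used in this PR:

```python
# Hedged sketch: a snapshot is just {version-doc key -> latest S3 VersionId}.
# Bucket name and the "_versions/" key layout are assumptions for illustration.
import boto3

s3 = boto3.client('s3')
BUCKET = 'arctic-poc-bucket'                   # assumed bucket with versioning enabled
VERSION_DOC_PREFIX = 'my_library/_versions/'   # assumed key layout for version docs

def take_snapshot():
    """Return a mapping of version-document key -> latest S3 version id."""
    snapshot = {}
    paginator = s3.get_paginator('list_object_versions')
    for page in paginator.paginate(Bucket=BUCKET, Prefix=VERSION_DOC_PREFIX):
        for version in page.get('Versions', []):
            if version['IsLatest']:
                snapshot[version['Key']] = version['VersionId']
    return snapshot

def read_version_doc(key, version_id):
    """Read a version document exactly as it was when the snapshot was taken."""
    obj = s3.get_object(Bucket=BUCKET, Key=key, VersionId=version_id)
    return obj['Body'].read()
```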
Random thoughts/possibilities for improvements:
- Add an abstract VersionStore base class (see the sketch after this list)
- Implement a backward compatible version of the Mongo VersionStore using this abstraction
- Allow kwargs to be passed from all reads and writes for backend-specific customisation, e.g. reading only certain columns from parquet
- Need to integrate the new VersionStore with Arctic and libraries - e.g. tie libraries to store type and some specific configuration
- Make use of the S3 metadata functionality - especially when writing segments, store metadata describing how each one was serialised
- Switch from BSON for the version document serialisation - maybe YAML? Or JSON if we add some date handling.
- Can we achieve chunk sharing with parquet? So we get fast appends/modifications and lower storage usage. It seems possible but would require deep integration when writing the parquet files.
- Multithread the S3 uploads & downloads?
- Handling of different S3 profiles - e.g. multiple S3 endpoints
- Add error checking and verification of S3 writes?
- Add cleanup methods and hard deletes as per existing VersionStore
- Think about fallbacks for parquet serialisation - dataframes in parquet then everything else in pickle?
- Is there any value in hybrid approaches - data on NFS and metadata in S3, Mongo or Oracle? Could use transparent URLs for reading segment data, e.g. s3:// or file://. Configuration would be complex.
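As a talking point for the abstract base class and kwargs pass-through ideas above, a minimal sketch of what the shared interface could look like; all names here (`BackingStore`, `BaseVersionStore`, etc.) are hypothetical and not taken from this PR:

```python
# Hypothetical sketch of an abstract VersionStore with kwargs pass-through;
# none of these names come from the PR itself.
from abc import ABC, abstractmethod

class BackingStore(ABC):
    """Where raw segments and version documents live (S3, Mongo, NFS, ...)."""

    @abstractmethod
    def write_segment(self, library, key, data):
        ...

    @abstractmethod
    def read_segment(self, library, key):
        ...

class BaseVersionStore(ABC):
    """Versioning logic; serialisers stay stateless and are handed the
    backing store on every call, as in the POC."""

    def __init__(self, backing_store):
        self.backing_store = backing_store

    @abstractmethod
    def write(self, symbol, data, metadata=None, **kwargs):
        """**kwargs are forwarded to the serialiser, e.g. parquet row-group size."""

    @abstractmethod
    def read(self, symbol, as_of=None, **kwargs):
        """**kwargs are forwarded to the deserialiser, e.g. columns=['price']."""

    @abstractmethod
    def snapshot(self, snap_name):
        ...

    @abstractmethod
    def delete(self, symbol):
        """Soft delete only; hard deletes would mirror the existing VersionStore."""
```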
Quite a nice bit of work - any comments on performance of the implementations?
@jamesblackburn From some early results for the parquet store: reading some large objects shows dramatic improvements, 3 seconds vs 90 seconds. These are probably worst-case scenarios for arctic. Write performance is not so dramatically affected. I need to test more though.
There would also be large improvements from being able to load partial frames, e.g. only loading selected columns and row groups (sketched below). This may help cases such as #609.
Still some work and implementation decisions to do though.
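To illustrate the partial-frame point, a short sketch of column and row-group selection with pyarrow; the file path and column names are made up, and the POC may plumb this through differently:

```python
# Sketch of partial reads from parquet: only the requested columns and
# row groups are deserialised. Path and column names are made up.
import pyarrow.parquet as pq

path = 'segments/EQUITY.SYM.parquet'   # assumed location of a serialised segment

# Column projection: the other columns are never deserialised.
table = pq.read_table(path, columns=['price', 'volume'])
df = table.to_pandas()

# Row-group selection: read only one slice of a large frame.
pf = pq.ParquetFile(path)
first_chunk = pf.read_row_group(0, columns=['price']).to_pandas()
print(pf.metadata.num_row_groups, len(first_chunk))
```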
@pablojim Shame to let this bitrot, can we discuss later this week with @willdealtry, @shashank88
Yeah, this seems pretty good; I'll go through it tonight. Have fixed the merge conflict. Will see if the tests are fine.
If we don't think this is prod ready or a feature we want to support long term, maybe we can segregate it as an example and make sure our API allows this sort of flexibility.
Apart from some dependencies it is completely isolated from the rest of Arctic. It's all in the "pluggable" package and duplicates some code from the main APIs.
One option would be to merge it but mark it in the code and documentation as Beta until it is deemed ready for wider use.
Will move it to a contrib directory to unblock this PR, without committing to it being used as a production store.