ArcticDB
Enhancement 8277989680: symbol concatenation poc
Reference Issues/PRs
8277989680
What does this implement or fix?
Implements symbol concatenation. Both inner and outer joins over columns are supported. Expected usage:
# Read requests can contain usual as_of, date_range, columns, etc arguments
lazy_dfs = lib.read_batch([read_request_1, read_request_2, ...])
# Potentially apply some processing to all or individual constituent lazy dataframes here, that will be applied before the join
lazy_dfs = lazy_dfs[lazy_dfs["col"].notnull()]
# Join here
lazy_df = adb.concat(lazy_dfs)
# Perform more processing if desired
lazy_df = lazy_df.resample("15min").agg({"col": "mean"})
# Collect result
res = lazy_df.collect()
# res contains a list of VersionedItems from the constituent symbols that went into the join with data=None, and a data member with the joined Series/DataFrame
See test_symbol_concatenation.py for thorough examples of how the API works.
For outer joins, if a column is not present in one of the input symbols, the missing values are backfilled using the same type-specific behaviour as dynamic schema.
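To make the inner/outer distinction and the backfill concrete, here is a minimal pandas-only sketch; it does not call ArcticDB. `outer_concat` and `ASSUMED_FILL` are hypothetical names, and the fill values (0 for integers, NaN for floats, `False` for booleans, `None` for objects) are assumptions chosen for illustration; the real values are whatever dynamic schema uses.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: illustrative fill value per NumPy dtype kind. The real values are
# whatever ArcticDB's dynamic schema uses, which may differ from these.
ASSUMED_FILL = {"i": 0, "u": 0, "f": np.nan, "b": False, "O": None}


def outer_concat(frames):
    """Concatenate rows over the union of columns, backfilling by dtype."""
    # Remember each column's dtype from the first frame that contains it.
    dtypes = {}
    for frame in frames:
        for name, dtype in frame.dtypes.items():
            dtypes.setdefault(name, dtype)

    padded = []
    for frame in frames:
        frame = frame.copy()
        for name, dtype in dtypes.items():
            if name not in frame.columns:
                fill = ASSUMED_FILL[dtype.kind]
                frame[name] = pd.Series([fill] * len(frame), index=frame.index, dtype=dtype)
        padded.append(frame[list(dtypes)])
    return pd.concat(padded)


a = pd.DataFrame({"x": [1, 2], "y": [1.5, 2.5]})
b = pd.DataFrame({"x": [3], "z": [7]})

inner = pd.concat([a, b], join="inner")  # only the shared column "x" survives
outer = outer_concat([a, b])             # "x", "y" and "z" all survive
```

Note that plain `pd.concat` with `join="outer"` would fill every gap with NaN and promote integer columns to float; the sketch pads each frame first so that integer columns keep their dtype.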
Not all symbols can be concatenated together. Attempting to concatenate any of the following will throw an exception:
- a Series with a DataFrame
- Different index types, including multiindexes with different numbers of levels
- Incompatible column types, e.g. if `col` has type `INT64` in one symbol and is a string column in another symbol. This only applies if the column would be in the result, which is always the case for all columns with an outer join, but may not be for inner joins.
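The rejection rules above can be sketched as a standalone pandas check. This is not the PR's implementation: `check_concat_compatible` and `_family` are hypothetical helpers, and the notion of "compatible" used here (all numeric dtypes are interchangeable, everything else must share a family) is a simplifying assumption.

```python
import pandas as pd


def _family(dtype):
    # ASSUMPTION: numeric dtypes are mutually compatible; everything else
    # must share a family. The real compatibility rules may be finer-grained.
    if pd.api.types.is_numeric_dtype(dtype):
        return "numeric"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "datetime"
    return "other"


def check_concat_compatible(objs, join="outer"):
    """Raise ValueError for inputs that the rules above reject."""
    first = objs[0]
    for other in objs[1:]:
        if isinstance(first, pd.Series) != isinstance(other, pd.Series):
            raise ValueError("cannot concatenate a Series with a DataFrame")
        if first.index.nlevels != other.index.nlevels:
            raise ValueError("indexes have different numbers of levels")
        if type(first.index) is not type(other.index):
            raise ValueError("indexes have different types")
    if isinstance(first, pd.Series):
        return
    column_sets = [set(obj.columns) for obj in objs]
    # Only columns that would appear in the result need compatible types.
    if join == "outer":
        result_columns = set.union(*column_sets)
    else:
        result_columns = set.intersection(*column_sets)
    for name in sorted(result_columns):
        families = {_family(obj[name].dtype) for obj in objs if name in obj.columns}
        if len(families) > 1:
            raise ValueError(f"column {name!r} has incompatible types")
```

With three frames where a column `c` is an integer in the first, a string in the second, and absent from the third, this check passes for `join="inner"` (because `c` is not in the result) and raises for `join="outer"`.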
Where possible, the implementation is permissive about what can be joined, and produces as sensible an output as possible:
- Joining two or more Series with different names that are otherwise compatible will produce a Series with no name
- Joining two or more timeseries where the indexes have different names will produce a timeseries with an unnamed index
- Joining two or more timeseries where the indexes have different timezones will produce a timeseries with a UTC index
- Joining two or more multiindexed Series/DataFrames where the levels have compatible types but different names will produce a multiindexed Series/DataFrame with unnamed levels where they differed between some of the inputs.
- Joining two or more Series/DataFrames that all have `RangeIndex`: if the index `step` does not match between all of the inputs, then the output will have a `RangeIndex` with `start=0` and `step=1`. This is different behaviour to Pandas, which converts to an Int64 index in this case. For this reason, a warning is logged when this happens.
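A few of the permissive rules above can be emulated in plain pandas to show the intended results. The helper names `common_name`, `to_common_timezone` and `concat_range_indexed` are hypothetical and are not part of ArcticDB; the sketch only mirrors the described outcomes, and it assumes that all timeseries indexes are timezone-aware.

```python
import pandas as pd


def common_name(names):
    """Return the shared name, or None when the inputs disagree."""
    unique = set(names)
    return unique.pop() if len(unique) == 1 else None


def to_common_timezone(indexes):
    """Convert tz-aware indexes to UTC when their timezones differ."""
    if len({str(index.tz) for index in indexes}) > 1:
        return [index.tz_convert("UTC") for index in indexes]
    return list(indexes)


def concat_range_indexed(objs):
    """Concatenate RangeIndex inputs; mismatched steps give start=0, step=1."""
    result = pd.concat(objs)
    # Matching steps are left to pandas here; the description above does not
    # say what the real output index is in that case.
    if len({obj.index.step for obj in objs}) > 1:
        result.index = pd.RangeIndex(start=0, stop=len(result), step=1)
    return result


a = pd.Series([1, 2, 3], index=pd.RangeIndex(start=0, stop=3, step=1), name="a")
b = pd.Series([4, 5, 6], index=pd.RangeIndex(start=0, stop=6, step=2), name="b")

pandas_result = pd.concat([a, b])             # integer index, not a RangeIndex
sketch_result = concat_range_indexed([a, b])  # RangeIndex with start=0, step=1
sketch_result.name = common_name([a.name, b.name])  # names differ, so None
```

The contrast between `pandas_result` and `sketch_result` is the behavioural difference that the logged warning is meant to flag.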
The only known major limitation is that all of the symbols being joined together (after any pre-join processing) must fit into memory. Relaxing this constraint would require much more sophisticated query planning than we currently support, in which the pre-join clauses for individual symbols, the join itself, and any post-join clauses are all taken into account when scheduling both IO and individual processing tasks.