ArcticDB Enhancement 8277989680: symbol concatenation poc

Open alexowens90 opened this issue 10 months ago • 1 comments

Reference Issues/PRs

8277989680

What does this implement or fix?

Implements symbol concatenation. Both inner and outer joins over columns are supported. Expected usage:

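import arcticdb as adb

# Read requests can contain the usual as_of, date_range, columns, etc. arguments
# lazy=True is assumed here so that read_batch returns lazy dataframes instead of reading eagerly
lazy_dfs = lib.read_batch([read_request_1, read_request_2, ...], lazy=True)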
# Potentially apply some processing to all or individual constituent lazy dataframes here, that will be applied before the join
lazy_dfs = lazy_dfs[lazy_dfs["col"].notnull()]
# Join here
lazy_df = adb.concat(lazy_dfs)
# Perform more processing if desired
lazy_df = lazy_df.resample("15min").agg({"col": "mean"})
# Collect result
res = lazy_df.collect()
# res contains a list of VersionedItems for the constituent symbols that went into the join (each with data=None), and a data member holding the joined Series/DataFrame

See test_symbol_concatenation.py for thorough examples of how the API works. For outer joins, if a column is not present in one of the input symbols, the missing values are backfilled using the same type-specific behaviour as dynamic schema. Not all symbols can be concatenated together. Attempting any of the following concatenations will throw an exception:

A Series with a DataFrame
Different index types, including multiindexes with different numbers of levels
Incompatible column types, e.g. if col has type INT64 in one symbol and is a string column in another. This only applies if the column would appear in the result, which is always the case for every column in an outer join, but not necessarily in an inner join.
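
As a rough illustration of what inner and outer joins over columns mean, here is a pandas-only sketch. It does not call ArcticDB. The frames df_a and df_b and their column names are made up for the example, and pandas fills missing outer-join values with NaN, which may differ from the type-specific dynamic schema backfill described above.

```python
import pandas as pd

# Two hypothetical inputs that share "price" but differ in their other column.
df_a = pd.DataFrame(
    {"price": [1.0, 2.0], "volume": [10, 20]},
    index=pd.date_range("2025-01-01", periods=2, freq="D"),
)
df_b = pd.DataFrame(
    {"price": [3.0], "venue": ["X"]},
    index=pd.date_range("2025-01-03", periods=1, freq="D"),
)

# Inner join over columns: only columns present in every input survive.
inner = pd.concat([df_a, df_b], join="inner")

# Outer join over columns: every column survives, gaps are backfilled
# (with NaN in pandas; ArcticDB's backfill is type-specific and may differ).
outer = pd.concat([df_a, df_b], join="outer")

print(list(inner.columns))
print(list(outer.columns))
print(outer)
```

This also shows why the column type check depends on the join: a column that exists in only one input never reaches an inner join result, so it cannot conflict there.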

Where possible, the implementation is permissive about what can be joined, and produces the most sensible output it can:

Joining two or more Series with different names that are otherwise compatible will produce a Series with no name
Joining two or more timeseries where the indexes have different names will produce a timeseries with an unnamed index
Joining two or more timeseries where the indexes have different timezones will produce a timeseries with a UTC index
Joining two or more multiindexed Series/DataFrames where the levels have compatible types but different names will produce a multiindexed Series/DataFrame with unnamed levels where they differed between some of the inputs.
Joining two or more Series/DataFrames that all have a RangeIndex will produce a Series/DataFrame with a RangeIndex. If the index step does not match across all of the inputs, the output will have a RangeIndex with start=0 and step=1. This differs from Pandas, which converts to an Int64 index in this case, so a warning is logged when it happens.
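
For comparison, the pandas side of some of these rules can be checked directly. The sketch below is pandas-only. The variable described_result merely emulates the RangeIndex output described above by resetting the index, and the explicit tz_convert calls emulate the UTC rule; neither is the ArcticDB implementation, and all names are made up for the example.

```python
import pandas as pd

# Series with different names and RangeIndexes with different steps.
s1 = pd.Series([1, 2, 3], name="a")  # RangeIndex(start=0, stop=3, step=1)
s2 = pd.Series([4, 5, 6], name="b", index=pd.RangeIndex(start=0, stop=6, step=2))

pandas_result = pd.concat([s1, s2])
# pandas drops the name when the inputs disagree, matching the rule above.
print(pandas_result.name)
# pandas falls back to an integer index when the steps do not line up.
print(type(pandas_result.index).__name__, list(pandas_result.index))

# Emulation of the described output: RangeIndex with start=0 and step=1.
described_result = pandas_result.reset_index(drop=True)
print(described_result.index)

# Emulation of the timezone rule: convert each input to UTC before joining.
ny = pd.Series([1, 2], index=pd.date_range("2025-01-01 09:00", periods=2, freq="D", tz="America/New_York"))
ldn = pd.Series([3, 4], index=pd.date_range("2025-01-01 15:00", periods=2, freq="D", tz="Europe/London"))
utc_joined = pd.concat([ny.tz_convert("UTC"), ldn.tz_convert("UTC")])
print(utc_joined.index.tz)
```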

The only known major limitation is that all of the symbols being joined together (after any pre-join processing) must fit into memory. Relaxing this constraint would require much more sophisticated query planning than we currently support, in which the pre-join clauses for individual symbols, the join itself, and any post-join clauses are all taken into account when scheduling both IO and individual processing tasks.

Jan 27 '25 10:01 alexowens90
