butterfree
Pivot missing categories breaks FeatureSet/AggregatedFeatureSet
Summary
When defining a feature set, the pivot operation is expected to produce every category, so that the resulting Source dataframe has all the columns the feature set transforms. When a category is missing, FeatureSet and AggregatedFeatureSet break.
Feature related:
Age: legacy
Estimated cost: investigation_needed
Type: documentation, coding and testing.
Description :clipboard:
If a pivot transformation is defined in a reader, it's straightforward to declare the expected categories as features when instantiating FeatureSet or AggregatedFeatureSet. If, for some reason, not all categories are present in the resulting Source dataframe (this can happen with a smaller time window, for instance), the feature set breaks because an expected column cannot be found.
To illustrate what's happening, suppose the Source produces the following dataframe:
+---+---+-------+------+----+-----+
| id| ts|balcony|fridge|oven| pool|
+---+---+-------+------+----+-----+
| 1| 1| null| false|true|false|
| 2| 2| false| null|null| null|
| 1| 3| null| null|null| null|
| 1| 4| null| null|null| true|
| 1| 5| true| null|null| null|
+---+---+-------+------+----+-----+
As a result, a possible AggregatedFeatureSet could be:
from pyspark.sql import functions

from butterfree.constants import DataType
from butterfree.transform import AggregatedFeatureSet
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature
from butterfree.transform.transformations import AggregatedTransform
from butterfree.transform.utils import Function

# One feature per pivoted category; each reads from the column that pivot creates.
aggregated_feature_set = AggregatedFeatureSet(
    name="example_agg_feature_set",
    entity="entity",
    description="Just a single example.",
    keys=[
        KeyFeature(
            name="id",
            description="House id.",
            dtype=DataType.BIGINT,
        )
    ],
    timestamp=TimestampFeature(from_column="ts"),
    features=[
        Feature(
            name="balcony_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="balcony",
        ),
        Feature(
            name="fridge_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="fridge",
        ),
        Feature(
            name="oven_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="oven",
        ),
        # Depends on the "pool" column, which only exists if pivot saw that category.
        Feature(
            name="pool_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="pool",
        ),
    ],
)
Now, if we take a different time window in which, for some reason, there is no information about the pool amenity, the resulting Source dataframe would look like this:
+---+---+-------+------+----+
| id| ts|balcony|fridge|oven|
+---+---+-------+------+----+
| 1| 6| null| false|true|
| 2| 7| false| null|null|
| 1| 8| null| null|null|
| 1| 9| null| null|null|
+---+---+-------+------+----+
Therefore, the pool_amenity feature would break, since there's no pool column anymore.
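The mismatch can be spotted before the feature set is constructed by comparing the columns the features read from with the columns the dataframe actually has. The snippet below is a minimal sketch in plain Python, not part of butterfree: `missing_source_columns` is a hypothetical helper, and the column lists are copied by hand from the two dataframes above (with a real Spark dataframe they would come from `df.columns`).

```python
from typing import Iterable, List


def missing_source_columns(
    available_columns: Iterable[str], required_columns: Iterable[str]
) -> List[str]:
    """Return the required columns that are absent, in the order they were required.

    Hypothetical helper for illustration only; it is not a butterfree API.
    """
    available = set(available_columns)
    return [column for column in required_columns if column not in available]


# Columns of the two Source dataframes shown above.
first_window_columns = ["id", "ts", "balcony", "fridge", "oven", "pool"]
second_window_columns = ["id", "ts", "balcony", "fridge", "oven"]

# Values passed as from_column in the AggregatedFeatureSet example.
required = ["balcony", "fridge", "oven", "pool"]

print(missing_source_columns(first_window_columns, required))   # prints []
print(missing_source_columns(second_window_columns, required))  # prints ['pool']
```

Running a check like this on each incremental load makes the failure explicit: the second window lacks exactly the column that pool_amenity depends on.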
Impact :bomb:
We won't be able to use the pivot operation for incremental loads, since we can't be sure that every category will be present in every load.
Solution Hints :shipit:
We could add a parameter that marks a given feature as optional. The expected behavior would then be: if the column this feature depends on exists, we perform the transformations; otherwise, we simply treat the feature as null (and could raise a warning in these cases).
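The proposed behavior could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not butterfree code: `FeatureSpec`, its `optional` flag, and `plan_features` are hypothetical names invented here. The sketch only decides, per feature, whether to transform its source column or fall back to null; a missing column for a feature that is not marked optional still raises, so real errors stay visible.

```python
import warnings
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for a feature definition (not a butterfree class)."""

    name: str
    from_column: str
    optional: bool = False  # assumed new parameter from the solution hint


def plan_features(
    available_columns: Iterable[str], specs: Iterable[FeatureSpec]
) -> Dict[str, Optional[str]]:
    """Map each feature name to its source column, or to None if it must be null."""
    available = set(available_columns)
    plan: Dict[str, Optional[str]] = {}
    for spec in specs:
        if spec.from_column in available:
            # Column exists: the transformation can run as usual.
            plan[spec.name] = spec.from_column
        elif spec.optional:
            # Column is missing but the feature is optional: warn and emit null.
            warnings.warn(
                f"Column '{spec.from_column}' not found; "
                f"feature '{spec.name}' will be null."
            )
            plan[spec.name] = None
        else:
            # Column is missing and the feature is required: fail loudly.
            raise KeyError(
                f"Column '{spec.from_column}' required by feature "
                f"'{spec.name}' was not found."
            )
    return plan


specs = [
    FeatureSpec("balcony_amenity", "balcony"),
    FeatureSpec("pool_amenity", "pool", optional=True),
]
# Columns of the second time window, where the pool category is absent.
plan = plan_features(["id", "ts", "balcony", "fridge", "oven"], specs)
print(plan)
```

In a real implementation, the `None` entries would presumably become typed null columns in the output dataframe so that the feature set schema stays stable across loads.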
Observations :thinking:
We should take care, when implementing this solution, to avoid hiding errors.