Pivot missing categories breaks FeatureSet/AggregatedFeatureSet

Open AlvaroMarquesAndrade opened this issue 3 years ago • 0 comments

Summary

When defining a feature set, we assume the pivot will produce every category, so that the resulting Source dataframe has all the columns the feature set transforms. When a category is missing, FeatureSet and AggregatedFeatureSet break.

Feature related:

Age: legacy

Estimated cost: investigation_needed

Type: documentation, coding and testing.

Description :clipboard:

If a reader defines a pivot transformation, it's straightforward to declare the expected categories as features when instantiating a FeatureSet or AggregatedFeatureSet. However, if some categories are missing from the dataframe produced by the Source (for instance, because a smaller time window is used), the feature set breaks because the expected column is not found.

In order to illustrate what's happening, suppose we have the following resulting dataframe from the Source:

    +---+---+-------+------+----+-----+
    | id| ts|balcony|fridge|oven| pool|
    +---+---+-------+------+----+-----+
    |  1|  1|   null| false|true|false|
    |  2|  2|  false|  null|null| null|
    |  1|  3|   null|  null|null| null|
    |  1|  4|   null|  null|null| true|
    |  1|  5|   true|  null|null| null|
    +---+---+-------+------+----+-----+

As a result, a possible AggregatedFeatureSet could be:

    # Import paths follow butterfree's package layout; adjust if your version differs.
    from butterfree.constants import DataType
    from butterfree.transform.aggregated_feature_set import AggregatedFeatureSet
    from butterfree.transform.features import Feature, KeyFeature, TimestampFeature
    from butterfree.transform.transformations import AggregatedTransform
    from butterfree.transform.utils import Function
    from pyspark.sql import functions

    # One count feature per pivoted category column.
    aggregated_feature_set = AggregatedFeatureSet(
        name="example_agg_feature_set",
        entity="entity",
        description="Just a single example.",
        keys=[
            KeyFeature(
                name="id",
                description="House id.",
                dtype=DataType.BIGINT,
            )
        ],
        timestamp=TimestampFeature(from_column="ts"),
        features=[
            Feature(
                name="balcony_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="balcony",
            ),
            Feature(
                name="fridge_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="fridge",
            ),
            Feature(
                name="oven_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="oven",
            ),
            Feature(
                name="pool_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="pool",
            ),
        ],
    )

Now, if we take a different time window and, for some reason, there is no information regarding the pool amenity, we'd have a resulting Source dataframe like this:

    +---+---+-------+------+----+
    | id| ts|balcony|fridge|oven|
    +---+---+-------+------+----+
    |  1|  6|   null| false|true|
    |  2|  7|  false|  null|null|
    |  1|  8|   null|  null|null|
    |  1|  9|   null|  null|null|
    +---+---+-------+------+----+

Therefore, the pool_amenity feature would break, since there's no pool column anymore.
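
To make the mismatch concrete, here is a minimal plain-Python sketch (no Spark or butterfree involved) that compares the columns each feature reads from with the columns present in the two dataframes above. The helper `features_missing_source_column` and the `feature_to_column` mapping are hypothetical names used only for illustration.

```python
from typing import Dict, List, Sequence


def features_missing_source_column(
    feature_to_column: Dict[str, str], dataframe_columns: Sequence[str]
) -> List[str]:
    """Return the features whose source column is absent from the dataframe."""
    available = set(dataframe_columns)
    return [
        name
        for name, column in feature_to_column.items()
        if column not in available
    ]


# Feature name -> from_column, mirroring the AggregatedFeatureSet above.
feature_to_column = {
    "balcony_amenity": "balcony",
    "fridge_amenity": "fridge",
    "oven_amenity": "oven",
    "pool_amenity": "pool",
}

# Column names of the two example dataframes.
first_window = ["id", "ts", "balcony", "fridge", "oven", "pool"]
second_window = ["id", "ts", "balcony", "fridge", "oven"]

print(features_missing_source_column(feature_to_column, first_window))   # []
print(features_missing_source_column(feature_to_column, second_window))  # ['pool_amenity']
```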

Impact :bomb:

We can't reliably use the pivot operation for incremental loads, since we can't be sure that every category will be present in each load.

Solution Hints :shipit:

We could add a parameter that makes a given feature optional. The expected behavior would then be: if the column this feature depends on exists, we perform the transformations; otherwise, we simply treat the feature as null (and could raise a warning in these cases).
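
A rough sketch of that behavior in plain Python, independent of Spark. Everything here is an assumption for illustration: `FeatureSpec`, its `optional` flag, and `split_features` are hypothetical and do not exist in butterfree. The idea is only to show the decision logic: transform when the column exists, warn and fall back to null when an optional feature's column is missing, and still fail for required features.

```python
import warnings
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for a feature definition (not a butterfree class)."""

    name: str
    from_column: str
    optional: bool = False  # hypothetical flag proposed in this issue


def split_features(
    features: Sequence[FeatureSpec], dataframe_columns: Sequence[str]
) -> Tuple[List[FeatureSpec], List[FeatureSpec]]:
    """Return (features to transform, features to fill with null)."""
    available = set(dataframe_columns)
    to_transform: List[FeatureSpec] = []
    to_fill_with_null: List[FeatureSpec] = []
    for feature in features:
        if feature.from_column in available:
            to_transform.append(feature)
        elif feature.optional:
            # Missing column on an optional feature: warn and fall back to null.
            warnings.warn(
                f"Column '{feature.from_column}' not found; "
                f"feature '{feature.name}' will be null."
            )
            to_fill_with_null.append(feature)
        else:
            # Required features keep failing, so real schema errors stay visible.
            raise ValueError(
                f"Column '{feature.from_column}' required by feature "
                f"'{feature.name}' is missing."
            )
    return to_transform, to_fill_with_null


features = [
    FeatureSpec("balcony_amenity", "balcony"),
    FeatureSpec("pool_amenity", "pool", optional=True),
]
columns = ["id", "ts", "balcony", "fridge", "oven"]  # no "pool" in this window

to_transform, to_fill_with_null = split_features(features, columns)
```

In this sketch features are required by default, so a missing column only degrades to null when the author explicitly opted in for that feature.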

Observations :thinking:

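We should take care, when implementing this solution, to avoid hiding errors: for example, a misspelled from_column on an optional feature would silently yield nulls instead of failing.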

AlvaroMarquesAndrade • Sep 17 '20 14:09