Creating "event_timestamp" in a pandas dataframe leading to error when retrieving historical features

Open tblazina opened this issue 3 years ago • 4 comments

Expected Behavior

Given there is a BigQuery table containing week_start_date_features with the following data types:

Field name                     Type
week_start_date_id             DATETIME
week_start_date_day_of_month   INTEGER
timestamp                      DATETIME

I would expect the following to work:

from datetime import datetime

import pandas as pd
from feast import FeatureStore

fs = FeatureStore(repo_path="./")
feature_service = fs.get_feature_service("week_start_date_features")
entity_df = pd.DataFrame.from_dict(
    {
        "week_start_date_id": ["2018-09-25"],
        "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)],
    }
)
training_df = fs.get_historical_features(
    features=feature_service,
    entity_df=entity_df,
).to_df()

Current Behavior

When the above code is run, I receive the following error: 400 No matching signature for operator <= for argument types: DATETIME, TIMESTAMP. Supported signature: ANY <= ANY

Steps to reproduce

Create a BigQuery table with the columns and data types described above, then run the code snippet above.

Specifications

  • Version: Feast 0.15.1
  • Platform: Google Cloud Platform
  • Subsystem:

Possible Solution

When you create datetime columns in pandas, the values have the scalar type Timestamp. I've tried converting them explicitly with the pandas to_pydatetime() method (https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.to_pydatetime.html) and with pd.to_datetime(), as done in Step 5 of this Feast tutorial: https://github.com/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver_Ranking_Tutorial.ipynb; neither seemed to work. One possible solution would be to allow a data schema to be passed explicitly to get_historical_features, similar to what the pandas to_gbq method supports (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html), though this might be a Google Cloud Platform-only problem, in which case that solution might not be interoperable with AWS.
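
For reference, a rough sketch of the two conversions I tried (neither avoided the error):

from datetime import datetime

import pandas as pd

entity_df = pd.DataFrame(
    {
        "week_start_date_id": ["2018-09-25"],
        "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)],
    }
)

# Attempt 1: convert the pandas Timestamp scalars back to python datetimes.
entity_df["event_timestamp"] = entity_df["event_timestamp"].dt.to_pydatetime()

# Attempt 2: normalize the column via pd.to_datetime, as in Step 5 of the
# driver ranking tutorial linked above.
entity_df["event_timestamp"] = pd.to_datetime(entity_df["event_timestamp"])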

tblazina commented on Nov 17 '21

Maybe one more follow-up. I've tried two more things:

  1. Adding an event_timestamp column with BigQuery datatype DATETIME to the BigQuery table we pull from to create our entity_df. We pull the table using the BigQuery SDK as follows: bigquery.Client().query(some_query).result().to_dataframe(). We then pass this entity_df to Feast's get_historical_features as described above, and receive the same error. We also tried converting the datatype in pandas, as described in the "Possible Solution" above.
  2. Instead of passing an entity_df pandas DataFrame to get_historical_features, passing it a string containing a BigQuery SQL query works: it does not throw the error and returns a DataFrame with the joined features (sketched below).

So it seems we can just use option 2 for now, but the original problem I described still exists.
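
A sketch of option 2 (my_project.my_dataset.entity_table is a placeholder for wherever our entity rows actually live):

from feast import FeatureStore

fs = FeatureStore(repo_path="./")
feature_service = fs.get_feature_service("week_start_date_features")

# Passing a SQL string instead of a DataFrame keeps the join entirely inside
# BigQuery, so both sides of the timestamp comparison share the table's types.
entity_sql = """
    SELECT week_start_date_id, event_timestamp
    FROM `my_project.my_dataset.entity_table`
"""

training_df = fs.get_historical_features(
    features=feature_service,
    entity_df=entity_sql,
).to_df()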

tblazina commented on Nov 18 '21

I think there is also the issue that a BigQuery DATETIME column does not actually represent a specific point in time, as it is not timezone-aware. Arguably, your BigQuery table should use TIMESTAMP, not DATETIME.

To add to this, the reason that to_pydatetime() doesn't help you is that BigQuery uses Parquet to upload a pandas DataFrame, using PyArrow to convert the DataFrame to a Parquet file. The type mapping to Arrow is defined here: https://arrow.apache.org/docs/python/pandas.html#datetime-timestamp-types. As you can see, Python datetime objects map to Arrow timestamp types in order to maintain timezone awareness.
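
You can observe this mapping directly (a sketch; whether it prints ns or us depends on your pandas/pyarrow versions):

from datetime import datetime

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)]})

# pandas stores the python datetime objects in a datetime64[ns] column ...
print(df.dtypes)  # event_timestamp    datetime64[ns]

# ... and PyArrow maps that column to an Arrow timestamp type, which BigQuery
# in turn loads from Parquet as TIMESTAMP rather than DATETIME.
print(pa.Table.from_pandas(df).schema)  # event_timestamp: timestamp[ns]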

BigQuery then conducts its own type mapping on the Parquet file, as defined here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions. Again, to maintain timezone awareness, TIMESTAMP is used rather than DATETIME.

One 'solution' here would be to cast the event_timestamp column to TIMESTAMP (without defining a timezone, which I think defaults to UTC) in all the BigQuery query templates in Feast, but I think this is likely to cause more timezone issues than it solves 😕
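
To illustrate the idea (a sketch only, not Feast's actual query template):

# Sketch only: not Feast's real template. CAST(... AS TIMESTAMP) interprets a
# DATETIME in the default timezone (UTC unless configured otherwise), which is
# exactly the implicit behavior being relied on.
point_in_time_join = """
    SELECT *
    FROM `my_project.my_dataset.week_start_date_features`
    WHERE CAST(timestamp AS TIMESTAMP) <= TIMESTAMP('2021-04-12 10:59:42+00')
"""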

In conclusion, my stance would be that your original BigQuery schema is poorly defined, and I'm not sure there is really a problem in Feast other than the lack of a helpful error message.

judahrand commented on Dec 03 '21

"Instead of passing an entity_df pandas DataFrame to get_historical_features, passing it a string containing a BigQuery SQL query works: it does not throw the error and returns a DataFrame with the joined features."

It makes sense that this would work, as both sides of the join would be DATETIME; however, the point that a DATETIME does not actually represent a point in time still stands, so this is arguably more of a bug than a feature.

judahrand commented on Dec 03 '21

Hi @judahrand, thanks for the tips. I've tried casting the event_timestamp to TIMESTAMP, and also tried defining the column's data type in the BigQuery table as TIMESTAMP, but neither resolves the issue.

tblazina commented on Dec 04 '21

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented on Dec 20 '22