Creating "event_timestamp" in a pandas dataframe leading to error when retrieving historical features
Expected Behavior
Given there is a BigQuery table containing `week_start_date_features` with the following data types:
| Field name | Type |
|---|---|
| week_start_date_id | DATETIME |
| week_start_date_day_of_month | INTEGER |
| timestamp | DATETIME |
I would expect the following to work:
```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

fs = FeatureStore(repo_path="./")
feature_service = fs.get_feature_service("week_start_date_features")

entity_df = pd.DataFrame.from_dict(
    {
        "week_start_date_id": ["2018-09-25"],
        "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)],
    }
)

training_df = fs.get_historical_features(
    features=feature_service,
    entity_df=entity_df,
).to_df()
```
Current Behavior
When the above code is run, I receive the following error:

```
400 No matching signature for operator <= for argument types: DATETIME, TIMESTAMP. Supported signature: ANY <= ANY
```
Steps to reproduce
Create a BigQuery table with the columns and data types described above, then run the code snippet above.
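For completeness, a minimal sketch of creating such a table with the `google-cloud-bigquery` client; the project and dataset names are placeholders, not from the original report:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Schema matching the table described above.
schema = [
    bigquery.SchemaField("week_start_date_id", "DATETIME"),
    bigquery.SchemaField("week_start_date_day_of_month", "INTEGER"),
    bigquery.SchemaField("timestamp", "DATETIME"),
]

# "my_project.my_dataset" is a stand-in for the real project/dataset.
table = bigquery.Table("my_project.my_dataset.week_start_date_features", schema=schema)
client.create_table(table)
```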
Specifications
- Version: Feast 0.15.1
- Platform: Google Cloud Platform
- Subsystem:
Possible Solution
When you create datetime columns in pandas, they have the scalar type `Timestamp`. I've tried converting them explicitly using the `to_pydatetime()` pandas method (https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.to_pydatetime.html) and using the `pd.to_datetime()` method as done in Step 5 of this Feast tutorial: https://github.com/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver_Ranking_Tutorial.ipynb, but neither seemed to work. One possible solution would be to allow a data schema to be explicitly passed to the `get_historical_features` method, similar to what the pandas `to_gbq` method accepts (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html), though this might be a Google Cloud Platform-only problem, in which case that solution might not be interoperable with AWS.
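In code, the two conversions I tried look roughly like this (neither changed the outcome):

```python
import pandas as pd

# entity_df is the DataFrame from the snippet above.

# Attempt 1: convert the pandas Timestamp scalars to native Python datetimes.
# Assigning the result back into the DataFrame likely re-coerces the column
# to datetime64[ns], which may be why this had no effect.
entity_df["event_timestamp"] = entity_df["event_timestamp"].dt.to_pydatetime()

# Attempt 2: re-coerce the column with pd.to_datetime(), as in Step 5 of the
# driver ranking tutorial.
entity_df["event_timestamp"] = pd.to_datetime(entity_df["event_timestamp"])
```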
Maybe one more followup. I've tried two more things:

- Adding an `event_timestamp` column with BigQuery datatype `DATETIME` to the BigQuery table which we pull from to create our `entity_df`. We pull the table using the BigQuery SDK as follows: `bigquery.Client().query(some_query).result().to_dataframe()`. We then pass this `entity_df` to Feast's `get_historical_features` as described above, and receive the same error. We also tried converting the datatype in pandas, as described in the "Possible Solution" above.
- Instead of passing an `entity_df` pandas DataFrame to `get_historical_features`, if we just pass it a string containing a BigQuery SQL query, it works: it does not throw the error and returns a DataFrame with the joined features (sketched below).

So it seems we can just use option 2 for now, but the original problem I described still exists.
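A sketch of option 2, reusing `fs` and `feature_service` from the first snippet; the table name in the query is a placeholder:

```python
# Hand get_historical_features a SQL query string instead of a DataFrame, so
# the entity rows never round-trip through pandas.
entity_sql = """
    SELECT
        week_start_date_id,
        timestamp AS event_timestamp
    FROM `my_project.my_dataset.week_start_date_features`
"""

training_df = fs.get_historical_features(
    features=feature_service,
    entity_df=entity_sql,
).to_df()
```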
I think there is also the issue that a BigQuery `DATETIME` column does not actually represent a specific point in time, as it is not timezone aware. Arguably, your BigQuery table should use `TIMESTAMP`, not `DATETIME`.
To add to this, the reason that `to_pydatetime()` doesn't help you is that BigQuery uses Parquet to upload a pandas DataFrame, using PyArrow to convert the DataFrame to a Parquet file. The type mapping to Arrow is defined here: https://arrow.apache.org/docs/python/pandas.html#datetime-timestamp-types. As you can see, Python datetime objects map to Arrow timestamp types in order to maintain timezone awareness.

BigQuery then conducts its own type mapping on the Parquet file, as defined here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions. Again, to maintain timezone awareness, `TIMESTAMP` is used rather than `DATETIME`.
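A quick way to observe the first step of that mapping yourself (a minimal sketch, not from the original thread):

```python
from datetime import datetime

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)]})

# Even though the column was built from Python datetime objects, Arrow stores
# it as a timestamp type, which the BigQuery Parquet loader then maps to
# TIMESTAMP rather than DATETIME.
print(pa.Table.from_pandas(df).schema)
# -> event_timestamp: timestamp[ns]  (plus pandas metadata)
```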
One 'solution' here would be to cast the `event_timestamp` column to `TIMESTAMP` (without defining a timezone, which I think defaults to UTC) in all the BigQuery query templates in Feast, but I think this is likely to cause more timezone issues than it solves 😕
In conclusion, it would be my stance that your original BigQuery schema is poorly defined and I'm not sure there is really a problem in Feast other than the lack of a helpful error message.
> Instead of passing an `entity_df` pandas DataFrame to `get_historical_features`, if we just pass it a string containing a BigQuery SQL query, it works: it does not throw the error and returns a DataFrame with the joined features.
It makes sense that this would work, as both sides of the join would be `DATETIME`; however, the point that a `DATETIME` does not actually represent a point in time still stands, and so this is arguably more of a bug than a feature.
Hi @judahrand, thanks for the tips. I've tried casting the `event_timestamp` to `TIMESTAMP`, and also tried just defining the data type in the BigQuery table as `TIMESTAMP`, but this does not resolve the issue.
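Roughly what the cast attempt looked like; the query is a stand-in, not our exact SQL:

```python
from google.cloud import bigquery

# Cast the entity timestamp to TIMESTAMP when building the entity DataFrame;
# the table name is a placeholder. This still produced the same signature
# error for us once the result was passed through pandas.
entity_sql = """
    SELECT
        week_start_date_id,
        CAST(timestamp AS TIMESTAMP) AS event_timestamp
    FROM `my_project.my_dataset.week_start_date_features`
"""
entity_df = bigquery.Client().query(entity_sql).result().to_dataframe()
```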
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.