Latest Only option for Historical Retrieval
Is your feature request related to a problem? Please describe.
In many batch workflows, it is worthwhile to retrieve only the latest features per entity. This is useful for both production and backtesting.
E.g. if I have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, we wouldn't really use the online store for this.
Describe the solution you'd like
Give users an option to deduplicate an entity set extracted from a feature view by the latest value per entity. Depends on #1611
```python
my_daily_batch_scoring_df = store.get_latest_features(
    entity_df="my_df",
    feature_refs=[...],
)
```
Additional context
Linked issue #1611
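For illustration, a minimal pandas sketch of the intended "latest only" semantics; the function name above and the column names below are hypothetical, not existing Feast API:

```python
import pandas as pd

# Hypothetical sketch: "latest only" means keeping, for each entity key,
# the row with the greatest event timestamp.
features = pd.DataFrame({
    "customer_id": [123, 123, 456],
    "event_timestamp": pd.to_datetime(["2021-07-01", "2021-07-05", "2021-07-03"]),
    "num_transactions": [3, 7, 2],
})

latest = (
    features.sort_values("event_timestamp")
            .groupby("customer_id")
            .tail(1)
)
# latest -> one row per customer_id: (123, 2021-07-05, 7) and (456, 2021-07-03, 2)
```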
Thanks for raising this @charliec443
> This is useful for both production and backtesting
I think it would be useful to be explicit in your problem statement. What aspect of the existing API makes it incapable (or inconvenient) for your use case? Why are the latest values used for backtesting, and not historic values? I would have expected backtesting to use historic values.
> if I have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, we wouldn't really use the online store for this.
The first part of this sentence doesn't really connect to the second. I'm a bit confused as to what you mean.
> Give users an option to deduplicate an entity set extracted from a feature view by the latest value per entity. Depends on #1611
@MattDelac is this API moving closer to what you are using internally?
> @MattDelac is this API moving closer to what you are using internally?
Not really
But we have the same need for batch predictions, where we want to predict on the latest values of the features in batch. Therefore we could bypass the historical retrieval logic and have a SQL template that is much more efficient.
In terms of API, I would rather have another API, e.g. `store.get_latest_features()`, rather than a boolean parameter.
And as I said, `store.get_latest_features()` could be a very efficient SQL query.
Hope that makes sense
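For example, a sketch of what such a template could look like, as a Python string with BigQuery-style SQL; all table and column names are illustrative, not an actual Feast template:

```python
# Hypothetical SQL template for "latest row per entity": one window scan
# instead of the full point-in-time join used by historical retrieval.
LATEST_PER_ENTITY_SQL = """
SELECT * EXCEPT (feast_row_num)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY {entity_key}
            ORDER BY {event_timestamp_column} DESC
        ) AS feast_row_num
    FROM `{feature_table}`
)
WHERE feast_row_num = 1
"""

print(LATEST_PER_ENTITY_SQL.format(
    entity_key="customer_id",
    event_timestamp_column="event_timestamp",
    feature_table="my_project.my_dataset.customer_features",
))
```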
> @MattDelac is this API moving closer to what you are using internally?
>
> Not really
>
> But we have the same need for batch predictions, where we want to predict on the latest values of the features in batch. Therefore we could bypass the historical retrieval logic and have a SQL template that is much more efficient.
>
> In terms of API, I would rather have another API, e.g. `store.get_latest_features()`, rather than a boolean parameter. And as I said, `store.get_latest_features()` could be a very efficient SQL query. Hope that makes sense
`store.get_latest_features()` could be a shared method that is also used for materialization into the online store. Seems like a good idea to me.
> The first part of this sentence doesn't really connect to the second. I'm a bit confused as to what you mean.
That's fair, because I don't have a clear vision right now. Where the existing API might be clunky for backtesting in batch is that we might want to partition by a whole feature view, which can't easily be filtered by time (and I'm more than happy to be challenged that this is "too hard" or that I'm doing it wrong).
Prediction problem: fraud detection over the customer base.
Input feature groups:
- Customer demographics
- Customer event interaction
- Customer call transcript
Sample data:
Customer demographics
| CUST_NUM | GENDER | START_DATE |
|---|---|---|
| 123 | F | 2001-01-01 |
| 456 | M | 2001-01-01 |
| 789 | NA | 2001-01-01 |
Customer Event
| CUST_NUM | EVENT | EVENT_DATE |
|---|---|---|
| 123 | 1 | today - 10 days |
| 456 | 10 | today - 10 days |
| 789 | 100 | today - 200 days |
Customer Call Transcript
| CUST_NUM | Transcript | EVENT_DATE |
|---|---|---|
| 789 | Hello World | today - 10 days |
In this example, for a backtest as of "10 days ago", we want to filter over our whole customer base (i.e. use the "customer demographics" feature view). But when we get the features out, based on my proposed sample data, each of customers 123, 456, and 789 should appear in the dataset despite not having been updated in the main view.
After thinking out loud, maybe this is a "too hard, won't do". Or there's an entirely different solution, which is to generate a dataset with `CUST_NUM`, `SNAPSHOT_DATE` as an `entity_df` instead.
Though `store.get_latest_features()` may be the more appropriate start to this challenge.
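To make the backtest case concrete, a minimal pandas sketch, assuming the entity set comes from the demographics view (all names illustrative):

```python
import pandas as pd

snapshot = pd.Timestamp.today().normalize() - pd.Timedelta(days=10)

demographics = pd.DataFrame({"CUST_NUM": [123, 456, 789]})
events = pd.DataFrame({
    "CUST_NUM": [123, 456, 789],
    "EVENT": [1, 10, 100],
    "EVENT_DATE": [snapshot, snapshot, snapshot - pd.Timedelta(days=190)],
})

# Latest event per customer as of the snapshot date; every customer from
# the demographics view is kept, even if their features are stale.
latest_as_of = (
    events[events["EVENT_DATE"] <= snapshot]
    .sort_values("EVENT_DATE")
    .groupby("CUST_NUM")
    .tail(1)
)
backtest_df = demographics.merge(latest_as_of, on="CUST_NUM", how="left")
# 123, 456 and 789 all appear; 789 carries its 200-day-old event value.
```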
Thanks for this @charliec443
> dataset despite not having been updated in the main view
What is the "main view" here?
Sorry about that, I wasn't being clear here, was I?
I'll try to re-frame this problem through the lens of what I've observed, and it might just come down to "this is a robotic process automation issue, not a Feast issue" + "data scientists need to write custom code", or "this is some kind of online transformation feature that would come in the future"...
Problem Statement: for our marketing model, we:
- filter by customers who have had an "interaction" with us in the last 10 days
- perform model scoring for back test
The challenge here is that an "interaction" is based on data in two tables. So perhaps a more appropriate "Feast" solution is to create a new feature table(?) that has the combined interaction information, to filter on before grabbing data from the respective "event" and "call transcript" tables.
Using only Event data
get_latest("customer_event", start_date="today - 10", end_date="today")
May inadvertently filter out customer 789
Using only call transcript
get_latest("call_transcript", start_date="today - 10", end_date="today")
Would only keep customer 789
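A quick pandas illustration of that coverage gap, using the sample tables above (illustrative code, not Feast API):

```python
import pandas as pd

today = pd.Timestamp.today().normalize()
events = pd.DataFrame({
    "CUST_NUM": [123, 456, 789],
    "EVENT_DATE": [today - pd.Timedelta(days=10)] * 2 + [today - pd.Timedelta(days=200)],
})
calls = pd.DataFrame({
    "CUST_NUM": [789],
    "EVENT_DATE": [today - pd.Timedelta(days=10)],
})

# Filtering each table in isolation covers a different slice of customers.
window = today - pd.Timedelta(days=10)
print(events.loc[events["EVENT_DATE"] >= window, "CUST_NUM"].tolist())  # [123, 456]; drops 789
print(calls.loc[calls["EVENT_DATE"] >= window, "CUST_NUM"].tolist())    # [789]; only 789
```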
Possible issues with "custom transformation"
If you had a custom transformation for the purposes of filtering, then this could be really messy in production (as always...), as the tables you would generate would be specific to this pipeline; having 100 models would then lead to 100 such tables. Perhaps this is a necessary evil to simplify a feature store API.
It would then be:
Customer demographics
| CUST_NUM | GENDER | START_DATE |
|---|---|---|
| 123 | F | 2001-01-01 |
| 456 | M | 2001-01-01 |
| 789 | NA | 2001-01-01 |
Customer Event
| CUST_NUM | EVENT | EVENT_DATE |
|---|---|---|
| 123 | 1 | today - 10 days |
| 456 | 10 | today - 10 days |
| 789 | 100 | today - 200 days |
Customer Call Transcript
| CUST_NUM | Transcript | EVENT_DATE |
|---|---|---|
| 789 | Hello World | today - 10 days |
My custom transformation to derive how a training dataset gets automatically filtered (`customer_last_interaction`):
| CUST_NUM | INTERACTION_DATE | INTERACTION_TYPE |
|---|---|---|
| 123 | today - 10 days | EVENT |
| 456 | today - 10 days | EVENT |
| 789 | today - 10 days | CALL |
| 789 | today - 200 days | EVENT |
Then we would create the training set via:
get_latest("customer_last_interaction", start_date="today-10", end_date="today")
Other Solutions
Perhaps the most obvious one is to support a list of entities which are "magically" concatenated by entity id + event timestamp only. This just creates a mess if people combine `List(string)` and `List(dataframes)`, especially if the views/entity_df have different columns.
This might just be a topic to be discussed later... it certainly doesn't need to be "solved" before having a solution which tackles the majority of use cases.
Matt's comment here: https://github.com/feast-dev/feast/issues/1611#issuecomment-880872664 touches on this in a way.
In this setting, we infer the entity set from the features it is constructed from, assuming all entity keys are used, and first create an entity × event_timestamp dataframe which is used as the basis for the `get_historical_features` method.
This approach allows mixing of entity "views", though this may be counterintuitive (can be fixed with documentation!).
Trying to explain this in words is proving to be overly complicated in my head though (apologies if it doesn't make total sense)...
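Roughly, a sketch of that intermediary entity × event_timestamp dataframe (column names hypothetical):

```python
import pandas as pd

# Build an entity x event_timestamp frame from a feature view's entity
# keys, then hand it to get_historical_features as the entity_df.
entity_keys = pd.DataFrame({"CUST_NUM": [123, 456, 789]})
snapshot_dates = pd.to_datetime(["2021-07-01", "2021-07-08"])

entity_df = entity_keys.merge(
    pd.DataFrame({"event_timestamp": snapshot_dates}), how="cross"
)
# entity_df now has len(entity_keys) * len(snapshot_dates) rows and only
# the key + timestamp columns; features come exclusively from feature_refs.
```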
Basically it boils down to this:
- If I, as a data scientist, use `get_historical_features(entity_df=<a dataframe>)`, it keeps all the columns in the dataframe.
- On the other hand, in Matt's API proposal (which solves what I was trying to get at by building an intermediary entity × timestamp dataframe), `get_historical_features(entity_df=<an entity view>)` would instead not keep any of the columns unless explicitly listed in `feature_refs`.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I still believe that this is an important feature for batch prediction pipelines. In that case you need the latest values from the offline store.
You also need to keep this idea of an `entity_df`, which we don't have with the `pull_latest_from_table_or_query()` method.
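For context, a rough sketch of how a `get_latest_features` could sit on top of `pull_latest_from_table_or_query()`; the attribute and parameter names follow the offline store interface of this era and may differ across Feast versions, and the `entity_df` join is purely hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: pull the latest row per entity from the offline
# store, then left-join onto the caller's entity_df so the caller
# controls which entities get scored.
def get_latest_features(offline_store, config, feature_view, entity_df,
                        lookback_days=365):
    end = datetime.utcnow()
    job = offline_store.pull_latest_from_table_or_query(
        config=config,
        data_source=feature_view.batch_source,
        join_key_columns=feature_view.entities,
        feature_name_columns=[f.name for f in feature_view.features],
        event_timestamp_column=feature_view.batch_source.event_timestamp_column,
        created_timestamp_column=feature_view.batch_source.created_timestamp_column,
        start_date=end - timedelta(days=lookback_days),
        end_date=end,
    )
    latest = job.to_df()
    return entity_df.merge(latest, on=feature_view.entities, how="left")
```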
@vas28r13 note: this is probably the better approach and mirrors what we discussed.
I'm new to the Feast codebase and wanted to contribute to the project. If no one has any objection, then I would like to start analyzing this task and implementing it, if it is a good one for a newbie like me.