Historical retrieval without an entity dataframe
Is your feature request related to a problem? Please describe.
The current Feast get_historical_features() method requires that users provide an entity dataframe, as follows:
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
)
However, many users would like the feature store to provide entities to them for training, instead of having to query or provide entities as part of the entity dataframe.
Describe the solution you'd like
Allow users to specify an existing feature view from which an entity dataframe will be queried:
training_df = store.get_historical_features(
    entity_df="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
)
With the addition of time range filtering:
training_df = store.get_historical_features(
    entity_df="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
    from_date=(today - timedelta(days=7)),
    to_date=datetime.now(),
)
training_df = store.get_historical_features(
    left_table="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_activity:rating',
    ],
)
Does this mean the resulting training_df will contain every row (but only the driver_id, event_timestamp, trips_today, and rating columns) from the drivers_activity view?
Actually my example was poor. I've modified it to show that we can query multiple feature views. Essentially, how it works is that we will query the entity_df for all entities, but it can now be an existing feature view. We would only query it for timestamps and entity columns; features then get joined onto those rows as usual.
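For illustration, here is a rough sketch of what entity_df="drivers_activity" could be shorthand for, assuming a pandas-readable source; the file path and column names below are hypothetical:

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a feature repo in the current directory

# Hypothetical dataframe holding the drivers_activity view's underlying data.
drivers_activity_df = pd.read_parquet("data/drivers_activity.parquet")

# Query it only for the entity and timestamp columns; this becomes the entity dataframe.
entity_df = drivers_activity_df[['driver_id', 'event_timestamp']]

# Features then get joined onto those rows as usual.
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
).to_df()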
Should there also be an option to "keep latest" only, when used in conjunction with the time range filtering? Otherwise it's more than possible that the underlying entity dataframe could have duplicated entity keys.
The use case for this, in my mind, is for backtesting purposes.
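To make the "keep latest" option concrete, a minimal pandas sketch of the deduplication, assuming hypothetical driver_id and event_timestamp columns:

# Within the filtered time range, keep only the newest row per entity key.
entity_df = (
    entity_df.sort_values('event_timestamp')
             .drop_duplicates(subset=['driver_id'], keep='last')
)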
Do you mean entity row or entity key? https://docs.feast.dev/concepts/data-model-and-concepts#entity-row
So you would not want to return features with the same entity key over different dates?
I was thinking entity key. Only as an option - there are use cases for enabling both of them.
For example, if our machine learning deployment is a daily batch job, then for back-testing we would perhaps call get_historical_features(from_date=my_date - timedelta(days=1), to_date=my_date), where my_date is the timestamp to simulate when our machine learning job "would run" on a daily basis.
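As a sketch, a daily back-testing loop under the proposed API might look as follows; from_date, to_date, and my_dates are hypothetical, not part of the released Feast API:

from datetime import timedelta

# my_dates is a hypothetical list of datetimes, one per simulated daily run.
for my_date in my_dates:
    daily_df = store.get_historical_features(
        entity_df="drivers_activity",
        feature_refs=['drivers_activity:trips_today'],
        from_date=my_date - timedelta(days=1),
        to_date=my_date,
    ).to_df()
    # ... score the model on daily_df and record back-test metrics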
Though this then raises a good question of how this kind of workflow should be productionised. E.g. if I have an hourly/daily batch which goes through our whole customer base to find fraudulent customers, how should this work in Feast? We wouldn't really use the online store for this, and the API could look something like:
my_daily_batch_scoring_df = store.get_historical_features(
    entity_df="my_df",
    feature_refs=[...],
    latest=True,
    from_date=(today - timedelta(days=1)),
    to_date=datetime.now(),
)
Probably a discussion for another thread...
Can I give this a go and raise a PR for the file-based offline store only?
I'll stick to the spec as written, though I noticed elsewhere in the repo the nomenclature used was start_date and end_date - should we align to that rather than from_date and to_date? https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/file.py#L219-L220
I can see the value in this. In fact, some other folks have also asked for it. Would you mind creating a new issue and linking back to this issue for us? I think it's worth a separate discussion, specifically the need for a latest only argument in get_historical_features().
You can give it a go, but we probably won't release it until we have support for all our main stores. Perhaps a better middle ground is to add a new method to the FeatureStore class and have it throw a NotImplementedError for the other stores, and specifically print warnings that this functionality is experimental and will change.
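A rough sketch of that middle ground; the method name, warning text, and provider lookup below are hypothetical, not released Feast API:

import warnings

from feast import FeatureStore
from feast.infra.offline_stores.file import FileOfflineStore

def get_historical_features_from_view(store: FeatureStore, view_name, feature_refs,
                                      start_date, end_date):
    # Loudly flag the experimental status on every call.
    warnings.warn(
        "get_historical_features_from_view is experimental and will change.",
        RuntimeWarning,
    )
    offline_store = store._get_provider().offline_store  # hypothetical lookup
    if not isinstance(offline_store, FileOfflineStore):
        raise NotImplementedError(
            "Historical retrieval without an entity dataframe is only "
            "implemented for the file offline store so far."
        )
    # ... build the entity dataframe from view_name and delegate to
    # store.get_historical_features(...)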
Sounds good, hopefully I'll pull something together "soon". I'll name the method something sensible as well.
Hi there 👋,
As I already explained to Willem, we built a higher-level API on our side to make the life of our users easier. It basically does the following:
from datetime import date, datetime
from typing import List, Union

import pandas as pd

from feast import FeatureStore
from feast.infra.offline_stores.bigquery import BigQueryRetrievalJob

def get_historical_features(
    feature_refs: List[str],
    threshold: Union[datetime, date] = None,
    sample_size: int = 1000,
    left_feature_view: Union[pd.DataFrame, str] = None,
    full_feature_names: bool = False,
) -> BigQueryRetrievalJob:
    # If all the features come from the same FeatureView, then we infer the
    # `left_feature_view` parameter.
    # We get the unique_join_keys in order to remove duplicate data if it exists.
    # It's more or less the following:
    query = f"""
        SELECT
            {', '.join(unique_join_keys)},
            TIMESTAMP '{str_timestamp}' AS {timestamp_column}
        FROM {source_table}
        {where_clause}
        GROUP BY {', '.join(unique_join_keys)}
        {limit_clause}
    """
    # The limit_clause only exists if we want a sample of the left FeatureView.

    store = FeatureStore()

    # We build the query for our users and pass it to Feast.
    return store.get_historical_features(
        entity_df=query,
        feature_refs=feature_refs,
        full_feature_names=full_feature_names,
    )
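For illustration, a hypothetical call to this wrapper could look like the following; the feature refs and view name are illustrative:

from datetime import date

job = get_historical_features(
    feature_refs=['drivers_activity:trips_today', 'drivers_profile:rating'],
    threshold=date(2021, 6, 1),  # the point in time to reconstruct
    sample_size=1000,            # sample at most 1000 entity rows
    left_feature_view="drivers_activity",
)
training_df = job.to_df()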
Happy to have a chat about a similar API implemented in Feast
Any workaround for the problem of getting "training datasets" (without passing entity IDs)?
You can use pull_latest_from_table_or_query, which will do the trick. Of course, it would be nice if there were a suitable abstraction that feels the "same" as the existing APIs.
Hi there, could you please show an example of how to use pull_latest_from_table_or_query?
Here is an example for Spark. Note that it doesn't understand multiple sources or FeatureViews...
from datetime import datetime

import feast
from feast.infra.offline_stores.contrib.spark_offline_store.spark import SparkOfflineStore

fs = feast.FeatureStore(repo_path="/home/feast/feast_repo/large_foal/feature_repo")

feast_features = [
    "crim",
    "zn",
    "indus",
]

# Pull the latest feature values per entity within the given time range,
# straight from the offline store.
srj_latest = SparkOfflineStore.pull_latest_from_table_or_query(
    config=fs.config,
    data_source=fs.get_data_source("boston_source"),
    join_key_columns=["entity_id"],
    feature_name_columns=feast_features,
    timestamp_field="update_ts",
    created_timestamp_column="create_ts",
    start_date=datetime(2022, 11, 20),
    end_date=datetime(2022, 11, 21),
)
srj_latest.to_spark_df().show()
Hello, is this implemented already? I use a Parquet file as the source and want to retrieve the historical features within a time range, without defining an entity df with event_timestamp. Is this possible, and how do I do it? Thanks!
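An untested sketch adapting the Spark example above to the file offline store; the source name, entity column, and feature columns are placeholders:

from datetime import datetime

import feast
from feast.infra.offline_stores.file import FileOfflineStore

fs = feast.FeatureStore(repo_path=".")

# Pull the latest row per entity within the time range from the Parquet-backed source.
job = FileOfflineStore.pull_latest_from_table_or_query(
    config=fs.config,
    data_source=fs.get_data_source("my_parquet_source"),
    join_key_columns=["entity_id"],
    feature_name_columns=["feature_a", "feature_b"],
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
    start_date=datetime(2023, 1, 1),
    end_date=datetime(2023, 1, 2),
)
print(job.to_df())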