
[spark] If I have big offline data (on HDFS), how can I prepare training data with Feast?

Open zwqjoy opened this issue 1 year ago • 4 comments

If I have big offline data (on HDFS), how can I prepare training data with Feast?

Can I write a PySpark file and submit it as a Spark task like below?

spark-submit \
  --master yarn \
  --queue product \
  --deploy-mode cluster \
  make_train_data_with_feast.py

zwqjoy avatar Jul 12 '23 09:07 zwqjoy

What is "make_train_data_with_feast.py"? To my knowledge, Feast does not store data itself; it uses third-party storage services as its offline and online stores. For your files on HDFS, you could start from here: https://docs.feast.dev/reference/offline-stores/spark

shuchu avatar Jul 15 '23 01:07 shuchu
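For reference, the Spark offline store linked above is configured through feature_store.yaml. A minimal sketch, with a placeholder project name and registry path, and spark_conf values you would adapt for a YARN cluster:

```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
        spark.sql.catalogImplementation: "hive"
        spark.sql.session.timeZone: "UTC"
```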

@shuchu I have many features saved in HDFS, and someone may want to merge several of them (features from 2 or 3 paths, say) to prepare training data. These features are very large.

  1. Previously I had to write the PySpark code myself to read and merge them.
  2. Now, can I use Feast with PySpark to read the large features and then call get_historical_features to prepare the training data?

zwqjoy avatar Jul 18 '23 09:07 zwqjoy
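For what it's worth, the point-in-time join that get_historical_features performs can be illustrated with a small pandas sketch. The table contents and column names below are invented for illustration; this is not Feast API code:

```python
import pandas as pd

# Entity dataframe: the training rows we want features for.
entity_df = pd.DataFrame({
    "user_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2023-07-10", "2023-07-11"]),
})

# One offline feature table (in practice, a large table on HDFS).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2023-07-01", "2023-07-09", "2023-07-05"]),
    "clicks_7d": [3, 5, 8],
})

# For each entity row, take the latest feature value at or before the
# entity's event_timestamp (point-in-time correctness, no future leakage).
train_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
)
print(train_df[["user_id", "clicks_7d"]])
```

User 1's row at 2023-07-10 picks up the 2023-07-09 value (5), not the older 2023-07-01 value; user 2 gets 8.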

get_historical_features does not respect Hive-partitioned data and does a full table scan. I saw that the query uses the "<" operator instead of BETWEEN, so for a table with many partitions this could be a bottleneck.

Have you checked this? @zwqjoy

satriawadhipurusa avatar Oct 05 '23 14:10 satriawadhipurusa
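A toy illustration of the concern above, in plain Python rather than Feast internals: with daily partitions, an unbounded "<" predicate touches every partition before the end date, while a bounded BETWEEN touches only the window actually needed. The dates and partition counts are made up for the example:

```python
from datetime import date, timedelta

# Suppose the table is partitioned by day over roughly three years.
partitions = [date(2021, 1, 1) + timedelta(days=i) for i in range(3 * 365)]

end = date(2023, 7, 12)
start = end - timedelta(days=7)   # only a week of history is needed

# Predicate like: WHERE event_timestamp < end  -> scans all older partitions.
scanned_lt = [p for p in partitions if p < end]

# Predicate like: WHERE event_timestamp BETWEEN start AND end -> bounded scan.
scanned_between = [p for p in partitions if start <= p <= end]

print(len(scanned_lt), len(scanned_between))
```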

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 17 '24 11:03 stale[bot]