[spark] If I have big offline data (on HDFS), how can I prepare training data with Feast?
Can I write a PySpark file and submit a Spark job like the one below?
```bash
spark-submit \
  --master yarn \
  --queue product \
  --deploy-mode cluster \
  make_train_data_with_feast.py
```
What is "make_train_data_with_feast.py"? To my knowledge, Feast does not store data itself; it uses third-party storage services as its offline and online stores. For your files on HDFS, you could start from here: https://docs.feast.dev/reference/offline-stores/spark
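For concreteness, here is a minimal sketch (not from the thread) of what pointing Feast at existing HDFS data through the Spark offline store can look like. The entity, feature names, and paths are all hypothetical, and exact constructor arguments vary across Feast versions:

```python
# A hypothetical Feast repo definition; all names and paths are made up.
from feast import Entity, FeatureStore, FeatureView, Field
from feast.types import Float32
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

user = Entity(name="user_id", join_keys=["user_id"])

# Point Feast at parquet files already sitting on HDFS; no data is copied,
# Feast only records where the data lives.
user_features_source = SparkSource(
    name="user_features",
    path="hdfs://namenode:8020/data/features/user_features",  # hypothetical path
    file_format="parquet",
    timestamp_field="event_timestamp",  # event_timestamp_column in older Feast versions
)

user_features_view = FeatureView(
    name="user_features",
    entities=[user],
    schema=[Field(name="clicks_7d", dtype=Float32)],
    source=user_features_source,
)

# Assumes feature_store.yaml in this repo sets `offline_store: type: spark`
# (plus spark_conf for YARN) so retrieval queries run on the cluster.
store = FeatureStore(repo_path=".")
store.apply([user, user_features_view])
```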
@shuchu If I have many features saved in HDFS, and someone wants to merge several of them (say, features from 2-3 paths) to prepare training data, and these features are very large:

- Before: I had to write PySpark code myself to read and merge them.
- Now: can I instead use Feast with PySpark, register the large HDFS features with a local Feast repo, and use get_historical_features to prepare the training data? (See the sketch after this list.)
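A hedged sketch of that workflow, assuming feature views like the one above have already been applied; the view names, feature references, and table name are hypothetical:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# The entity dataframe supplies the join keys and event timestamps for the
# training rows. With the Spark offline store it may also be a SQL string
# evaluated by Spark (hypothetical table name below).
entity_df = "SELECT user_id, event_timestamp FROM training_events"

training_job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:clicks_7d",  # hypothetical feature references
        "item_features:price",
    ],
)

# Materialize the point-in-time-correct join. to_df() returns pandas, which
# only makes sense if the result fits in memory; some versions expose
# to_spark_df() to keep the result distributed.
training_df = training_job.to_df()
training_df.to_parquet("train_data.parquet")
```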
get_historical_features does not respect Hive-partitioned data and does a full table scan. I saw that the generated query uses the "<" operator instead of BETWEEN, so for a table with many partitions this can be a bottleneck.
Have you checked it? @zwqjoy
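One way to check this claim is to compare the physical plans Spark produces for the two predicate shapes. A minimal sketch, assuming a Hive table partitioned by a dt column (table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Predicate bounded on both sides of the partition column: only the
# partitions inside the range should appear in the plan.
spark.sql(
    "SELECT * FROM user_features "
    "WHERE dt BETWEEN '2023-01-01' AND '2023-01-31'"
).explain()

# Predicate bounded only from above: every partition up to the bound is
# scanned, which approaches a full scan on a long-lived table.
spark.sql(
    "SELECT * FROM user_features WHERE dt < '2023-01-31'"
).explain()
```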
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.