
Spark Offline Store -- how do I configure a Spark source?

Open zwqjoy opened this issue 1 year ago • 5 comments

My understanding is to configure it as follows:

project: feast_spark_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: yarn
        spark.ui.enabled: "true"
        spark.eventLog.enabled: "true"
        spark.sql.catalogImplementation: "hive"
        spark.sql.parser.quotedRegexColumnNames: "true"
        spark.sql.session.timeZone: "UTC"

Then define the SparkSource with the following code:

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

driver_hourly_stats = SparkSource(
    name="driver_hourly_stats",
    query="SELECT event_timestamp as ts, created_timestamp as created, conv_rate, conv_rate, conv_rate FROM emr_feature_store.driver_hourly_stats",
    event_timestamp_column="ts",
    created_timestamp_column="created",
)

Then define the FeatureView (a sketch is given below) and finally fetch the training data.
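A minimal FeatureView over the SparkSource above might look like the following sketch. The entity name, the join key driver_id, the Float32 field, and the TTL are assumptions for illustration; the asker's SELECT would also need to return the join key column, and the retrieval call below references views named driver_new_stats / driver_new2_stats, so treat all names here as illustrative.

from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float32

# Hypothetical entity: the join key must be a column returned by the
# SparkSource query (the query above would need to select driver_id too).
driver = Entity(name="driver", join_keys=["driver_id"])

driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=driver_hourly_stats,  # the SparkSource defined above
)

After feast apply registers these objects, a historical retrieval such as the next snippet can run: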

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_new_stats:new_conv_rate",
        "driver_new2_stats:new_conv_rate",
    ],
    full_feature_names=True,
).to_df()

What I want to ask is: when get_historical_features runs, how does Feast connect to the Spark cluster and submit the job?

Today I submit all my Spark jobs with spark-submit, so I don't understand how a job gets submitted once the Spark Offline Store is configured.

zwqjoy avatar Jul 18 '23 14:07 zwqjoy

From this document: https://docs.feast.dev/reference/offline-stores/spark

you can set "spark.master" to your Spark cluster's IP and port.

project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
        spark.ui.enabled: "false"
        spark.eventLog.enabled: "false"
        spark.sql.catalogImplementation: "hive"
        spark.sql.parser.quotedRegexColumnNames: "true"
        spark.sql.session.timeZone: "UTC"
online_store:
    path: data/online_store.db
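For example, pointing spark.master at a real cluster instead of local[*] might look like the fragment below; the host and port are placeholders, and the yarn value assumes the machine running Feast has a working Hadoop/YARN client configuration.

offline_store:
    type: spark
    spark_conf:
        # standalone cluster (placeholder host/port):
        spark.master: "spark://<master-host>:7077"
        # or, for a YARN cluster in client mode:
        # spark.master: "yarn"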

shuchu avatar Jul 18 '23 15:07 shuchu

You can read the code: https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark_source.py

to understand how Feast talks with Spark.
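In short, the Spark offline store builds a SparkSession inside the Python process that calls get_historical_features; nothing goes through spark-submit. A simplified sketch of that mechanism follows; the helper name start_spark_session is ours for illustration, not Feast's actual function.

from pyspark.sql import SparkSession

def start_spark_session(spark_conf: dict) -> SparkSession:
    # Build an in-process SparkSession from the spark_conf block of
    # feature_store.yaml. With spark.master set to "yarn", this Python
    # process becomes the driver of a yarn-client application and asks
    # the cluster for executors directly.
    builder = SparkSession.builder.appName("feast_spark_offline_store")
    for key, value in spark_conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# local[*] keeps the sketch runnable on one machine; against a cluster
# you would pass "yarn" or "spark://<master-host>:7077" instead.
spark = start_spark_session({"spark.master": "local[*]"})
spark.sql("SELECT 1 AS ok").show()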

shuchu avatar Jul 18 '23 15:07 shuchu

@shuchu


If the existing Spark cluster only supports submitting jobs via spark-submit, how can I make use of Feast's functionality?

zwqjoy avatar Jul 20 '23 06:07 zwqjoy

I would say "spark-submit" will not work with the current implementation.

In other words: Feast does not support spark-submit today, unless someone is willing to write a new Spark offline store based on spark-submit. The current implementation starts a SparkSession inside the Feast client process rather than shelling out to spark-submit.

shuchu avatar Jul 29 '23 22:07 shuchu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 17 '24 11:03 stale[bot]