[SUPPORT] --hoodie-conf not overriding value in --props file - deployment with kubernetes operator - org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer
To Reproduce
Steps to reproduce the behavior:
- Launch the Hudi Multi Table Streamer using the Spark Operator
- Use --hoodie-conf to override one property
- Pass --props pointing to the props.properties file
Expected behavior
We are having difficulties avoiding hardcoded secrets in the props.properties file in Kubernetes. The file comes from a ConfigMap, where environment variables are not expanded, even though we can use env vars from Secrets in the "arguments" of the SparkApplication deployed through the Spark Operator. It seems that the --hoodie-conf parameter is not working either, so in the end the issue persists.
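To make that limitation concrete, here is a small sketch in plain Java (not anything Hudi or Kubernetes provides; ${KAFKA_PASSWORD} is just a hypothetical placeholder): neither the ConfigMap mount nor a plain properties load expands environment variables, so a placeholder written into props.properties would survive as literal text.

import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch: a hypothetical env-var placeholder inside the mounted props.properties.
// Kubernetes does not expand env vars in ConfigMap file contents, and
// java.util.Properties does no interpolation either, so the text stays literal.
public class EnvVarPlaceholderSketch {
  public static void main(String[] args) throws IOException {
    String line = "sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule "
        + "required username=\"myuser\" password=\"${KAFKA_PASSWORD}\";";
    Properties props = new Properties();
    props.load(new StringReader(line));
    // Prints the literal ${KAFKA_PASSWORD}, not the value of the environment variable.
    System.out.println(props.getProperty("sasl.jaas.config"));
  }
}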
According to the code and the docs, --hoodie-conf is supposed to override configurations that are in the properties file passed with the --props argument.
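To make the expected precedence concrete, here is a minimal sketch in plain Java (this is not Hudi's actual implementation, just the behavior we are relying on; the path and override string are the ones from the setup below): the values from the --props file are loaded first, and every --hoodie-conf key=value is applied on top, so the command-line value should win.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

// Sketch of the override precedence we expect (not Hudi code):
// file values from --props first, then --hoodie-conf entries on top.
public class OverridePrecedenceSketch {
  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream("/table_configs/props.properties")) {
      props.load(in); // e.g. sasl.jaas.config with a dummy password from the ConfigMap
    }

    List<String> hoodieConfs = List.of(
        "sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule "
            + "required username=\"myuser\" password=\"mypass\";");

    for (String conf : hoodieConfs) {
      int idx = conf.indexOf('=');           // split only on the first '=', because the
      String key = conf.substring(0, idx);   // JAAS value itself contains '=' characters
      String value = conf.substring(idx + 1);
      props.setProperty(key, value);         // the command-line value replaces the file value
    }

    // Expect the --hoodie-conf value here, not the one from props.properties.
    System.out.println(props.getProperty("sasl.jaas.config"));
  }
}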
Environment Description
- Hudi version: 0.13.1
- Spark version: 2.1.3
- Running on Docker? (yes/no): Yes, deployed in Kubernetes
Additional context
In the Spark Operator, this is the part where the arguments for the spark-submit job are passed:
arguments:
- "--props"
- "file:///table_configs/props.properties"
- --hoodie-conf
- "sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"myuser\" password=\"mypass\";"
- "--schemaprovider-class"
- "org.apache.hudi.utilities.schema.SchemaRegistryProvider"
- "--op"
- "UPSERT"
- "--table-type"
- COPY_ON_WRITE
- "--base-path-prefix"
- "$(ENV1)"
- "--source-class"
- org.apache.hudi.utilities.sources.AvroKafkaSource
- --enable-sync
- "--sync-tool-classes"
- org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
- "--source-ordering-field"
- __kafka_ingestion_ts_ms
- --config-folder
- "file:///table_configs"
- --source-limit
- "400000"
As you can see, the idea is to substitute the Kafka username and password via --hoodie-conf.
Stacktrace
The issue is that the value is not being substituted. I tried it both ways, with the property set to a dummy value in props.properties and with the property absent from the file entirely, and it does not work in either case.
Here is the spark-submit configuration:
/opt/spark/bin/spark-submit --conf spark.driver.bindAddress= --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer local:///app/hudi-utilities-bundle_2.12-0.13.1.jar --hoodie-conf 'sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="myuser" password="mypass";' --props file:///table_configs/props.properties --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider --op UPSERT --table-type COPY_ON_WRITE --base-path-prefix s3a://xxxxxt/hudi_ingestion_data/hudi/data/ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource --enable-sync --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool --source-ordering-field __kafka_ingestion_ts_ms --config-folder file:///table_configs --source-limit 400000
As you can see above, the arguments reach spark-submit correctly, yet the property passed with --hoodie-conf is not taking effect.
The props.properties at file:///table_configs/props.properties is mounted from a ConfigMap into both the Spark driver and executors, like this:
configMaps:
  - name: airflow-metastore-config
    path: /table_configs
The config map contains:
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-metastore-config
  namespace: spark
data:
  props.properties: |-
    hoodie.deltastreamer.ingestion.tablesToBeIngested=abc.celery_taskmeta,abc.dag,abc.dag_run,abc.job,abc.log,abc.sla_miss,abc.slot_pool,abc.task_fail,abc.task_instance
    hoodie.deltastreamer.ingestion.abc.celery_taskmeta.configFile=file:///table_configs/celery_taskmeta.properties
    hoodie.deltastreamer.ingestion.abc.dag.configFile=file:///table_configs/dag.properties
    hoodie.deltastreamer.ingestion.abc.dag_run.configFile=file:///table_configs/dag_run.properties
    hoodie.deltastreamer.ingestion.abc.job.configFile=file:///table_configs/job.properties
    hoodie.deltastreamer.ingestion.abc.log.configFile=file:///table_configs/log.properties
    hoodie.deltastreamer.ingestion.abc.sla_miss.configFile=file:///table_configs/sla_miss.properties
    hoodie.deltastreamer.ingestion.abc.slot_pool.configFile=file:///table_configs/slot_pool.properties
    hoodie.deltastreamer.ingestion.abc.task_fail.configFile=file:///table_configs/task_fail.properties
    hoodie.deltastreamer.ingestion.abc.task_instance.configFile=file:///table_configs/task_instance.properties
    bootstrap.servers=leleelalalal:9096
    auto.offset.reset=earliest
    security.protocol=SASL_SSL
    sasl.mechanism=SCRAM-SHA-512
    sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="u" password="p";
    schema.registry.url=http://schema-registry-confluent.kafka.svc.cluster.local:8081
    hoodie.datasource.write.insert.drop.duplicates=true
    group.id=hudigroupid
    hoodie.deltastreamer.schemaprovider.registry.baseUrl=http://schema-registry-confluent.kafka.svc.cluster.local:8081/subjects/
    hoodie.deltastreamer.schemaprovider.registry.urlSuffix=-value/versions/latest