
[SUPPORT] --hoodie-conf not overriding value in --props file - deployment with kubernetes operator - org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer

Open mattssll opened this issue 10 months ago • 0 comments

To Reproduce

Steps to reproduce the behavior:

  1. Launch the Hudi Multi Table Streamer using Spark Operator
  2. Use --hoodie-conf to override one property
  3. Pass --props pointing to the props.properties file

Expected behavior

We're trying to avoid having hardcoded secrets in the props.properties file in Kubernetes. This file comes from a ConfigMap, and an environment variable is not expanded there, even though we can use env vars from Secrets in the "arguments" of the SparkApplication deployed through Spark Operator. It seems the --hoodie-conf parameter is not working either, so the issue persists.

According to the code and docs, --hoodie-conf is supposed to override configurations in the properties file passed via the --props argument.
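
As a sanity check on that claimed precedence, here is a minimal, hypothetical Java sketch of the documented merge order (load the --props file first, then apply each --hoodie-conf key=value on top); it illustrates the expected behavior and is not Hudi's actual implementation. Note that each override must be split on the first '=' only, which matters here because the JAAS value itself contains '=' characters:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch, not Hudi's code: base values come from the --props
// file, then each --hoodie-conf key=value is applied on top, so the CLI
// value should win.
public class PropsOverrideSketch {

    static Properties resolve(Reader propsSource, List<String> hoodieConfs) throws IOException {
        Properties props = new Properties();
        props.load(propsSource);                       // base values from --props
        for (String conf : hoodieConfs) {
            int eq = conf.indexOf('=');                // split on the FIRST '=' only;
            props.setProperty(conf.substring(0, eq),   // the JAAS value itself
                    conf.substring(eq + 1));           // contains '=' characters
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Reader base = new StringReader("sasl.jaas.config=dummy\nauto.offset.reset=earliest\n");
        Properties p = resolve(base, List.of(
                "sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule "
                        + "required username=\"myuser\" password=\"mypass\";"));
        System.out.println(p.getProperty("sasl.jaas.config")); // expect the CLI value, not "dummy"
    }
}
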
Environment Description

  • Hudi version : 0.13.1

  • Spark version : 2.1.3

  • Running on Docker? (yes/no) : Yes, deployed in Kubernetes

Additional context

In the Spark Operator manifest, this is the section where the arguments for the spark-submit job are passed (a sketch of a possible env-var workaround follows the block):

  arguments:
      - "--props"
      - "file:///table_configs/props.properties"
      - --hoodie-conf
      - "sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"myuser\" password=\"mypass\";"
      - "--schemaprovider-class"
      - "org.apache.hudi.utilities.schema.SchemaRegistryProvider"
      - "--op"
      - "UPSERT"
      - "--table-type"
      - COPY_ON_WRITE
      - "--base-path-prefix"
      - "$(ENV1)"
      - "--source-class"
      - org.apache.hudi.utilities.sources.AvroKafkaSource
      - --enable-sync
      - "--sync-tool-classes"
      - org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
      - "--source-ordering-field"
      - __kafka_ingestion_ts_ms
      - --config-folder
      - "file:///table_configs"
      - --source-limit
      - "400000"
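
Since $(ENV1) expansion from env vars already works in arguments, one possible workaround is to carry the whole JAAS string in a Secret-backed env var and reference it the same way. This is only a sketch, assuming the operator expands container env vars in arguments as it does for $(ENV1); the Secret name, key, and env var name below are invented for illustration:

  # Hypothetical fragment of the SparkApplication spec
  driver:
    env:
      - name: KAFKA_JAAS_CONFIG
        valueFrom:
          secretKeyRef:
            name: kafka-credentials      # assumed Secret holding the full JAAS string
            key: sasl-jaas-config        # assumed key
  arguments:
      - "--hoodie-conf"
      - "sasl.jaas.config=$(KAFKA_JAAS_CONFIG)"   # same $(...) expansion as $(ENV1)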

As you can see, the idea is to substitute the Kafka username and password via --hoodie-conf.

Stacktrace

The issue is that the property is not being substituted. I tried both ways, with the property set to a dummy value in props.properties and with it absent entirely, and it doesn't work in either case.

Here is the spark-submit configuration:

/opt/spark/bin/spark-submit --conf spark.driver.bindAddress= --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer local:///app/hudi-utilities-bundle_2.12-0.13.1.jar --hoodie-conf 'sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="myuser" password="mypass";' --props file:///table_configs/props.properties --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider --op UPSERT --table-type COPY_ON_WRITE --base-path-prefix s3a://xxxxxt/hudi_ingestion_data/hudi/data/ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource --enable-sync --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool --source-ordering-field __kafka_ingestion_ts_ms --config-folder file:///table_configs --source-limit 400000

As shown above, the spark-submit invocation is correct, yet the property passed with --hoodie-conf is not taking effect.
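
One low-level thing worth ruling out is shell quoting, since the JAAS value contains spaces, embedded double quotes, and a trailing semicolon. A minimal check with plain printf (no Hudi involved) shows the argv entries exactly as the JVM should receive them:

# Hypothetical sanity check: each argument prints on its own line, with the
# inner double quotes and trailing semicolon intact.
printf '%s\n' --hoodie-conf 'sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="myuser" password="mypass";'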

The props.properties at file:///table_configs/props.properties is mounted from a ConfigMap into both the Spark driver and executors, like this:

      configMaps:
        - name: airflow-metastore-config
          path: /table_configs

The config map contains:

apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-metastore-config
  namespace: spark
data:
  props.properties: |-
    hoodie.deltastreamer.ingestion.tablesToBeIngested=abc.celery_taskmeta,abc.dag,abc.dag_run,abc.job,abc.log,abc.sla_miss,abc.slot_pool,abc.task_fail,abc.task_instance

    hoodie.deltastreamer.ingestion.abc.celery_taskmeta.configFile=file:///table_configs/celery_taskmeta.properties
    hoodie.deltastreamer.ingestion.abc.dag.configFile=file:///table_configs/dag.properties
    hoodie.deltastreamer.ingestion.abc.dag_run.configFile=file:///table_configs/dag_run.properties
    hoodie.deltastreamer.ingestion.abc.job.configFile=file:///table_configs/job.properties
    hoodie.deltastreamer.ingestion.abc.log.configFile=file:///table_configs/log.properties
    hoodie.deltastreamer.ingestion.abc.sla_miss.configFile=file:///table_configs/sla_miss.properties
    hoodie.deltastreamer.ingestion.abc.slot_pool.configFile=file:///table_configs/slot_pool.properties
    hoodie.deltastreamer.ingestion.abc.task_fail.configFile=file:///table_configs/task_fail.properties
    hoodie.deltastreamer.ingestion.abc.task_instance.configFile=file:///table_configs/task_instance.properties
    bootstrap.servers=leleelalalal:9096
    auto.offset.reset=earliest
    security.protocol=SASL_SSL
    sasl.mechanism=SCRAM-SHA-512
    sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="u" password="p";
    schema.registry.url=http://schema-registry-confluent.kafka.svc.cluster.local:8081

    hoodie.datasource.write.insert.drop.duplicates=true

    group.id=hudigroupid

    hoodie.deltastreamer.schemaprovider.registry.baseUrl=http://schema-registry-confluent.kafka.svc.cluster.local:8081/subjects/
    hoodie.deltastreamer.schemaprovider.registry.urlSuffix=-value/versions/latest
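
The per-table files referenced above are not included in the report; a file like celery_taskmeta.properties would typically hold the per-table source and write settings along the lines below, where every key value is a guess for illustration, not taken from the actual setup:

# celery_taskmeta.properties (hypothetical contents, for illustration only)
hoodie.deltastreamer.source.kafka.topic=abc.celery_taskmeta
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=__kafka_ingestion_ts_ms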

mattssll · Apr 24 '24 13:04