[FeatureRequest] Inquiry Regarding Hudi Exporter with SQL Transformer for Data Filtering

Open soumilshah1995 opened this issue 1 year ago • 2 comments

Hello,

I'm reaching out to ask about the Hudi exporter service. I've had some experience working with it, and I'm particularly interested in whether it supports integration with the SQL transformer. The idea is to use the Hudi export utility to export Hudi data, but there could be cases where customers need to export filtered data. For instance, they might need all data related to a specific stock such as AAPL. Are there plans to add a filtering mechanism to the Hudi exporter? Here's an example of the spark-submit command:


spark-submit \
    --class org.apache.hudi.utilities.HoodieSnapshotExporter \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0' \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --source-base-path 'file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/bronze_orders' \
    --target-output-path 'file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/json/' \
    --output-format 'json'

I tried these flags:


--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf hoodie.deltastreamer.transformer.sql='SELECT *, extract(year from order_date) as year, extract(month from order_date) as month  FROM <SRC> a' \

It looks like this is not supported by HoodieSnapshotExporter. Ref: https://hudi.apache.org/docs/snapshot_exporter

soumilshah1995 avatar Feb 09 '24 13:02 soumilshah1995

Hey Soumil,

Thanks for the suggestion. This could be a good feature to add to the Hudi exporter utility.

I created a tracking JIRA for it: https://issues.apache.org/jira/browse/HUDI-7403

Thanks, Aditya

ad1happy2go avatar Feb 12 '24 14:02 ad1happy2go

Thank you sir

soumilshah1995 avatar Feb 12 '24 23:02 soumilshah1995

Hi there, I saw that this ticket was completed, and I was trying out this functionality.

Docs


Export to json or parquet dataset with transformation/filtering

The Exporter supports custom transformation/filtering on records before writing to json or parquet dataset. This is done by supplying implementation of org.apache.hudi.utilities.transform.Transformer via --transformer-class option.

spark-submit \
  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
      packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
  --source-base-path "/tmp/" \
  --target-output-path "/tmp/exported/json/" \
  --transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
  --transformer-sql "SELECT substr(rider,1,10) as rider, trip_type as tripType FROM <SRC> WHERE trip_type = 'BLACK' LIMIT 10" \
  --output-format "json"  # or "parquet"

https://hudi.apache.org/docs/next/snapshot_exporter/

The following is failing:

Test 1: No transformer (PASS)

spark-submit \
    --class org.apache.hudi.utilities.HoodieSnapshotExporter \
    --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0 \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-0.15.0.jar \
    --source-base-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/' \
    --target-output-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/dump/json/' \
    --output-format 'parquet'

Test 2: With transformer

spark-submit \
    --class org.apache.hudi.utilities.HoodieSnapshotExporter \
    --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0 \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-0.15.0.jar \
    --source-base-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/' \
    --target-output-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/dump/json/' \
    --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
    --transformer-sql "SELECT * FROM <SRC> WHERE destinationstate='NY'" \
    --output-format 'parquet'

logs


Ivy Default Cache set to: /Users/soumilshah/.ivy2/cache
The jars for the packages stored in: /Users/soumilshah/.ivy2/jars
org.apache.hudi#hudi-spark3.4-bundle_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f0fce1d3-e446-495c-a37a-e2dd7e335611;1.0
	confs: [default]
	found org.apache.hudi#hudi-spark3.4-bundle_2.12;0.15.0 in central
:: resolution report :: resolve 56ms :: artifacts dl 1ms
	:: modules in use:
	org.apache.hudi#hudi-spark3.4-bundle_2.12;0.15.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f0fce1d3-e446-495c-a37a-e2dd7e335611
	confs: [default]
	0 artifacts copied, 1 already retrieved (0kB/2ms)
24/07/25 13:43:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hudi.com.beust.jcommander.ParameterException: Was passed main parameter '--transformer-class' but no main parameter was defined in your arg class
	at org.apache.hudi.com.beust.jcommander.JCommander.initMainParameterValue(JCommander.java:954)
	at org.apache.hudi.com.beust.jcommander.JCommander.parseValues(JCommander.java:755)
	at org.apache.hudi.com.beust.jcommander.JCommander.parse(JCommander.java:356)
	at org.apache.hudi.com.beust.jcommander.JCommander.parse(JCommander.java:335)
	at org.apache.hudi.com.beust.jcommander.JCommander.<init>(JCommander.java:251)
	at org.apache.hudi.utilities.HoodieSnapshotExporter.main(HoodieSnapshotExporter.java:292)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
24/07/25 13:43:59 INFO ShutdownHookManager: Shutdown hook called
24/07/25 13:43:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/s_1bjv516pn_mck29cwdwxnm0000gp/T/spark-4431a82f-15b8-4ac6-948a-db853cbf9fe3
(base) soumilshah@ip-192-168-1-31 E1 % 

soumilshah1995 avatar Jul 25 '24 17:07 soumilshah1995

@wombatu-kun can you take care of this?

danny0405 avatar Jul 26 '24 01:07 danny0405

Yes, of course, I'm already trying to figure it out.

wombatu-kun avatar Jul 26 '24 01:07 wombatu-kun

@soumilshah1995 Hi! This feature exists only in the master branch, but it looks like you are trying to test it on release 0.15 (hudi-utilities-slim-bundle_2.12-0.15.0.jar), which does not contain this commit. Am I right?
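
A quick pre-flight check along these lines could catch the mismatch before spark-submit runs. This is only a sketch: the helper names jar_version and exporter_has_transformer_flag are hypothetical, and the rule "0.x bundles lack --transformer-class, 1.x and later have it" is an assumption drawn from this thread (0.15.0 rejects the option), not something the Hudi docs state. It only inspects the jar file name and never opens the jar.

```shell
#!/usr/bin/env bash
# Sketch: guess from a bundle jar's file name whether HoodieSnapshotExporter
# in that bundle is likely to accept --transformer-class.

# Print the Hudi version embedded in a bundle jar file name, e.g.
# hudi-utilities-slim-bundle_2.12-0.15.0.jar -> 0.15.0
jar_version() {
  local name
  name="$(basename "$1" .jar)"
  # Strip the shortest prefix ending in "_<scala version>-".
  printf '%s\n' "${name#*_*-}"
}

# Return 0 if the bundle is expected to know the transformer options,
# 1 if not, 2 if the version could not be parsed.
# ASSUMPTION (from this thread only): 0.x lacks the option, 1.x+ has it.
exporter_has_transformer_flag() {
  local version major
  version="$(jar_version "$1")"
  major="${version%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$major" -ge 1 ]
}

# Usage: only the names are inspected, so the files need not exist.
for jar in \
  hudi-utilities-slim-bundle_2.12-0.15.0.jar \
  hudi-utilities-slim-bundle_2.12-1.0.0-beta2.jar
do
  if exporter_has_transformer_flag "$jar"; then
    echo "$jar: --transformer-class should be accepted"
  else
    echo "$jar: expect a JCommander ParameterException for --transformer-class"
  fi
done
```

Checking the file name is a cheap heuristic; the reliable test is still to run the exporter from the bundle in question.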

wombatu-kun avatar Jul 26 '24 03:07 wombatu-kun

Yes, you are right. I will compile the master branch and keep you posted.

Thanking You, Soumil Nitin Shah


soumilshah1995 avatar Jul 26 '24 03:07 soumilshah1995

Tested with the master branch, and the test passed.


spark-submit \
    --class org.apache.hudi.utilities.HoodieSnapshotExporter \
    --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0-beta2 \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-1.0.0-beta2.jar \
    --source-base-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/' \
    --target-output-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/dump/parquet/' \
    --transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
    --transformer-sql "SELECT * FROM <SRC> WHERE destinationstate='NY'" \
    --output-format 'parquet'


soumilshah1995 avatar Jul 26 '24 15:07 soumilshah1995