[FeatureRequest] Inquiry Regarding Hudi Exporter with SQL Transformer for Data Filtering
Hello,
I'm reaching out about the Hudi exporter utility. I have some experience with it, and I'd like to know whether it supports a SQL transformer. The idea is to use the Hudi export utility to export Hudi data, but customers sometimes need to export only filtered data, for example all records for a specific stock such as AAPL. Are there plans to add a filtering mechanism to the Hudi exporter? Here is an example spark-submit command:
spark-submit \
--class org.apache.hudi.utilities.HoodieSnapshotExporter \
--packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0' \
--master 'local[*]' \
--executor-memory 1g \
/Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
--source-base-path 'file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/bronze_orders' \
--target-output-path 'file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/json/' \
--output-format 'json'
I tried these flags:
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf hoodie.deltastreamer.transformer.sql='SELECT *, extract(year from order_date) as year, extract(month from order_date) as month FROM <SRC> a' \
It looks like this is not supported with HoodieSnapshotExporter. Ref: https://hudi.apache.org/docs/snapshot_exporter
Hey Soumil,
Thanks for the suggestion. This could be a good feature to add to the Hudi exporter utility.
Created a tracking JIRA for it: https://issues.apache.org/jira/browse/HUDI-7403
Thanks, Aditya
Thank you sir
Hi there, I saw that this ticket was completed, and I was trying out this functionality.
Docs
Export to json or parquet dataset with transformation/filtering
The Exporter supports custom transformation/filtering on records before writing to a json or parquet dataset. This is done by supplying an implementation of org.apache.hudi.utilities.transform.Transformer via the --transformer-class option.
spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
--transformer-sql "SELECT substr(rider,1,10) as rider, trip_type as tripType FROM <SRC> WHERE trip_type = 'BLACK' LIMIT 10" \
--output-format "json" # or "parquet"
https://hudi.apache.org/docs/next/snapshot_exporter/
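As a rough illustration of how the <SRC> placeholder in --transformer-sql is meant to be read: the transformer is understood to expose the source records as a temporary view and substitute that view's name for <SRC> before running the query. The sketch below only mimics that text substitution in plain shell. The function name render_sql and the view name are hypothetical, and this is an assumption about the behavior, not code taken from Hudi.

```shell
#!/bin/sh
# Sketch of the <SRC> placeholder substitution. This is an assumption about
# how the SQL transformer treats the query text, not Hudi source code.

# render_sql SQL TABLE: print SQL with every <SRC> replaced by TABLE.
render_sql() {
  printf '%s\n' "$1" | sed "s/<SRC>/$2/g"
}

# Hypothetical view name; the real name is generated at runtime.
tmp_view="HOODIE_SRC_TMP_TABLE_example"

# Print the query as Spark SQL would see it after substitution.
render_sql "SELECT * FROM <SRC> WHERE destinationstate='NY'" "$tmp_view"
```

Seen this way, the WHERE clause is ordinary Spark SQL against the snapshot, so any filter that is valid on the table's columns should be usable in the query.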
The following is failing:
Test 1: No transformer (PASS)
spark-submit \
--class org.apache.hudi.utilities.HoodieSnapshotExporter \
--packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0 \
--master 'local[*]' \
--executor-memory 1g \
/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-0.15.0.jar \
--source-base-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/' \
--target-output-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/dump/json/' \
--output-format 'parquet'
Test 2: With transformer
spark-submit \
--class org.apache.hudi.utilities.HoodieSnapshotExporter \
--packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0 \
--master 'local[*]' \
--executor-memory 1g \
/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-0.15.0.jar \
--source-base-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/' \
--target-output-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/dump/json/' \
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--transformer-sql "SELECT * FROM <SRC> WHERE destinationstate='NY'" \
--output-format 'parquet'
Logs
Ivy Default Cache set to: /Users/soumilshah/.ivy2/cache
The jars for the packages stored in: /Users/soumilshah/.ivy2/jars
org.apache.hudi#hudi-spark3.4-bundle_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f0fce1d3-e446-495c-a37a-e2dd7e335611;1.0
confs: [default]
found org.apache.hudi#hudi-spark3.4-bundle_2.12;0.15.0 in central
:: resolution report :: resolve 56ms :: artifacts dl 1ms
:: modules in use:
org.apache.hudi#hudi-spark3.4-bundle_2.12;0.15.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f0fce1d3-e446-495c-a37a-e2dd7e335611
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/2ms)
24/07/25 13:43:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hudi.com.beust.jcommander.ParameterException: Was passed main parameter '--transformer-class' but no main parameter was defined in your arg class
at org.apache.hudi.com.beust.jcommander.JCommander.initMainParameterValue(JCommander.java:954)
at org.apache.hudi.com.beust.jcommander.JCommander.parseValues(JCommander.java:755)
at org.apache.hudi.com.beust.jcommander.JCommander.parse(JCommander.java:356)
at org.apache.hudi.com.beust.jcommander.JCommander.parse(JCommander.java:335)
at org.apache.hudi.com.beust.jcommander.JCommander.<init>(JCommander.java:251)
at org.apache.hudi.utilities.HoodieSnapshotExporter.main(HoodieSnapshotExporter.java:292)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
24/07/25 13:43:59 INFO ShutdownHookManager: Shutdown hook called
24/07/25 13:43:59 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/s_1bjv516pn_mck29cwdwxnm0000gp/T/spark-4431a82f-15b8-4ac6-948a-db853cbf9fe3
(base) soumilshah@ip-192-168-1-31 E1 %
@wombatu-kun can you take care of this?
Yes, of course. I'm already trying to figure it out.
@soumilshah1995 hi! This feature exists only in the master branch, but it looks like you are trying to test it on release 0.15 (hudi-utilities-slim-bundle_2.12-0.15.0.jar), which does not contain this commit. Am I right?
Yes, you are right. I will compile the master branch and keep you posted.
Thanking You, Soumil Nitin Shah
Tested with the master branch; the test passed.
spark-submit \
--class org.apache.hudi.utilities.HoodieSnapshotExporter \
--packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0-beta2 \
--master 'local[*]' \
--executor-memory 1g \
/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-1.0.0-beta2.jar \
--source-base-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/' \
--target-output-path '/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/dump/parquet/' \
--transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
--transformer-sql "SELECT * FROM <SRC> WHERE destinationstate='NY'" \
--output-format 'parquet'
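To summarize the outcome of this thread, here is a small pre-flight sketch that reads the version from the utilities bundle jar name and reports what was observed above for --transformer-class. The function names bundle_version and transformer_flag_status are hypothetical, and the mapping covers only the two versions actually tried here (0.15.0 failed with a ParameterException, 1.0.0-beta2 passed). Any other version is reported as unknown rather than guessed.

```shell
#!/bin/sh
# bundle_version JAR: print the version embedded in a bundle jar file name,
# e.g. hudi-utilities-slim-bundle_2.12-0.15.0.jar -> 0.15.0
bundle_version() {
  basename "$1" .jar | sed 's/^.*_[0-9.]*-//'
}

# transformer_flag_status VERSION: report only what this thread observed.
transformer_flag_status() {
  case "$1" in
    0.15.0)      echo "unsupported" ;;  # failed with ParameterException
    1.0.0-beta2) echo "supported" ;;    # export with the transformer passed
    *)           echo "unknown" ;;      # not tried in this thread
  esac
}

jar="hudi-utilities-slim-bundle_2.12-0.15.0.jar"
version=$(bundle_version "$jar")
echo "$version: --transformer-class is $(transformer_flag_status "$version")"
```

A check like this could run before spark-submit so that a mismatched bundle is caught early instead of surfacing as a JCommander error.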