hudi
hudi copied to clipboard
[HUDI-3287] Remove hudi-spark dependencies from hudi-kafka-connect-bundle
What is the purpose of the pull request
hudi-spark-* are not needed in hudi-kafka-connect-bundle. This PR removes those dependencies.
NOTE: hudi-aws is still needed because of dependency of CloudWatchReporter
in MetricsReporterFactory
.
Brief change log
(for example:)
- Modify AnnotationLocation checkstyle rule in checkstyle.xml
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
- Added integration tests for end-to-end.
- Added HoodieClientWriteTest to verify the change.
- Manually verified the change by running a job locally.
Committer checklist
-
[ ] Has a corresponding JIRA in PR title & commit
-
[ ] Commit message is descriptive of the change
-
[ ] CI is green
-
[ ] Necessary doc changes done or have another open PR
-
[ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
@rmahindra123 Can you please review this?
I tested this patch. But, it gives the below error in step 6 of the setup. It happens even with latest master cc @yihua @rmahindra123
ERROR WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:184)
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.hudi.connect.HoodieSinkTask.start(HoodieSinkTask.java:80)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:308)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 10 more
I hit this issue while testing this change with Kafka Connect sink connector for Hudi:
[2022-09-16 23:44:48,156] ERROR [hudi-sink|task-3] WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:193)
java.lang.NoClassDefFoundError: org/apache/hudi/keygen/CustomKeyGenerator
at org.apache.hudi.connect.utils.KafkaConnectUtils.getPartitionColumns(KafkaConnectUtils.java:189)
at org.apache.hudi.connect.writers.KafkaConnectTransactionServices.<init>(KafkaConnectTransactionServices.java:90)
at org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.<init>(ConnectTransactionCoordinator.java:88)
at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191)
at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151)
at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:635)
at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:71)
at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:700)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:293)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:430)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:450)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:366)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:508)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1262)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:452)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:324)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:186)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
KafkaConnectUtils
in hudi-kafka-connect
uses CustomKeyGenerator
from hudi-spark-client module
. @rmahindra123 looks like we cannot get rid of hudi-spark dependencies here?
I hit this issue while testing this change with Kafka Connect sink connector for Hudi:
[2022-09-16 23:44:48,156] ERROR [hudi-sink|task-3] WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:193) java.lang.NoClassDefFoundError: org/apache/hudi/keygen/CustomKeyGenerator at org.apache.hudi.connect.utils.KafkaConnectUtils.getPartitionColumns(KafkaConnectUtils.java:189) at org.apache.hudi.connect.writers.KafkaConnectTransactionServices.<init>(KafkaConnectTransactionServices.java:90) at org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.<init>(ConnectTransactionCoordinator.java:88) at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191) at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151)
KafkaConnectUtils
inhudi-kafka-connect
usesCustomKeyGenerator
fromhudi-spark-client module
. @rmahindra123 looks like we cannot get rid of hudi-spark dependencies here?
@yihua @rmahindra123 I think we should remove the usage of CustomKeyGenerator
. Anyway, it's marked deprecated. I have made changes to that effect but have not tested yet. Let me know what you think.
CI report:
- b4db097e1831a79bd0ee2097b98ef3ac855ee31a Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:-
@hudi-bot run azure
re-run the last Azure build
verified locally with the bundle jar and it's working ok