hudi icon indicating copy to clipboard operation
hudi copied to clipboard

[HUDI-3287] Remove hudi-spark dependencies from hudi-kafka-connect-bundle

Open codope opened this issue 1 year ago • 5 comments

What is the purpose of the pull request

hudi-spark-* are not needed in hudi-kafka-connect-bundle. This PR removes those dependencies. NOTE: hudi-aws is still needed because of dependency of CloudWatchReporter in MetricsReporterFactory.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • [ ] Has a corresponding JIRA in PR title & commit

  • [ ] Commit message is descriptive of the change

  • [ ] CI is green

  • [ ] Necessary doc changes done or have another open PR

  • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

codope avatar Jul 11 '22 06:07 codope

@rmahindra123 Can you please review this?

codope avatar Jul 11 '22 07:07 codope

I tested this patch. But, it gives the below error in step 6 of the setup. It happens even with latest master cc @yihua @rmahindra123

ERROR WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:184)
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
	at org.apache.hudi.connect.HoodieSinkTask.start(HoodieSinkTask.java:80)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:308)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 10 more

codope avatar Jul 18 '22 07:07 codope

I hit this issue while testing this change with Kafka Connect sink connector for Hudi:

[2022-09-16 23:44:48,156] ERROR [hudi-sink|task-3] WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:193)
java.lang.NoClassDefFoundError: org/apache/hudi/keygen/CustomKeyGenerator
	at org.apache.hudi.connect.utils.KafkaConnectUtils.getPartitionColumns(KafkaConnectUtils.java:189)
	at org.apache.hudi.connect.writers.KafkaConnectTransactionServices.<init>(KafkaConnectTransactionServices.java:90)
	at org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.<init>(ConnectTransactionCoordinator.java:88)
	at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191)
	at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:635)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:71)
	at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:700)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:293)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:430)
	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:450)
	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:366)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:508)
	at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1262)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:452)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:324)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:186)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:241)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

KafkaConnectUtils in hudi-kafka-connect uses CustomKeyGenerator from hudi-spark-client module. @rmahindra123 looks like we cannot get rid of hudi-spark dependencies here?

yihua avatar Sep 17 '22 06:09 yihua

I hit this issue while testing this change with Kafka Connect sink connector for Hudi:

[2022-09-16 23:44:48,156] ERROR [hudi-sink|task-3] WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:193)
java.lang.NoClassDefFoundError: org/apache/hudi/keygen/CustomKeyGenerator
	at org.apache.hudi.connect.utils.KafkaConnectUtils.getPartitionColumns(KafkaConnectUtils.java:189)
	at org.apache.hudi.connect.writers.KafkaConnectTransactionServices.<init>(KafkaConnectTransactionServices.java:90)
	at org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.<init>(ConnectTransactionCoordinator.java:88)
	at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191)
	at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151)

KafkaConnectUtils in hudi-kafka-connect uses CustomKeyGenerator from hudi-spark-client module. @rmahindra123 looks like we cannot get rid of hudi-spark dependencies here?

@yihua @rmahindra123 I think we should remove the usage of CustomKeyGenerator. Anyway, it's marked deprecated. I have made changes to that effect but have not tested yet. Let me know what you think.

codope avatar Sep 21 '22 14:09 codope

CI report:

  • b4db097e1831a79bd0ee2097b98ef3ac855ee31a Azure: SUCCESS
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Sep 21 '22 16:09 hudi-bot

verified locally with the bundle jar and it's working ok

xushiyan avatar Oct 28 '22 10:10 xushiyan