
[SUPPORT]HoodieException: Commit 20220509105215 failed and rolled-back ! at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:492)

Open nleena123 opened this issue 2 years ago • 23 comments

Hi Team, our Hudi jobs are failing with the below exception. Can you please help me with this?

HoodieException: Commit 20220509105215 failed and rolled-back ! at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:492)

Environment Description

  • Hudi version : 0.8.0

  • Spark version : 3.2.1

  • Storage (HDFS/S3/GCS..) : Azure Blob Storage

We are running Apache Hudi on Spark in Azure Databricks, using the below dependencies:

  • org.apache.spark:spark-avro_2.12:3.1.1
  • org.apache.hudi:hudi-utilities-bundle_2.12:0.8.0
  • org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0

nleena123 avatar May 09 '22 11:05 nleena123

Can you share the entire stacktrace?

pratyakshsharma avatar May 09 '22 13:05 pratyakshsharma

@pratyakshsharma Hi, what is a stacktrace? I am not aware of it. Can you please help with more details?

nleena123 avatar May 10 '22 13:05 nleena123

@nleena123 I meant to say please share the detailed error log and not just the exception message. :)

pratyakshsharma avatar May 10 '22 13:05 pratyakshsharma

Hi @pratyakshsharma

Please find the attached complete error log details.

metrics_job_logs.txt

nleena123 avatar May 11 '22 05:05 nleena123

@nsivabalan Can you help us? This is happening on our production server.

nleena123 avatar May 12 '22 13:05 nleena123

Can you furnish the write configs you use to write to Hudi?

nsivabalan avatar May 12 '22 23:05 nsivabalan

#hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.hive_styling_partioning=true
hoodie.datasource.write.partitionpath.urlencode=false

#hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.recordkey.field=interaction_id
hoodie.datasource.write.partitionpath.field=create_datetime:TIMESTAMP
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyyMMdd HH:mm
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM
hoodie.deltastreamer.keygen.timebased.timezone=UTC-6:00
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
#20211119 13:39
#hoodie.index.type=GLOBAL_BLOOM
#hoodie.bloom.index.update.partition.path=true

schema provider configs

#hoodie.deltastreamer.schemaprovider.registry.url=https://azure-prod-schema-registry.extscloud.com/subjects/async.messaging.interaction.metrics-value/versions/3
hoodie.deltastreamer.schemaprovider.registry.url=https://azure-prod-schema-registry.extscloud.com/subjects/async.messaging.interaction.metrics-value/versions/latest

Kafka Source

hoodie.deltastreamer.source.kafka.topic=async.messaging.interaction.metrics
#Kafka props
#hoodie.auto.commit=true
enable.auto.commit=true
#hoodie.deltastreamer.kafka.source.maxEvents=1000000
group.id=async.messaging.metrics
bootstrap.servers=prod-hcc-kaas-central-13.centralus.cloudapp.azure.com:443
#metadata.broker.list=kaas-prod-elr-a.optum.com:443
auto.offset.reset=latest
auto.reset.offsets=latest
#schema.registry.url=https://azure-test-schema-registry.extscloud.com
schema.registry.url=https://azure-prod-schema-registry.extscloud.com:443
schema.registry.ssl.keystore.location=/dbfs/mnt/ulp/artifacts/topics/certs/keystore.jks
schema.registry.ssl.keystore.password=prmcert
schema.registry.ssl.truststore.location=/dbfs/mnt/ulp/artifacts/topics/certs/truststore.jks
schema.registry.ssl.truststore.password=prmcert
schema.registry.ssl.key.password=prmcert
security.protocol=SSL
sasl.mechanism=PLAIN
ssl.keystore.location=/dbfs/mnt/ulp/artifacts/topics/certs/keystore.jks
ssl.keystore.password=*******
ssl.truststore.location=/dbfs/mnt/ulp/artifacts/topics/certs/truststore.jks
ssl.truststore.password=**********
ssl.key.password=prmcert

nleena123 avatar May 13 '22 05:05 nleena123

Sorry for being late here. I see the class DFAvroKafkaSource in the logs. Do you have your own custom class for reading the data?

pratyakshsharma avatar May 13 '22 06:05 pratyakshsharma

Yes, we are using the below DFAvroKafkaSource.java file to read the data, which extends AvroSource, along with DF3HoodieDeltaStreamer.java, which extends HoodieDeltaStreamer. I have attached the pom.xml file in code.zip; please have a look at it and let me know if any changes are required in my code.

nleena123 avatar May 13 '22 07:05 nleena123

And can you share your DeltaStreamer configs, please?

nsivabalan avatar May 13 '22 16:05 nsivabalan

Are you trying this out for the first time and running into issues, or was this running OK for some time and you suddenly hit this after upgrading or changing any configs?

nsivabalan avatar May 13 '22 16:05 nsivabalan

Hi @nsivabalan, the below attached property file contains all the configs that we used for this job. And we passed the below arguments to the Databricks job (we are running the Hudi job through Azure Databricks).

metrics.properties.txt

[
  "--table-type", "COPY_ON_WRITE",
  "--source-ordering-field", "CDC_TS",
  "--source-class", "com.optum.df.hudi.sources.DFAvroKafkaSource",
  "--target-base-path", "/mnt/ulp/dataassets-lake/metrics/",
  "--target-table", "metrics",
  "--schemaprovider-class", "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
  "--props", "/mnt/ulp/artifacts/properties/metrics.properties"
]

Note: No, we didn't change any configs.

nleena123 avatar May 16 '22 06:05 nleena123

I see hoodie.datasource.write.hive_styling_partioning should be changed to hoodie.datasource.write.hive_style_partitioning. Looking at your timeline, 20211211183554__commit__COMPLETED is the last time the job ran successfully; after that you have been running into issues. So we want to understand whether you changed anything since this last successful commit.

pratyakshsharma avatar May 17 '22 07:05 pratyakshsharma

@pratyakshsharma

I changed it to hoodie.datasource.write.hive_style_partitioning=true and ran the job; I still see the same issue. I have not made any changes since 20211211183554__commit__COMPLETED.

nleena123 avatar May 17 '22 09:05 nleena123

Yes, changing it to hoodie.datasource.write.hive_style_partitioning will not fix your issue. That was just a typo in your configs (unrelated to this error), so I simply highlighted it.
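For reference, the corrected line as it would appear in the properties file shared earlier in this thread (the value is taken from the shared configs):

```
# was: hoodie.datasource.write.hive_styling_partioning=true
hoodie.datasource.write.hive_style_partitioning=true
```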

pratyakshsharma avatar May 17 '22 10:05 pratyakshsharma

Also, I guess there is some confusion between the code you shared and the logs you shared. In the code I see multiple log statements which are not present in the log file you shared. Also, the schema being registered in the logs does not have the CDC_TS field. Can you share the latest logs? That might help in debugging the issue better.

pratyakshsharma avatar May 17 '22 10:05 pratyakshsharma

We are adding the new field CDC_TS to the Kafka data through code present in the DF3HoodieDeltaStreamer.java program.

Code:

	static GenericRecord getDataWithCDC_TS(GenericRecord record, String ts) {
		Schema base = record.getSchema();
		// Copy the existing fields (Schema.Field instances cannot be reused across schemas).
		List<Schema.Field> baseFields = base.getFields().stream()
				.map(field -> {
					Schema.Field f = new Schema.Field(field.name(), field.schema(), field.doc(), field.defaultVal());
					field.aliases().forEach(f::addAlias);
					return f;
				})
				.collect(Collectors.toList());
		List<Schema.Field> newFields = new ArrayList<>(baseFields);

		// CDC_TS is an optional string: a union of null and string.
		ArrayList<Schema> optionalString = new ArrayList<>();
		optionalString.add(Schema.create(Schema.Type.NULL));
		optionalString.add(Schema.create(Schema.Type.STRING));
		//JsonNode nullDefault = JsonNodeFactory.instance.nullNode();
		//Schema.Field f = new Schema.Field(CDC_TS, Schema.createUnion(optionalString), CDC_TS, nullDefault);
		Schema.Field f = new Schema.Field(CDC_TS, Schema.createUnion(optionalString), CDC_TS, null);
		if (!baseFields.contains(f)) {
			newFields.add(f);
		}

		// Build the evolved schema containing the extra field.
		Schema newSchema = Schema.createRecord(
				base.getName(),
				"New schema by adding a new field, CDC_TS",
				base.getNamespace(), base.isError(),
				newFields);
		LOG.info("newSchema is: " + newSchema);

		// Re-populate the record against the new schema and set CDC_TS.
		GenericRecord newRecord = new GenericData.Record(newSchema);
		base.getFields().forEach(field -> newRecord.put(field.name(), record.get(field.name())));
		newRecord.put(CDC_TS, ts);

		return newRecord;
	}
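Since an Avro schema is itself a JSON document, the schema-evolution step above can be sketched independently of the Java API. The following is a minimal Python illustration (the record and field names are hypothetical, mirroring this thread) of appending an optional, null-defaulted CDC_TS field; note that for a null default to be valid Avro, the null branch must come first in the union:

```python
import json

def add_optional_field(schema_json: str, field_name: str) -> str:
    """Append an optional (nullable, null-defaulted) string field
    to an Avro record schema expressed as JSON."""
    schema = json.loads(schema_json)
    # Skip if the field already exists, mirroring the baseFields.contains check.
    if any(f["name"] == field_name for f in schema["fields"]):
        return json.dumps(schema)
    # Avro requires the default to match the FIRST branch of the union,
    # so ["null", "string"] with default null is valid.
    schema["fields"].append(
        {"name": field_name, "type": ["null", "string"], "default": None}
    )
    return json.dumps(schema)

base = json.dumps({
    "type": "record", "name": "Metrics", "namespace": "example",
    "fields": [{"name": "interaction_id", "type": "string"}],
})
evolved = json.loads(add_optional_field(base, "CDC_TS"))
print([f["name"] for f in evolved["fields"]])  # ['interaction_id', 'CDC_TS']
```

One thing worth double-checking in the Java version: passing null as the fourth argument of Schema.Field means "no default" rather than "default of null", which can matter for compatibility checks against a schema registry.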

I have attached the latest log file; please have a look at it. log4j-active_latest.txt

nleena123 avatar May 17 '22 11:05 nleena123

I have the same problem, and I haven't solved it yet

qianchutao avatar May 20 '22 10:05 qianchutao

Hi @xushiyan, @pratyakshsharma, I have been seeing this issue for a very long time. Can you please give an update, or suggest resolution steps?

nleena123 avatar May 31 '22 15:05 nleena123

Hi Team, can you please update me on the above issue?

nleena123 avatar Jun 06 '22 06:06 nleena123

Hi @qianchutao, were you able to fix the issue? And can you please let me know which version of Hudi you are using?

nleena123 avatar Jun 07 '22 10:06 nleena123

I am also facing a similar exception, using EMR on AWS with Kafka. The exception is not self-explanatory about whether a config is missing or what exactly is wrong.

Exception:

22/08/09 07:30:39 INFO TaskSetManager: Finished task 0.0 in stage 66.0 (TID 2033) in 113 ms on ip-10-151-46-136.us-west-2.compute.internal (executor 1) (1/2)
22/08/09 07:30:39 INFO TaskSetManager: Finished task 1.0 in stage 66.0 (TID 2034) in 140 ms on ip-10-151-46-136.us-west-2.compute.internal (executor 1) (2/2)
22/08/09 07:30:39 INFO YarnScheduler: Removed TaskSet 66.0, whose tasks have all completed, from pool
22/08/09 07:30:39 INFO DAGScheduler: ResultStage 66 (collectAsMap at HoodieSparkEngineContext.java:151) finished in 0.151 s
22/08/09 07:30:39 INFO DAGScheduler: Job 32 is finished. Cancelling potential speculative or zombie tasks for this job
22/08/09 07:30:39 INFO YarnScheduler: Killing all running tasks in stage 66: Stage finished
22/08/09 07:30:39 INFO DAGScheduler: Job 32 finished: collectAsMap at HoodieSparkEngineContext.java:151, took 0.153794 s
22/08/09 07:30:39 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to exception
org.apache.hudi.exception.HoodieException: Commit 20220809073005189 failed and rolled-back !
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:649)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:331)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:675)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
22/08/09 07:30:39 ERROR HoodieAsyncService: Service shutdown with error
java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: Commit 20220809073005189 failed and rolled-back !
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:189)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:186)
	at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.sync(HoodieMultiTableDeltaStreamer.java:416)
	at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.main(HoodieMultiTableDeltaStreamer.java:247)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Commit 20220809073005189 failed and rolled-back !
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:709)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: Commit 20220809073005189 failed and rolled-back !
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:649)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:331)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:675)

ROOBALJINDAL avatar Aug 09 '22 07:08 ROOBALJINDAL

> Hi @qianchutao, were you able to fix the issue? And can you please let me know which version of Hudi you are using?

Were you able to fix this issue?

ROOBALJINDAL avatar Aug 09 '22 07:08 ROOBALJINDAL

@nleena123 @ROOBALJINDAL: looks like the stacktrace does not have sufficient info to triage. So, can you folks enable debug logs and share what you see? We should get some insight into what's happening. Also, if we can somehow get reproducible code, it would really help us triage this faster, since timeline events are very tough to grok by manually looking at the logs.

nsivabalan avatar Aug 12 '22 02:08 nsivabalan

@nsivabalan @pratyakshsharma I tried enabling debug logs by changing log4j.rootLogger=INFO to log4j.rootLogger=DEBUG in the Hudi jar for the following files:

log4j.properties
log4j-surefire.properties
log4j-surefire-quiet.properties

Also updated some other properties in these files:

log4j.rootLogger=DEBUG, CONSOLE
log4j.logger.org.apache=DEBUG
log4j.logger.org.apache.hudi=DEBUG

I am using this custom hudi jar by following spark command:

spark-submit  \
  --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
  s3://slava-redshift-test/hudi/roobal-hudi-Debug-logger/hudi-utilities-bundle_2.12-0.11.0_edfx_debug.jar \

I have also tried this:

spark-submit  \
  --conf spark.executor.extraJavaOptions=-Dlog4j.debug=true \

Can you help with what I am missing?

Note: I am using EMR cluster on AWS.

ROOBALJINDAL avatar Aug 12 '22 07:08 ROOBALJINDAL

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

# Example
log4j.logger.org.apache.hudi=DEBUG

The above is a sample log4j file; you can point Spark at it with:

--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/log4j.properties" --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/log4j.properties"

nsivabalan avatar Aug 28 '22 02:08 nsivabalan

@ROOBALJINDAL: can we have an update?

nsivabalan avatar Sep 01 '22 17:09 nsivabalan

@nsivabalan yeah, the issue is resolved for me. I logged a separate issue and was able to get help and resolve it. Thank you, team 👍 And FYI, this original issue was logged by someone else, so I can't close it.

ROOBALJINDAL avatar Sep 01 '22 17:09 ROOBALJINDAL

Oh OK, got it, thanks. May I know which issue you are referring to?

nsivabalan avatar Sep 04 '22 17:09 nsivabalan

@nleena123: are you still facing issues, or are we good to close this out?

nsivabalan avatar Sep 04 '22 17:09 nsivabalan