[SUPPORT] AwsGlueCatalogSyncTool -The number of partition keys do not match the number of partition values
My job is just a wrapper around HoodieDeltaStreamer (yes, there are probably better ways to do this).
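import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;

// Thin wrapper: forwards all command-line arguments to HoodieDeltaStreamer.
public class SparkHudiPoc {
    public static void main(String[] args) throws Exception {
        HoodieDeltaStreamer.main(args);
    }
}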
From pom.xml:
<properties>
    <!-- DEPENDENCY VERSIONS -->
    <hudi.version>0.11.1</hudi.version>
    <scala.version>2.12.10</scala.version>
    <spark.version>3.1.2</spark.version>
    <aws-java-sdk.version>1.12.257</aws-java-sdk.version>
    <hadoop.version>3.2.1</hadoop.version>
    <parquet.version>1.10.0</parquet.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-utilities-bundle_2.12</artifactId>
        <version>${hudi.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>${parquet.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
        <exclusions>
            <exclusion>
                <groupId>com.amazonaws</groupId>
                <artifactId>aws-java-sdk-bundle</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>${aws-java-sdk.version}</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-sts</artifactId>
        <version>${aws-java-sdk.version}</version>
    </dependency>
</dependencies>
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</dependencyManagement>
spark-submit \
  --master yarn \
  --deploy-mode client \
  s3://path-to/my-fat-jar.jar \
  --enable-sync \
  --disable-compaction \
  --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
  --min-sync-interval-seconds 60 \
  --op UPSERT \
  --payload-class org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload \
  --source-class org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource \
  --source-ordering-field _event_origin_ts_ms \
  --table-type MERGE_ON_READ \
  --target-base-path s3://my-bucket/path/table_name \
  --target-table table_name \
  --continuous \
  --hoodie-conf auto.offset.reset=earliest \
  --hoodie-conf bootstrap.servers=kafka-server:9092 \
  --hoodie-conf group.id=spark-hudi-poc \
  --hoodie-conf schema.registry.url=http://registry:8081 \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://registry:8081/subjects/CDC-value/versions/latest \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic=CDC \
  --hoodie-conf hoodie.datasource.hive_sync.database=spark-hudi-poc \
  --hoodie-conf hoodie.datasource.hive_sync.skip_ro_suffix=true \
  --hoodie-conf hoodie.datasource.hive_sync.table=table_name \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
Environment Description
- Hudi version : 0.11.1 (fat jar)
- EMR 6.5.0
- Spark version : 3.1.2
- Hive version : 3.1.2
- Hadoop version : Amazon 3.2.1
- Storage : S3
- Running on Docker? : no
Stacktrace
22/08/02 16:46:49 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to exception
org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:715)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:634)
at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:333)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:679)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing table_name
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:143)
at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:59)
... 8 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table table_name
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:414)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232)
at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:156)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:140)
... 9 more
Caused by: org.apache.hudi.aws.sync.HoodieGlueSyncException: Fail to add partitions to spark-hudi-poc.table_name
at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.addPartitionsToTable(AWSGlueCatalogSyncClient.java:147)
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:397)
... 12 more
Caused by: com.amazonaws.services.glue.model.InvalidInputException: The number of partition keys do not match the number of partition values (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 00f4d354-50a0-4b98-bce4-bab5569339c8; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
at com.amazonaws.services.glue.AWSGlueClient.executeBatchCreatePartition(AWSGlueClient.java:259)
at com.amazonaws.services.glue.AWSGlueClient.batchCreatePartition(AWSGlueClient.java:228)
at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.addPartitionsToTable(AWSGlueCatalogSyncClient.java:139)
... 13 more
@zhedoubushishi @rahil-c could you guys help here?
@crutis you can actually troubleshoot this by writing a program with the AWS SDK that mimics org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient#addPartitionsToTable. The list of partition values is logged by org.apache.hudi.hive.HiveSyncTool#syncPartitions, so you already have the input to that call. This needs some debugging to see what exactly the mismatch between partition keys and partition values is. Let us know if you find anything. Also, have you filed an AWS support case?
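Before calling Glue at all, the logged partition values can be checked locally against the table's partition keys. The following is a minimal sketch of that check, not Hudi's or Glue's actual validation code: the class name PartitionArityCheck, the method findMismatches, and the sample values are hypothetical, and it assumes the Glue table was created with a single partition key (createdDate).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: flags partitions whose value count differs from the key count.
final class PartitionArityCheck {

    /** Returns every partition whose number of values differs from the number of partition keys. */
    static List<List<String>> findMismatches(List<String> partitionKeys,
                                             List<List<String>> partitionValues) {
        List<List<String>> mismatches = new ArrayList<>();
        for (List<String> values : partitionValues) {
            if (values.size() != partitionKeys.size()) {
                mismatches.add(values);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Assumption: the table has one partition key, matching
        // hoodie.datasource.write.partitionpath.field=createdDate.
        List<String> keys = Arrays.asList("createdDate");
        // Hypothetical sample values; substitute the lists logged by HiveSyncTool#syncPartitions.
        List<List<String>> values = Arrays.asList(
                Arrays.asList("2022", "08", "02"),
                Arrays.asList("2022-08-02"));
        for (List<String> bad : findMismatches(keys, values)) {
            System.out.println("keys=" + keys.size() + " values=" + bad.size() + " " + bad);
        }
    }
}
```

Any partition this reports would presumably be rejected by Glue with the InvalidInputException from the stack trace; partitions with matching counts can then be replayed against the real batchCreatePartition call.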
No support ticket with AWS yet, I'll check this out and let you know what I see, thanks!
@crutis : do you have any updates on this?
I'm sorry, this work got de-prioritized for a while. I may get a chance to work on this in the next two weeks, but more likely 4 weeks.
@xushiyan : can you follow up on this?
@crutis some recent fixes for Glue sync have landed in master. If you get a chance, could you quickly try master and see whether the issue is resolved?
@crutis gentle ping to try out the master branch.
@crutis After revisiting this issue, I found that there is a bug when TimestampBasedKeyGenerator is used with an output date format containing slashes. I can reproduce your issue on EMR with Glue Data Catalog. The root cause is that the partition value extraction does not fetch the right values, which causes the meta sync to fail.
I've put up a PR to fix this issue and improve usability: #6851. You can check the PR description for more details. I've tested that the exception above no longer shows up and that the Glue sync succeeds once the fix is applied.
To get unblocked, you can also set hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd to work around the bug.
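To illustrate why the date format matters, here is a small sketch comparing the two output formats for the same timestamp. It uses java.text.SimpleDateFormat as a stand-in rather than Hudi's own formatter, and the class name PartitionPathFormats, its methods, and the sample timestamp are hypothetical. The slash format produces a three-level partition path, while the dash format keeps the date as a single value.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper: compares partition paths produced by two output date formats.
final class PartitionPathFormats {

    /** Formats epoch milliseconds with the given pattern in the GMT time zone. */
    static String format(long epochMillis, String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(epochMillis));
    }

    /** Counts the directory levels in a partition path by splitting on '/'. */
    static int levels(String partitionPath) {
        return partitionPath.split("/").length;
    }

    public static void main(String[] args) {
        long ts = 1659441600000L; // arbitrary sample: 2022-08-02 12:00:00 GMT
        String slashed = format(ts, "yyyy/MM/dd");
        String dashed = format(ts, "yyyy-MM-dd");
        System.out.println(slashed + " -> " + levels(slashed) + " level(s)"); // 2022/08/02 -> 3 level(s)
        System.out.println(dashed + " -> " + levels(dashed) + " level(s)");   // 2022-08-02 -> 1 level(s)
    }
}
```

With a single partition field such as createdDate, only the dash format yields one value per partition, which is consistent with the workaround above.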
@crutis Let us know if that solves your problem.
@crutis closing this as explained by @yihua. Let us know how it works.