
AWS Glue with Iceberg Source and Delta Target

Open ForeverAngry opened this issue 11 months ago • 41 comments

I keep getting a version-hint.text error when trying to use the .jar file to sync an existing Iceberg table to Delta, using AWS Glue as my catalog. I'm using the my_config.yaml file as directed in the docs, but I feel like I'm missing something.

For instance:

sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  -
    # I have this set to the parent path that holds the metadata, data, and _delta_log directories; I've tried other configurations as well. The _delta_log directory gets created, but nothing gets created in it, because I keep getting the error indicated above.
    tableBasePath: s3://path
    tableName: my_table

Do I need to create a catalog.yaml and clients.yaml file as well? I noticed these, but I thought they were only for implementing custom sources other than Hudi, Iceberg, or Delta.

If I do need to create these files, what would I set catalogImpl: to for a source Iceberg table that is managed by AWS Glue?

catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
  key1: value1
  key2: value2

Thanks in advance!

ForeverAngry avatar Feb 29 '24 23:02 ForeverAngry

Yes you will need a catalog yaml to specify which catalog implementation you are using for the Iceberg source. If none is specified, it assumes you're not using an external catalog. Once we get this working it would be good to add some feedback or submit a PR for updates to the README since this part is not so clear right now :)

the-other-tim-brown avatar Mar 01 '24 18:03 the-other-tim-brown

Yes you will need a catalog yaml to specify which catalog implementation you are using for the Iceberg source. If none is specified, it assumes you're not using an external catalog. Once we get this working it would be good to add some feedback or submit a PR for updates to the README since this part is not so clear right now :)

Hi, thank you! You will have to forgive me because I'm about to sound dumb, but what would I put here

'catalogImpl: io.my.CatalogImpl'

if I am using AWS managed Glue with Iceberg? And what keys would I need to provide?

ForeverAngry avatar Mar 01 '24 19:03 ForeverAngry

@the-other-tim-brown - I am also stuck on this particular issue; do you have an example of what this config would look like? Thanks.

MrDerecho avatar Mar 01 '24 19:03 MrDerecho

@MrDerecho to catalog with glue while running sync, you can look at this example: https://medium.com/@sagarlakshmipathy/using-onetable-to-translate-a-hudi-table-to-iceberg-format-and-sync-with-glue-catalog-8c3071f08877

sagarlakshmipathy avatar Mar 01 '24 22:03 sagarlakshmipathy
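
For reference, a minimal catalog.yaml sketch for an Iceberg source registered in AWS Glue could look like the following. This is illustrative only: it assumes the stock GlueCatalog and S3FileIO classes from Iceberg's iceberg-aws module, and the exact catalogOptions keys depend on your environment.

catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions: # standard Iceberg catalog properties, passed through as a map
  warehouse: s3://path/to/warehouse
  io-impl: org.apache.iceberg.aws.s3.S3FileIO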

@MrDerecho to catalog with glue while running sync, you can look at this example: https://medium.com/@sagarlakshmipathy/using-onetable-to-translate-a-hudi-table-to-iceberg-format-and-sync-with-glue-catalog-8c3071f08877

@sagarlakshmipathy thanks for this info! If I'm running OneTable in a Docker container, do I need to use JDK 8 or 11? I noticed that the docs say 11, but the pom file that builds the utilities jar targets 8. Maybe you can help demystify which JDK should be used for build vs. run?

ForeverAngry avatar Mar 02 '24 12:03 ForeverAngry

@ForeverAngry we are currently using Java 11 for development and building the jars, but we set a lower target so that projects that are still on Java 8 can use the jar. Your runtime environment just needs to be Java 8 or above for a Java 8 jar to work properly.

the-other-tim-brown avatar Mar 02 '24 20:03 the-other-tim-brown

Maybe I'm doing something wrong here, but when I try to convert Iceberg to Delta using a Glue catalog, I keep getting an error that the class constructor for catalog.Catalog is not found. I've tried adding the iceberg-aws-bundle to the classpath, as well as many other jars, but still the same error. I'm executing the OneTable sync jar from a Docker image. Maybe the fact that it's a Docker image is causing issues? I'm not sure. Does anyone have examples of specific versions of jars that should be used when converting from an Iceberg source managed by Glue to Delta, that they know work? @the-other-tim-brown @sagarlakshmipathy any help would be very welcome. I'd love to get this tool working!

ForeverAngry avatar Mar 04 '24 01:03 ForeverAngry

@ForeverAngry what is the error message you are seeing and how are you creating this docker image?

the-other-tim-brown avatar Mar 04 '24 04:03 the-other-tim-brown

@ForeverAngry what is the error message you are seeing and how are you creating this docker image?

@the-other-tim-brown Well, it was for sure an error with the image. But the error I'm getting now is related to the partition. I have a column called `timestamp` that I use an identity partition transform based on `hour` on, and now I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;

Any thoughts on how to resolve this error?

ForeverAngry avatar Mar 04 '24 17:03 ForeverAngry

@ForeverAngry when you say you have an "identity transformation" on the column, what do you mean by that? It sounds like you are using the hour transform from the Iceberg partitioning docs.

The exception looks like there could be an issue in how the source schema is converted. When converting Iceberg's "hidden partitioning" to Delta Lake, we use a generated column to represent the partition values.

the-other-tim-brown avatar Mar 04 '24 22:03 the-other-tim-brown
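
To illustrate that mapping (this is a sketch, not the exact metadata the tool writes): for Iceberg's hour(timestamp) transform, the Delta schema ends up with a string column whose value is generated from the source timestamp via Delta's delta.generationExpression column metadata, roughly like this (the column name here matches the one visible in the stack traces later in the thread):

name: onetable_partition_col_HOUR_timestamp # generated partition column
type: string
nullable: true
metadata:
  delta.generationExpression: "date_format(timestamp, 'yyyy-MM-dd-HH')"

The AnalysisException in this thread is Delta validating that expression against the schema it was handed, where the `timestamp` column arrived as bigint rather than timestamp.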

@the-other-tim-brown yeah I misspoke in my comment, you are correct, I'm using the hour transform.

ForeverAngry avatar Mar 05 '24 00:03 ForeverAngry

OK, can you attach more of the stack trace so I can see where exactly that error is thrown?

the-other-tim-brown avatar Mar 05 '24 16:03 the-other-tim-brown

@the-other-tim-brown yeah here is the error:

2024-03-05 20:30:06 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;

ForeverAngry avatar Mar 05 '24 20:03 ForeverAngry

@the-other-tim-brown - additional context: the column name is "timestamp", and it is partitioned using the "hour" transform. Note: when I query the partitions they appear to be stored as Unix hours, but the logical partition paths in S3 are formed as "yyyy-MM-dd-HH".

MrDerecho avatar Mar 05 '24 20:03 MrDerecho
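
To make that concrete: Iceberg's hour transform stores the partition value as whole hours since the Unix epoch, so 2024-03-08 21:00 UTC is stored as 1709931600 / 3600 = 474981, while the human-readable path segment is rendered as timestamp_hour=2024-03-08-21. The Delta side then has to re-derive the formatted string from the source column, which is the date_format expression seen in the error.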

@the-other-tim-brown yeah here is the error:

2024-03-05 20:30:06 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;

Is this the full stack trace? If so, I'll need to investigate why we're not getting more error content in the logs.

the-other-tim-brown avatar Mar 06 '24 02:03 the-other-tim-brown

@the-other-tim-brown that's not the full stack trace, just the error. The machine the program is on is separate and can't post here, so I have to type a lot of this in by hand.

Perhaps you could just let me know: does OneTable have the ability to take a timestamp field, stored as a timestamp, that uses the hour transform, from an Iceberg source and convert it to Delta, or is that not supported? I guess this is what I really want to know.

ForeverAngry avatar Mar 06 '24 15:03 ForeverAngry

This is definitely a bug somewhere in the partition handling logic. We want to support this case, so it just requires developing an end-to-end test case in the repo to reproduce the issue in a way that is easier to debug.

the-other-tim-brown avatar Mar 06 '24 17:03 the-other-tim-brown

@the-other-tim-brown I was able to get the full stack trace printed out of the machine. I gave most of the fields random names like "f1, f2, ..." except for the field the data is partitioned on using the hour transform from Iceberg (the one that is failing). I also redacted the S3 path names.

Just to restate: I'm going from an existing Iceberg table in AWS, using S3 and Glue, to Delta (not Databricks Delta). The existing Iceberg table was created by Athena.

Take a look and let me know what you think or if anything stands out! Any help would be awesome. Other partition functions work fine. The project is really awesome!

2024-03-08 21:24:59 INFO  org.apache.spark.sql.delta.DeltaLog:57 - No delta log found for the Delta table at s3://<redacted>
2024-03-08 21:24:59 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=0e175eb3-ad57-4432-8d71-cd3d58b839a2] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(081118d2-4985-476f-9za6-07b80359891c,null,null,Format(parquet,Map()),null,List(),Map(),Some(1709933099519)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-08 21:25:03 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 
0;
'Project [date_format(timestamp#91L, yyyy-MM-dd-HH, Some(Etc/UTC)) AS onetable_partition_col_HOUR_timestamp#135]
+- LocalRelation <empty>, [timestamp#91L, f0#92, f1#93, f2#94, f14#95, f11#96, f4#97, f13#98, f5#99, f6#100, f7#101L, f8#102L, f9#103, f10#104, f11#105, f12#106L, f13#107, f14#108, f15#109L, f16#110, f17#111L, f18#112]

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:190) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1122) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1121) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.foreach(List.scala:392) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.map(List.scala:298) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:182) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:205) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3734) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.select(Dataset.scala:1454) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateGeneratedColumns(GeneratedColumn.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$verifyNewMetadata$2(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:141) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:139) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:134) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:133) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:123) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata$(OptimisticTransaction.scala:402) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.verifyNewMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:382) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:296) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:289) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:284) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:259) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:215) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.completeSync(DeltaClient.java:198) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-08 21:25:03 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

ForeverAngry avatar Mar 08 '24 21:03 ForeverAngry

Can you confirm the field schema type in the Iceberg metadata and in the Parquet file metadata? I tried a local example with hour partitioning on a timestamp field and it worked. I'm wondering if the field is actually a long in the schema.

the-other-tim-brown avatar Mar 09 '24 04:03 the-other-tim-brown

@the-other-tim-brown so, I rebuilt the table again to make sure that there wasn't something odd with the partitions. It's odd: the only change I made when rebuilding the table was using uppercase PARTITION BY HOUR(timestamp), but ... now I'm getting a different error:

Output: 2024-03-09 00:06:16 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-09 00:06:17 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-09 00:06:19 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-09 00:06:19 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:57 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-09 00:06:20 INFO  org.apache.spark.sql.delta.DeltaLog:57 - Creating initial snapshot without metadata, because the directory is empty
2024-03-09 00:06:20 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=cafeb292-12a0-4a27-9382-d436a0977e46] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(ad0d3e83-149a-4904-afd8-8d655b49f1f8,null,null,Format(parquet,Map()),null,List(),Map(),Some(1709942780618)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-09 00:06:20 INFO  io.onetable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-09 00:06:21 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-09 00:06:22 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00000-585a7096-6160-4bc7-a3d9-a60634380092.metadata.json
2024-03-09 00:06:22 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.one_table_demo.bluvector_sensor_zeek_connection_log_dev_test_v3
2024-03-09 00:06:22 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.one_table_demo.bluvector_sensor_zeek_connection_log_dev_test_v3 snapshot 802574091683561956 created at 2024-03-08T22:31:16.976+00:00 with filter true
2024-03-09 00:06:23 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.one_table_demo.bluvector_sensor_zeek_connection_log_dev_test_v3, snapshotId=802574091683561956, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, 
source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.606986723S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=116}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=1}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=1}, skippedDataManifests=CounterResult{unit=COUNT, value=0}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=3635593429}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-09 00:06:23 ERROR io.onetable.utilities.RunSync:170 - Error running sync for s3://<redacted>
java.lang.NullPointerException: null
        at io.onetable.iceberg.IcebergPartitionValueConverter.toOneTable(IcebergPartitionValueConverter.java:117) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.iceberg.IcebergSourceClient.fromIceberg(IcebergSourceClient.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.iceberg.IcebergSourceClient.getCurrentSnapshot(IcebergSourceClient.java:147) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.extractor.ExtractFromSource.extractSnapshot(ExtractFromSource.java:35) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:177) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]


ForeverAngry avatar Mar 09 '24 18:03 ForeverAngry

@ForeverAngry can you grab the schema of the Iceberg table?

the-other-tim-brown avatar Mar 09 '24 18:03 the-other-tim-brown

@the-other-tim-brown Here it is, let me know if this wasn't what you were looking for!

{
  "format-version" : 2,
  "table-uuid" : "03e38c37-0ac5-4cae-a66f-49038931f265",
  "location" : "s3://<redacted>",
  "last-sequence-number" : 114,
  "last-updated-ms" : 1709948427780,
  "last-column-id" : 22,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "timestamp",
      "required" : false,
      "type" : "timestamp"
    }, {
      "id" : 2,
      "name" : "uid",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "sensor_name",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 4,
      "name" : "source_ip",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 5,
      "name" : "source_port",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 6,
      "name" : "destination_ip",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 7,
      "name" : "destination_port",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 8,
      "name" : "proto",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 9,
      "name" : "service",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 10,
      "name" : "duration",
      "required" : false,
      "type" : "double"
    }, {
      "id" : 11,
      "name" : "source_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 12,
      "name" : "destination_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 13,
      "name" : "conn_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 14,
      "name" : "local_source",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 15,
      "name" : "local_destination",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 16,
      "name" : "missed_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 17,
      "name" : "history",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 18,
      "name" : "source_pkts",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 19,
      "name" : "source_ip_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 20,
      "name" : "destination_pkts",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 21,
      "name" : "destination_ip_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 22,
      "name" : "tunnel_parents",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "timestamp_hour",
      "transform" : "hour",
      "source-id" : 1,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "history.expire.max-snapshot-age-ms" : "259200000",
    "athena.optimize.rewrite.min-data-file-size-bytes" : "50000000",
    "write.delete.parquet.compression-codec" : "GZIP",
    "write.format.default" : "PARQUET",
    "write.parquet.compression-codec" : "GZIP",
    "write.binpack.min-file-size-bytes" : "50000000",
    "write.object-storage.enabled" : "true",
    "write.object-storage.path" : "s3://<redacted>/data"
  },

ForeverAngry avatar Mar 09 '24 19:03 ForeverAngry

Thanks, this is what I was looking for. It lines up with my expectations but I'm not able to reproduce the same error that you saw. Are any of your partitions null or before 1970-01-01? Just trying to think through the common time-based edge cases here.

the-other-tim-brown avatar Mar 10 '24 00:03 the-other-tim-brown

@the-other-tim-brown So the table had some null values in the timestamp. I got rid of those. Now I'm back to the original error I started with.

The timestamps include milliseconds but do not have a timezone. Do you think the millisecond precision is an issue?

Also, can you tell me what jar versions you are using when you run your test? Maybe mine are causing an issue.

Here is the stack:

Output: 2024-03-10 02:42:51 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-10 02:42:51 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-10 02:42:53 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-10 02:42:54 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:57 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-10 02:42:55 INFO  org.apache.spark.sql.delta.DeltaLog:57 - Creating initial snapshot without metadata, because the directory is empty
2024-03-10 02:42:55 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=5ae55a82-f906-438d-8150-a4a8bbec1bc0] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(7a370e68-bd4f-4908-8893-6bd01b5969b3,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710038575593)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-10 02:42:55 INFO  io.onetable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-10 02:42:56 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-10 02:42:57 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00116-d628cb20-f502-445e-a87a-b0639b830259.metadata.json   
2024-03-10 02:42:58 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.titanflow_laker_trusted.bluvector_sensor_zeek_connection_log_dev
2024-03-10 02:42:58 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.titanflow_laker_trusted.bluvector_sensor_zeek_connection_log_dev snapshot 8863237097523144183 created at 2024-03-10T02:42:10.959+00:00 with filter true
2024-03-10 02:42:59 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.titanflow_laker_trusted.bluvector_sensor_zeek_connection_log_dev, snapshotId=8863237097523144183, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.98888014S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=456}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=71}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=58}, skippedDataManifests=CounterResult{unit=COUNT, value=13}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13005026723}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-10 02:42:59 INFO  org.apache.spark.sql.delta.DeltaLog:57 - No delta log found for the Delta table at s3://<redacted>/data/_delta_log
2024-03-10 02:42:59 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=7a370e68-bd4f-4908-8893-6bd01b5969b3] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(884081c4-c241-417c-858a-ef3714fa54dc,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710038579829)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-10 02:43:05 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;
'Project [date_format(timestamp#91L, yyyy-MM-dd-HH, Some(Etc/UTC)) AS onetable_partition_col_HOUR_timestamp#135]
+- LocalRelation <empty>, [timestamp#91L, uid#92, sensor_name#93, source_ip#94, source_port#95, destination_ip#96, destination_port#97, proto#98, service#99, duration#100, source_bytes#101L, destination_bytes#102L, conn_state#103, local_source#104, local_destination#105, missed_bytes#106L, history#107, source_pkts#108, source_ip_bytes#109L, destination_pkts#110, destination_ip_bytes#111L, tunnel_parents#112]

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:190) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1122) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1121) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.foreach(List.scala:392) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.map(List.scala:298) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:182) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:205) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3734) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.select(Dataset.scala:1454) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateGeneratedColumns(GeneratedColumn.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$verifyNewMetadata$2(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:141) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:139) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:134) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:133) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:123) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata$(OptimisticTransaction.scala:402) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.verifyNewMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:382) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:296) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:289) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:284) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:259) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:215) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.completeSync(DeltaClient.java:198) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-10 02:43:05 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

ForeverAngry avatar Mar 10 '24 02:03 ForeverAngry

@ForeverAngry can you confirm that you are on the latest main branch when testing? If so, can you rebuild off of this branch and test again? https://github.com/apache/incubator-xtable/pull/382 I think there is some issue in how we're handling timestamps with/without timezone set.

the-other-tim-brown avatar Mar 10 '24 03:03 the-other-tim-brown

@the-other-tim-brown I re-pulled the main branch again and rebuilt the jar. Note: I noticed that the run sync class changed from io.onetable.utilities.RunSync to org.apache.xtable.utilities.RunSync, just an FYI - I wasn't sure if this was expected or not. Anyway, it looks like I'm still getting the same error. Let me know what you think.

Output: 2024-03-10 13:06:33 INFO  org.apache.xtable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-10 13:06:35 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-10 13:06:39 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-10 13:06:40 WARN  org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-03-10 13:06:40 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:60 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-10 13:06:42 INFO  org.apache.spark.sql.delta.DeltaLog:60 - Creating initial snapshot without metadata, because the directory is empty
2024-03-10 13:06:44 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=37adbca2-cc8a-4d2f-a2b2-5a891368c9a5] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(608fba2e-531b-455c-8a92-e14f1086a557,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710076004533)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 13:06:44 INFO  org.apache.xtable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-10 13:06:45 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-10 13:06:46 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00116-d628cb20-f502-445e-a87a-b0639b830259.metadata.json   
2024-03-10 13:06:47 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.<redacted>
2024-03-10 13:06:47 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.<redacted> snapshot 8863237097523144183 created at 2024-03-10T02:42:10.959+00:00 with filter true
2024-03-10 13:06:48 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.<redacted>, snapshotId=8863237097523144183, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT1.291327918S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=456}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=71}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=58}, skippedDataManifests=CounterResult{unit=COUNT, value=13}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13005026723}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}   
2024-03-10 13:06:49 INFO  org.apache.spark.sql.delta.DeltaLog:60 - No delta log found for the Delta table at s3://<redacted>/data/_delta_log
2024-03-10 13:06:49 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=608fba2e-531b-455c-8a92-e14f1086a557] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(6b778d50-11e8-4490-9892-2d971f078703,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710076009117)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 13:06:59 ERROR org.apache.xtable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_format(timestamp, yyyy-MM-dd-HH)" due to data type mismatch: Parameter 1 requires the "TIMESTAMP" type, however "timestamp" has the type "BIGINT".; line 1 pos 0
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.Iterator.foreach(Iterator.scala:943) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$4(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$4$adapted(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.Stream.foreach(Stream.scala:533) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:163) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:160) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:156) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:146) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateColumnReferences(GeneratedColumn.scala:185) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.$anonfun$validateGeneratedColumns$2(GeneratedColumn.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.map(List.scala:293) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateGeneratedColumns(GeneratedColumn.scala:210) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$assertMetadata$2(OptimisticTransaction.scala:611) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:133) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:128) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:117) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:132) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:122) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:112) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.assertMetadata(OptimisticTransaction.scala:611) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.assertMetadata$(OptimisticTransaction.scala:583) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.assertMetadata(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:565) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:395) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:379) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:372) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:261) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.delta.DeltaClient.completeSync(DeltaClient.java:199) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-10 13:06:59 ERROR org.apache.xtable.client.OneTableClient:133 - Sync failed for the following formats DELTA

ForeverAngry avatar Mar 10 '24 13:03 ForeverAngry
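
For anyone else hitting the DATATYPE_MISMATCH above: the stack trace shows Delta's generated-column validation rejecting the partition expression date_format(timestamp, yyyy-MM-dd-HH) because the source column comes through as BIGINT (epoch millis) rather than TIMESTAMP. A minimal sketch of the same analyzer behaviour, assuming a local SparkSession and a hypothetical epoch-millis column also named timestamp (this is not xtable's own sync code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

// Minimal sketch only; the data and column below are hypothetical stand-ins.
val spark = SparkSession.builder().master("local[*]").appName("partition-expr-check").getOrCreate()
import spark.implicits._

val df = Seq(1710076009117L).toDF("timestamp") // epoch millis stored as BIGINT

// Shape of the expression the Delta analyzer rejects above:
// date_format requires a TIMESTAMP/DATE input, so a BIGINT column fails analysis.
// df.select(date_format(col("timestamp"), "yyyy-MM-dd-HH")).show()

// The expression resolves once the long is converted to a real timestamp first
// (casting a numeric to timestamp treats it as seconds, hence the division).
df.select(date_format((col("timestamp") / 1000).cast("timestamp"), "yyyy-MM-dd-HH")).show()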

@ForeverAngry the change in package paths (io.onetable → org.apache.xtable) is expected and required as we transition this project to the Apache Software Foundation. We will need to do a full revamp of the docs/references once this process is complete.

Did you try building off of the branch I linked as well?

the-other-tim-brown avatar Mar 10 '24 14:03 the-other-tim-brown

@the-other-tim-brown Well, I feel a bit foolish - for some reason I thought you only wanted me to re-pull and test the main branch. Sorry, I'll pull the branch you linked and let you know!

Again, I appreciate the dialog and support - you're awesome!

ForeverAngry avatar Mar 10 '24 14:03 ForeverAngry

@the-other-tim-brown Okay, I rebuilt the jar using the branch you provided; here is the stack trace:

Output: 2024-03-10 15:59:14 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-10 15:59:15 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-10 15:59:21 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-10 15:59:22 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:60 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-10 15:59:23 INFO  org.apache.spark.sql.delta.DeltaLog:60 - Creating initial snapshot without metadata, because the directory is empty
2024-03-10 15:59:25 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=f5cd0f71-bf22-4b57-8447-b5520c7b7d99] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(23f44368-3cd6-451f-af9a-d87c7cd8ee9a,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710086365401)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 15:59:25 INFO  io.onetable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-10 15:59:26 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-10 15:59:27 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00116-d628cb20-f502-445e-a87a-b0639b830259.metadata.json   
2024-03-10 15:59:27 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.<redacted>
2024-03-10 15:59:27 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.<redacted> snapshot 8863237097523144183 created at 2024-03-10T02:42:10.959+00:00 with filter true
2024-03-10 15:59:29 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.<redacted>, snapshotId=8863237097523144183, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT1.189127061S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=456}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=71}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=58}, skippedDataManifests=CounterResult{unit=COUNT, value=13}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13005026723}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-10 15:59:29 INFO  org.apache.spark.sql.delta.DeltaLog:60 - No delta log found for the Delta table at s3://<redacted>/data/_delta_log
2024-03-10 15:59:29 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=23f44368-3cd6-451f-af9a-d87c7cd8ee9a] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(f152cc9d-0cd4-4692-bc95-68866870b6ef,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710086369624)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 15:59:29 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
io.onetable.exception.NotSupportedException: Unsupported type: timestamp_ntz
        at io.onetable.delta.DeltaSchemaExtractor.toOneSchema(DeltaSchemaExtractor.java:274) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaSchemaExtractor.lambda$toOneSchema$9(DeltaSchemaExtractor.java:203) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_342]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_342]
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_342]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_342]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_342]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_342]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_342]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[?:1.8.0_342]
        at io.onetable.delta.DeltaSchemaExtractor.toOneSchema(DeltaSchemaExtractor.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaSchemaExtractor.toOneSchema(DeltaSchemaExtractor.java:145) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.addColumn(DeltaClient.java:242) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$200(DeltaClient.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.syncPartitionSpec(DeltaClient.java:170) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:158) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-10 15:59:29 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

ForeverAngry avatar Mar 10 '24 16:03 ForeverAngry
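
The Unsupported type: timestamp_ntz error in this second run is a different failure: the schema handed to the Delta schema extractor contains Spark's TimestampNTZType, which typically corresponds to an Iceberg timestamp column declared without a time zone, and the extractor at this point did not yet map it. A quick way to see which columns are involved, assuming an existing SparkSession whose Iceberg/Glue catalog is registered under the placeholder name glue_catalog (catalog, database, and table names below are hypothetical):

import org.apache.spark.sql.types.TimestampNTZType // available in Spark 3.3+

// Placeholder identifiers; substitute your own Glue database and table.
val schema = spark.table("glue_catalog.my_db.my_table").schema
val ntzColumns = schema.fields.collect { case f if f.dataType == TimestampNTZType => f.name }
println(s"Columns surfaced as timestamp_ntz: ${ntzColumns.mkString(", ")}")

As the next comment notes, the linked branch was subsequently updated to handle this type, so rebuilding from it is the straightforward way past the error.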

@ForeverAngry I missed a spot, but I've just updated the branch to handle the error you are seeing.

the-other-tim-brown avatar Mar 11 '24 00:03 the-other-tim-brown