
Build and test hive-metastore with Hive 2, 3 and 4 with a single source set

wypoon opened this issue 7 months ago · 39 comments

We define new modules, hive3-metastore and hive4-metastore, that depend on Hive 3.1.3 and Hive 4.0.1 respectively for their Hive dependencies. The existing hive-metastore module continues to depend on Hive 2.3.10.

We use a single source set for all three modules. To achieve this, we work around HIVE-27925, which introduced a backward incompatibility, by using configuration property names instead of HiveConf.ConfVars enums. We also have to work extra hard to arrive at a single version of TestHiveMetastore that works with Hive 2, 3, and 4. TestHiveMetastore is a test utility class that stands up an HMS for testing the HiveCatalog.
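To illustrate the workaround, here is a hypothetical sketch (not Iceberg code): the `Configuration` class below is a minimal stand-in for Hadoop's Configuration. The point is that HIVE-27925 changed the HiveConf.ConfVars enum in Hive 4, while the string property names stayed stable across versions:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfNames {
  // Minimal stand-in for Hadoop's Configuration, for illustration only.
  static class Configuration {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Instead of referencing HiveConf.ConfVars.METASTOREURIS, which cannot
    // work against both Hive 2/3 and Hive 4 after HIVE-27925 changed the
    // enum, we use the stable string property name directly.
    conf.set("hive.metastore.uris", "thrift://localhost:9083");
    System.out.println(conf.get("hive.metastore.uris"));
  }
}
```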

For this PR, we do not touch other modules (such as Flink and Spark modules) that depend on TestHiveMetastore for testing. Those modules continue to depend on hive-metastore, built with Hive 2.3.10.

NOTE: This is an alternative to https://github.com/apache/iceberg/pull/12681.

wypoon avatar Apr 04 '25 06:04 wypoon

@wypoon: Could you please highlight the difference that is needed for TestHiveClientPool?

pvary avatar Apr 07 '25 09:04 pvary

@wypoon: Could we add a CI change which runs the HMS tests against Hive 3 and Hive 4?

pvary avatar Apr 10 '25 10:04 pvary

@wypoon Since we are leaning towards this approach, could you please update the PR description to summarize the changes here?

manuzhang avatar Apr 10 '25 13:04 manuzhang

@wypoon: Could we add a CI change which runs the HMS tests against Hive 3 and Hive 4?

They are already run by the CI. I did not change the existing GitHub workflows. The java-ci (core-tests) job runs all non-Flink, non-Spark, non-Kafka-connect tests. Since hive3-metastore and hive4-metastore are top-level modules, they get built and their tests get run.

See, e.g., https://github.com/apache/iceberg/actions/runs/14386854404/job/40344302132?pr=12721 (before it expires).

wypoon avatar Apr 10 '25 16:04 wypoon

@wypoon: We are adding a few modules to the project. It would be prudent to send a mail about it to the dev list, but I think we are good to go.

Thanks, Peter

pvary avatar Apr 11 '25 13:04 pvary

LGTM except for this description.

We keep all of the source files the same for all three modules.

I'd suggest "All three modules share a single source set." to be less ambiguous.

manuzhang avatar Apr 11 '25 14:04 manuzhang

@pvary thanks for all your help. @manuzhang thanks for your feedback too.

I sent an update to the dev list.

wypoon avatar Apr 11 '25 16:04 wypoon

@wypoon Is this adding new artifacts as well? If the purpose is to have a single source set and validate across versions, we shouldn't be producing artifacts across all versions.

danielcweeks avatar Apr 11 '25 18:04 danielcweeks

@wypoon Is this adding new artifacts as well? If the purpose is to have a single source set and validate across versions, we shouldn't be producing artifacts across all versions.

@danielcweeks thank you for reviewing. You mean that we should avoid producing iceberg-hive3-metastore-xxx.jar and iceberg-hive4-metastore-xxx.jar? (as they should contain the same classes as iceberg-hive-metastore-xxx.jar) And your concern is that we should not publish them?

wypoon avatar Apr 11 '25 19:04 wypoon

I have updated the build to not produce jars for hive3-metastore and hive4-metastore.

wypoon avatar Apr 11 '25 22:04 wypoon
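One way the build could skip producing those jars, sketched here under the assumption that the standard `jar` and Maven publish tasks are in play (the task wiring in the actual Iceberg build may differ):

```groovy
// Hypothetical sketch: keep compiling and testing the extra modules,
// but disable their jar and publish tasks so no artifacts are produced.
project(':iceberg-hive3-metastore') {
  tasks.named('jar') { enabled = false }
  tasks.withType(PublishToMavenRepository).configureEach { enabled = false }
}
```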

Hey @wypoon, I think we're approaching this the wrong way. Rather than defining new projects to represent Hive 3 and 4, we can just run the tests against those dependencies by defining new dependency configurations.

In the configurations section, just add:

  configurations {
    testImplementation.extendsFrom compileOnly

    . . .

    testArtifacts
    testHive2
    testHive3
    testHive4
  }

Then just define the dependencies for these configurations in the current hive project:

project(':iceberg-hive-metastore') {

  dependencies {

    testHive2("${libs.hive2.exec.get().module}:${libs.hive2.exec.get().getVersion()}:core") {

    . . .

    testHive3("${libs.hive3.exec.get().module}:${libs.hive3.exec.get().getVersion()}:core") {
      exclude group: 'org.apache.avro', module: 'avro'
      exclude group: 'org.slf4j', module: 'slf4j-log4j12'
      exclude group: 'org.pentaho' // missing dependency
      exclude group: 'org.apache.hive', module: 'hive-llap-tez'
      exclude group: 'org.apache.logging.log4j'
      exclude group: 'com.google.protobuf', module: 'protobuf-java'
      exclude group: 'org.apache.calcite'
      exclude group: 'org.apache.calcite.avatica'
      exclude group: 'com.google.code.findbugs', module: 'jsr305'
    }

    testHive3(libs.hive3.metastore) {
      exclude group: 'org.apache.avro', module: 'avro'
      exclude group: 'org.slf4j', module: 'slf4j-log4j12'
      exclude group: 'org.pentaho' // missing dependency
      exclude group: 'org.apache.hbase'
      exclude group: 'org.apache.logging.log4j'
      exclude group: 'co.cask.tephra'
      exclude group: 'com.google.code.findbugs', module: 'jsr305'
      exclude group: 'org.eclipse.jetty.aggregate', module: 'jetty-all'
      exclude group: 'org.eclipse.jetty.orbit', module: 'javax.servlet'
      exclude group: 'org.apache.parquet', module: 'parquet-hadoop-bundle'
      exclude group: 'com.tdunning', module: 'json'
      exclude group: 'javax.transaction', module: 'transaction-api'
      exclude group: 'com.zaxxer', module: 'HikariCP'
    }

    testHive4(libs.hive4.metastore) {

    . . . 

Then add the test tasks:

  tasks.register('testHive2', Test) {
    classpath = sourceSets.test.runtimeClasspath + configurations.testHive2
    testClassesDirs = sourceSets.test.output.classesDirs
    description = 'Run tests with Hive 2 dependencies'
  }

  tasks.register('testHive3', Test) {
    classpath = sourceSets.test.runtimeClasspath + configurations.testHive3
    testClassesDirs = sourceSets.test.output.classesDirs
    description = 'Run tests with Hive 3 dependencies'
  }

  tasks.register('testHive4', Test) {
    classpath = sourceSets.test.runtimeClasspath + configurations.testHive4
    testClassesDirs = sourceSets.test.output.classesDirs
    description = 'Run tests with Hive 4 dependencies'
  }

This makes it a lot cleaner and means we don't have to worry about weird workarounds for the additional two projects.

danielcweeks avatar Apr 14 '25 18:04 danielcweeks

@danielcweeks thank you for looking into this. I have considered something along the lines of your suggestion before. I implemented something similar in https://github.com/apache/iceberg/pull/2512, following the input of the community at the time, when we first added support for Spark 3.1. The idea was to build using Spark 3.0 but also test using Spark 3.1, by defining a configuration within the module. I believe this is your idea here as well: build using Hive 2, and test using Hive 2, 3, and 4.

The problem with that was that by not building against Spark 3.1, we failed to discover an issue: the RepartitionByExpression constructor changed signature in 3.1, and I had failed to replace one occurrence of RepartitionByExpression with the necessary reflection in a code path that was not exercised by tests. Subsequently, someone tried to add tests that exercised that code path, and I had to put up https://github.com/apache/iceberg/pull/2954. As you know, the community later decided to follow a different approach when the time came to support Spark 3.2.

Considering this, I really think it is best to build the module against all versions and test accordingly. While I may have flushed out issues in the process of working through this PR to achieve a source set that builds and runs successfully with Hive 2, 3, and 4, that is no guarantee that new issues won't manifest when, for example, we upgrade the Hive 4 version; building against it would catch them.

(That said, I tried to follow the suggestion you outlined, but it didn't quite work as-is, and my attempts to fix it have not succeeded so far, due to my limited Gradle knowledge. I'm continuing to look into it.)

wypoon avatar Apr 16 '25 23:04 wypoon

@wypoon I was able to get this up and running based on your branch, so I think we can make it work.

As for the comparison to the Spark distribution, I don't think we want to go that route. I would rather we rely on the backward compatibility of the metastore or start dropping support for older versions as opposed to starting to produce multiple binaries that we need to maintain.

danielcweeks avatar Apr 17 '25 21:04 danielcweeks

@danielcweeks I think you misunderstand me. I'm not advocating doing what the Spark and Flink modules now do, which is separate source sets and artifacts for each version. I have worked hard to get the hive-metastore module to build successfully against Hive 2, 3, and 4 using a single source set. I think we should maintain a single source set if possible, but we should build, not just test, against Hive 2, 3, and 4. And I do agree about not producing multiple binaries; I have already avoided doing that.

My point in bringing up what we did with Spark 3.0 and 3.1 is that by only building against 3.0 and not 3.1, we failed to discover an incompatibility that would have been found if we were building against both 3.0 and 3.1.

wypoon avatar Apr 17 '25 23:04 wypoon

@wypoon I appreciate the effort put into getting this working, but I'm concerned about the complexity and unintended impacts this approach may have.

I see the goal here as ensuring that we have runtime compatibility, not source-level compatibility, across the different Hive versions. If we want both source and runtime compatibility, we should just produce separate artifacts and have separate sources/builds (not try to manage the same source across subtly incompatible versions). I don't think we want to go this route.

For example, I did a quick test to see what impact this would have on publishing and it causes publish to fail:

* What went wrong:
Execution failed for task ':iceberg-hive3-metastore:publishApachePublicationToMavenLocal'.
> Failed to publish publication 'apache' to repository 'mavenLocal'
   > Artifact iceberg-hive3-metastore-1.8.0-SNAPSHOT.jar wasn't produced by this build.

I think @pvary's comment hits on some of the same concerns.

I'm happy to help get this up and running with test dependencies.

danielcweeks avatar Apr 18 '25 17:04 danielcweeks

@danielcweeks we are in agreement that we should produce a single set of artifacts for the hive-metastore module, with runtime compatibility across supported Hive versions. And I agree that the failure in publish is a problem. (The approach we took when we supported Spark 3.0 and 3.1 with a single set of artifacts was also meant to provide runtime compatibility. I would just point out that the issue that was later discovered was a runtime compatibility issue; it was not discovered because there were no tests exercising the code path where the incompatibility occurred. However, it could have been discovered sooner if we had built against Spark 3.1.) I'd appreciate your help if you could open a PR against my branch with the Gradle changes that you were able to get working.

However, I'd propose that instead of building against Hive 2, we should build against Hive 4.0, and test against Hive 2.3, 3.1, and 4.0. Hive 2.3 and 3.1 are both EOL; Hive 4 is what will continue to change. By building against Hive 4, rather than an EOL version that won't change, we will catch any potential future breakage. This should also confirm that we do in fact have the backward compatibility that we plan to rely on, as the Flink and Spark modules currently depend on the hive-metastore module and on Hive 2.3. What do you think? (@pvary your thoughts too?)

wypoon avatar Apr 22 '25 01:04 wypoon

@danielcweeks I rebased on main. When you find the time, please open a PR against my branch or otherwise post the gradle changes you're suggesting. Thanks.

wypoon avatar Apr 25 '25 05:04 wypoon

I think it is a good direction to build a single iceberg-hive-metastore artifact and test it against Hive 2.3.10/3.1.2/4.0.1. We could upgrade to Hive 4, as the dependency is only a compile-time dependency, but only if the tests run against all of the supported Hive versions.

pvary avatar Apr 30 '25 11:04 pvary

@wypoon I was out for a bit, but back looking at this. I'm fine with updating to use Hive 4 as the primary target and test against the other versions. Let me see if I can get a PR running against your branch.

danielcweeks avatar May 01 '25 16:05 danielcweeks

Thanks @danielcweeks! It's also fine if you just keep Hive 2 as the primary target in your change (as we know Hive 2 works) and I can then switch it to Hive 4 and iron out any further issues that may arise.

wypoon avatar May 01 '25 16:05 wypoon

Hey @wypoon, I spent some time working on getting the test-only approach working, but ran into some other issues that I'm worried are even bigger problems for our Hive compatibility story.

The issue is that while we are source compatible across Hive 2/3/4 (the code compiles), our artifacts and test artifacts are not binary compatible across these versions. That means the original approach of having subprojects that all compile and test against a single source tree hides the binary incompatibility. Since we are not producing separate artifacts, there's no guarantee that the binaries produced will work in a Hive 3/4 environment.

The main issue in Hive 3/4 is in HiveSchemaUtil, which results in the following error if you use the binary artifacts across versions:

java.lang.NoSuchMethodError: 'java.util.ArrayList org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo.getAllStructFieldNames()'
	at org.apache.iceberg.hive.HiveSchemaConverter.convertType(HiveSchemaConverter.java:138)
	at org.apache.iceberg.hive.HiveSchemaConverter.convertInternal(HiveSchemaConverter.java:71)
	at org.apache.iceberg.hive.HiveSchemaConverter.convert(HiveSchemaConverter.java:55)
	at org.apache.iceberg.hive.HiveSchemaUtil.convert(HiveSchemaUtil.java:78)
	at org.apache.iceberg.hive.HiveSchemaUtil.convert(HiveSchemaUtil.java:56)

So neither approach we use to test actually provides compatibility, beyond knowing that if you compiled for a specific version of Hive, the binary artifacts would work with that version.

danielcweeks avatar May 06 '25 19:05 danielcweeks

@danielcweeks thanks for the update. In this case, it appears that org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo.getAllStructFieldNames() changed its return type from java.util.ArrayList to java.util.List in Hive 4.0. Given that binary compatibility is not guaranteed if the hive-metastore module is compiled against one major version of Hive and run against a different major version, should we build and publish separate artifacts for each major version we support?

I know that we bundle the hive-metastore classes in the Flink and the Spark runtime JARs. In https://github.com/apache/iceberg/pull/12693, I had verified that Flink works just fine running against hive-metastore built with Hive 4. So, if the Flink Iceberg user base is amenable, we can ship hive-metastore built with Hive 4 in the Flink runtime JARs. Spark does not currently support running against a Hive 4 HMS, and still uses Hive 2.3 in its built-in Hive support, so we can ship hive-metastore built with Hive 2 in the Spark runtime JARs.

What are your thoughts? And @pvary?

wypoon avatar May 06 '25 22:05 wypoon

@pvary, to clarify: In https://github.com/apache/iceberg/pull/12693, we have 3 modules, hive-metastore, hive3-metastore and hive4-metastore, built against Hive 2, 3 and 4 respectively. By "I had verified that Flink works just fine running against hive-metastore built with Hive 4", I mean that in the Flink modules, I changed the dependency on hive-metastore to hive4-metastore and the Hive dependencies to Hive 4, and the Flink tests all passed.

wypoon avatar May 07 '25 16:05 wypoon

On reflection, I agree that if we do build and publish multiple versions of iceberg-hive-metastore based on Hive version, it makes sense to remove the iceberg-hive-metastore classes from the runtime JARs; users would then need to add the version of iceberg-hive-metastore they want to use with the engine. This gives them flexibility, but they lose the convenience of a single JAR.

wypoon avatar May 07 '25 16:05 wypoon

This means we have the following options:

  1. Limit the supported Hive versions.
  2. Create hive-metastore jars for the different Hive versions, and remove them from the runtime jars.
  3. Release a matrix of flink-runtime/spark-runtime jars with different Hive versions.

I think we should do 2. However, I would defer the second part of 2 to a separate change; at the least, I'd seek input on the dev list before implementing that second part. I'm going to update this PR. However, there is a conflict introduced by Huaxin's implementation of Spark 4 support (there is a change in TestHiveMetastore.java), and I need to rebase on main. I learned, though, that she plans to revert her commit in order to redo it as 3 commits (to preserve commit history in the files for the most recent Spark version), so I'm having to hold off.

wypoon avatar May 08 '25 20:05 wypoon

@pvary I have rebased on main. I have restored building jars for the hive3-metastore and hive4-metastore modules. For now, I have left the Flink and Spark runtime jars unchanged. @danielcweeks are you ok with the approach, given the incompatibility you found when you tried to use a hive-metastore module built against a single Hive version?

wypoon avatar May 15 '25 06:05 wypoon

I think we should take this back to the dev list and discuss, since we've gone from just testing to producing new artifacts. This could potentially affect other areas of the project that bundle Hive (like Kafka Connect).

danielcweeks avatar May 15 '25 22:05 danielcweeks

@danielcweeks are you ok with this as it stands now? This will produce 3 sets of jars, and we continue keeping the classes built against Hive 2 in the runtime jars of engines. When we are ready for 2.0, we will remove the classes from the runtime jars.

wypoon avatar May 23 '25 00:05 wypoon

@pvary @danielcweeks can we move forward now with this given the discussion in last week's community sync?

wypoon avatar May 28 '25 01:05 wypoon

it appears that org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo.getAllStructFieldNames() has changed its return type from java.util.ArrayList to java.util.List in Hive 4.0.

@wypoon this specific case can be addressed by reflective invocation; has this been discussed/tried?

pan3793 avatar Jun 10 '25 02:06 pan3793
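The reflective invocation pan3793 suggests could look roughly like the sketch below. FakeStructTypeInfo is a hypothetical stand-in for org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo (which the real fix would target); the point is that Method.invoke() returns Object, so no return-type descriptor is baked into the caller's bytecode:

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ReflectiveFieldNames {
  // Hypothetical stand-in for Hive's StructTypeInfo, only for this sketch.
  static class FakeStructTypeInfo {
    public ArrayList<String> getAllStructFieldNames() {
      return new ArrayList<>(List.of("id", "data"));
    }
  }

  @SuppressWarnings("unchecked")
  static List<String> allStructFieldNames(Object structTypeInfo) {
    try {
      // A direct call compiled against Hive 2/3 bakes the ArrayList return
      // type into the caller's method descriptor, and so throws
      // NoSuchMethodError on Hive 4, which declares List. Looking the method
      // up by name and using invoke() sidesteps the descriptor entirely.
      Method m = structTypeInfo.getClass().getMethod("getAllStructFieldNames");
      return (List<String>) m.invoke(structTypeInfo);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(allStructFieldNames(new FakeStructTypeInfo()));
    // prints [id, data]
  }
}
```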