hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28373: Iceberg: Refactor the code of HadoopTableOptions

Open BsoBird opened this issue 1 year ago • 19 comments

PR: core:Refactor the code of HadoopTableOptions https://github.com/apache/iceberg/pull/10623 All test cases can be found in this PR HIVE-ISSUE: https://issues.apache.org/jira/browse/HIVE-28373

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

BsoBird avatar Jul 15 '24 09:07 BsoBird

@BsoBird, add original iceberg class to the excludes list

<excludes>
    **/HadoopInputFile.class
+   **/HadoopTableOperations.class
    **/SerializableTable.class
</excludes>

deniskuzZ avatar Jul 15 '24 10:07 deniskuzZ

@BsoBird, add original iceberg class to the excludes list

<excludes>
    **/HadoopInputFile.class
+   **/HadoopTableOperations.class
    **/SerializableTable.class
</excludes>

done

BsoBird avatar Jul 15 '24 10:07 BsoBird

can you explain the changes & the issues being solved here? Is there a possibility to extend some tests, if it is fixing any bug or so?

ayushtkn avatar Jul 15 '24 13:07 ayushtkn

can you explain the changes & the issues being solved here? Is there a possibility to extend some tests, if it is fixing any bug or so?

This PR solves the reliability problem of hadoopcatalog.With this PR. hadoopcatalog commits are atomic operations.

I submitted this PR to the iceberg community, but the iceberg community did not accept this PR, they don't want to support FileSystemCatalog further.

But HIVE has scenarios where external tables are created using hadoop_catalog, so I think at least this PR needs to be merged into HIVE.

All test cases can be viewed in the ICEBERG Community PR. https://github.com/apache/iceberg/pull/10623/files#diff-5eae15702389a4a8e2eff932cd5567ac53d68941833556e52c688de77a03fa4e

BsoBird avatar Jul 15 '24 14:07 BsoBird

All test cases can be viewed in the ICEBERG Community PR. https://github.com/apache/iceberg/pull/10623/files#diff-5eae15702389a4a8e2eff932cd5567ac53d68941833556e52c688de77a03fa4e

there are test files for Iceberg in Hive code also, you can port this test in Hive, maybe in this directory: https://github.com/apache/hive/tree/master/iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive

ayushtkn avatar Jul 16 '24 05:07 ayushtkn

@ayushtkn Sir, I tried to port the test cases, but I found out that we are placing the iceberg patch in the patch directory, which has to be compiled for the patch to take effect, and it is not possible to test the patch separately.But this patch is very difficult to reproduce the bug directly from outside via SQL. So, what should i do?

BsoBird avatar Jul 17 '24 07:07 BsoBird

build failed with indentation checks - checkstyle stuff

as for tests, can't we also move them to patched modules as well?

deniskuzZ avatar Jul 23 '24 14:07 deniskuzZ

build failed with indentation checks - checkstyle stuff

as for tests, can't we also move them to patched modules as well?

Sir, it's difficult to test this class directly in the HIVE code because of the special compilation method we use to integrate the patch into HIVE. But let me try. I'll also fix the checkstyle related problems.

BsoBird avatar Jul 23 '24 14:07 BsoBird

@zhangbutao @deniskuzZ @ayushtkn Sir. I submitted the relevant patch and ported the relevant test-case. Also, this PR may rely on a modification of HIVE-28421.

BsoBird avatar Aug 01 '24 05:08 BsoBird

@deniskuzZ Sir, when this PR is merged, can you retest HIVE-28366 with hadoopcatalog?

BsoBird avatar Aug 02 '24 09:08 BsoBird

@BsoBird, have you considered using HiveCatalog (backed by HMS) instead of HadoopCatalog? Wonder if there were some restrictions or user specific use-cases.

deniskuzZ avatar Aug 03 '24 16:08 deniskuzZ

@BsoBird, have you considered using HiveCatalog (backed by HMS) instead of HadoopCatalog? Wonder if there were some restrictions or user specific use-cases.

@deniskuzZ Hi. We can't use HMScatalog to replace fsCatalog at the moment, because in our production environment, we are currently running multiple versions of HIVE/SPARK at the same time. HIVE3+SPARK serves some of the users, and HIVE4 serves some others. But they need to share the same ICEBERG data.So, we have to choose a lightweight catalog for sharing data between different computing engines.

BsoBird avatar Aug 05 '24 02:08 BsoBird

@deniskuzZ

In order to ensure the success of upgrading from HIVE3 to HIVE4, we migrate all the base data from HIVE to ICEBERG.(Data is not dependent on compute engine management) In this way, we can avoid the failure of upgrading HIVE, which may lead to a series of other problems such as data loss.I believe this is the preferred option for most users to upgrade their HIVE version or other computing engines in the future.

I'm sure that other users have encountered problems similar to mine when upgrading to HIVE4. For example: Trino/spark/starRocks/impala and other third-party systems rely on a lower version of HIVE, but HIVE itself has the need to upgrade to version 4.X. This will result in multiple versions of HIVE coexisting. On this basis, the only way to share the same iceberg data is to choose a non-HMS-CATALOG implementation. In order to reduce the complexity of the component stack, choosing hadoopCatalog is of course the first option to consider.

(tips: I'm not sure I'd consider using rest-catalog at the moment. First of all, the rest client is not only slow to fetch the full amount of data using paging, but it can also lose data. The data can't be consolidated in a single metadata system, and it lacks a reliable official implementation.)

BsoBird avatar Aug 05 '24 02:08 BsoBird

@BsoBird, thanks for the detailed explanation. Hive-3 & Hive-4 could have schema incompatibility some unrelated test failures, restarted the build

deniskuzZ avatar Aug 05 '24 08:08 deniskuzZ

HIVE-28421 has been merged . I think we should rebase to run the tests right?

simhadri-g avatar Aug 27 '24 08:08 simhadri-g

HIVE-28421 has been merged . I think we should rebase to run the tests right?

I've restarted the build

deniskuzZ avatar Aug 27 '24 09:08 deniskuzZ

tests passed, however, sonar reported a number of issues that need to be resolved cc @BsoBird

deniskuzZ avatar Aug 28 '24 10:08 deniskuzZ

@deniskuzZ done

BsoBird avatar Aug 29 '24 06:08 BsoBird

@deniskuzZ Sir, if you have time, could you recheck this PR?Tks.

BsoBird avatar Sep 04 '24 04:09 BsoBird

Tks!

BsoBird avatar Sep 11 '24 14:09 BsoBird