hive HIVE-28028: Remove duplicated proto reader/writer classes introduced in HIVE-19288

What changes were proposed in this pull request?

Please check JIRA comments for discussion HIVE-28028

Why are the changes needed?

There are duplicate java files present in hive which are already present in apache tez. I checked the diff between these files from hive and tez and found there are few improvement in TEZ (like TEZ-4296, TEZ-4105, TEZ-4305)

Does this PR introduce any user-facing change?

NO

Is the change a dependency upgrade?

NO

How was this patch tested?

On local mac

Feb 10 '24 06:02 Aggarwal-Raghav

Quality Gate passed

Issues
0 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Feb 10 '24 07:02 sonarqubecloud[bot]

Just for clarification. I recently reviewed #5036, which would rely on these files. I expect this PR will not require #5036 to be modified as the class in Apache Tez has the same fully qualified class name and it is compatible with the one in the Apache Hive repo.

Feb 11 '24 04:02 okumin

Just for clarification. I recently reviewed #5036, which would rely on these files. I expect this PR will not require #5036 to be modified as the class in Apache Tez has the same fully qualified class name and it is compatible with the one in the Apache Hive repo.

Thanks for the review @okumin. Yes, the reason for this PR is to remove redundant code from Hive. After removal of these files, the functionality should work as is. Hence no changes required in #5036

Feb 11 '24 05:02 Aggarwal-Raghav

@abstractdog, can you please help review this?

Feb 20 '24 07:02 Aggarwal-Raghav

@ayushtkn, If I remove all the exclusions from the tez-protobuf-history-plugin, then no log4j or any hadoop dependency is coming. Attaching the dependency tree for reference. So, should we keep exclusions just to be on the safe side?

hive-unit_dependency_tree.txt ql_dependency_tree.txt

Mar 13 '24 17:03 Aggarwal-Raghav

Have updated the PR, considered the suggestions from Laszlo, and have excluded hadoop dependencies which are used in https://github.com/apache/tez/blob/rel/release-0.10.3/tez-plugins/tez-protobuf-history-plugin/pom.xml

Mar 13 '24 18:03 Aggarwal-Raghav

I think you haven't all the excluded dependencies, earlier you had hadoop-mapreduce-client-core and others excluded in ql/pom.xml but now in the parent pom, they aren't excluded, why?

Will add the rest

Mar 14 '24 05:03 Aggarwal-Raghav

The exercise to move all dependencies to hive project level common pom.xml might require more work. There already seems to be a lot of tez related dependencies that are specified in the hive project root pom.xml with no exclusions currently. The exclusions are specifically in the ql module pom.xml, the one place that i saw. moving them up might lead to change in state of other places these dependencies are included and i think should be a larger exercise and we should create another jira for the same with more experts with runtime expertise taking that forward.

For the sake of being optimist, @Aggarwal-Raghav lets just give it a shot if moving them up is easy and build runs else we can create another jira to get that in.

@abstractdog / @ayushtkn The goal here is to remove common transitive dependencies from tez also in hive i suppose and prevent further classpath inconsistencies ?

Mar 14 '24 07:03 anishek

Moving all the dependencies will be a big refactor, we should definitely do it as part of a new jira, not in this. The only dependency we are adding here is "tez-protobuf-history-plugin", I think @abstractdog mentioned that we move that to the main pom & have the exclusions for it defined in the main pom, not the existing ones, for the existing ones we can create a followup if needed.

if for the "new" dependency being added here, if defining it & its exclusion in the main pom doesn't work, I am ok with following the old approach of independently defining it is ql & itests and the exclusions there.

Mar 14 '24 07:03 ayushtkn

I believe we're quite close, also I admit that we don't have a clear policy regarding the tez dependencies and where to exclude hadoop stuff from there, so I believe, without refactoring the whole pom chaos, we should still do here what makes the most sense, which is I tried to suggest, so:

include a tez dependency with all its exclusions in the root pom.xml's dependency management
refer to it from child pom.xml where it's needed

+1 ignore detailed history now, just exclude the typical ones:

      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-common</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-client</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>

does this make sense?