[HUDI-6760] Add SelfDescribingInputFormatInterface for hive FileInputFormat

Open Zouxxyy opened this issue 2 years ago • 13 comments

Change Logs

Currently, after performing schema evolution with spark-sql, querying the table with Hive fails:

-- spark-sql
set hoodie.schema.on.read.enable=true;

create table hudi_mor_test_tbl (
  id bigint,
  name string,
  ts int,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
)
partitioned by (dt, hh);

insert into hudi_mor_test_tbl values (1, 'a1', 1001, '2021-12-09', '10');

ALTER TABLE hudi_mor_test_tbl ALTER COLUMN ts TYPE bigint;

-- hive
select * from hudi_mor_test_tbl_rt;

Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable

The root cause is that FileInputFormat does not implement SelfDescribingInputFormatInterface, which Hive defines as:

/**
 * Marker interface to indicate a given input format is self-describing and
 * can perform schema evolution itself.
 */
public interface SelfDescribingInputFormatInterface {

}
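Because the interface has no methods, implementing it only tags the input format; the behavior change comes from the caller checking for the tag. The following is a minimal, self-contained sketch of that marker-interface pattern. The names `SelfDescribingMarker`, `PlainInputFormat`, `EvolvingInputFormat`, and `isSelfDescribing` are hypothetical stand-ins, and the shape of the check is an assumption rather than a copy of Hive's code.

```java
public class MarkerInterfaceSketch {

    /** Hypothetical stand-in for Hive's SelfDescribingInputFormatInterface (no methods). */
    interface SelfDescribingMarker { }

    /** Hypothetical input format that has not opted in. */
    static class PlainInputFormat { }

    /** Hypothetical input format that opts in simply by implementing the marker. */
    static class EvolvingInputFormat implements SelfDescribingMarker { }

    /**
     * Assumed caller-side check: the query engine inspects the input format class
     * and skips its own per-partition type conversion when the marker is present.
     */
    static boolean isSelfDescribing(Class<?> inputFormatClass) {
        return SelfDescribingMarker.class.isAssignableFrom(inputFormatClass);
    }

    public static void main(String[] args) {
        System.out.println(isSelfDescribing(PlainInputFormat.class));    // false
        System.out.println(isSelfDescribing(EvolvingInputFormat.class)); // true
    }
}
```

In this sketch, adding `implements SelfDescribingMarker` is the only change needed to flip the result, which is why a change of this kind can be small even though it alters how the reader's output is interpreted.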

Impact

After performing schema evolution with spark-sql, queries from Hive will succeed.

Risk level (write none, low medium or high below)

none

Documentation Update

none

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

Zouxxyy avatar Aug 28 '23 02:08 Zouxxyy

@Zouxxyy Can you elaborate a little more on the purpose of this change? Does it risk breaking compatibility with older Hive versions?

danny0405 avatar Aug 28 '23 06:08 danny0405

@danny0405

Can you elaborate a little more on the purpose of this change?

See updated Change Logs.

Does it risk breaking compatibility with older Hive versions?

This interface (SelfDescribingInputFormatInterface) has existed since Hive 2.0, so there is no compatibility problem.

Zouxxyy avatar Aug 28 '23 08:08 Zouxxyy

@xushiyan @bvaradar Can someone help me understand why hudi-spark-common cannot automatically depend on hive-exec in hudi-hadoop-mr?

 mvn dependency:tree -pl hudi-spark-datasource/hudi-spark-common -Dspark2

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-spark-common_2.12 ---
[INFO] org.apache.hudi:hudi-spark-common_2.12:jar:0.15.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-hive-sync:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-sync-common:jar:0.15.0-SNAPSHOT:compile

Zouxxyy avatar Aug 30 '23 01:08 Zouxxyy

Here is the error from the integration tests. I don't know much about the integration-test environment; can anyone help?

2023-09-08T05:11:59.7764700Z Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.SelfDescribingInputFormatInterface
2023-09-08T05:11:59.7764906Z 	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
2023-09-08T05:11:59.7765092Z 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
2023-09-08T05:11:59.7765284Z 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
2023-09-08T05:11:59.7765373Z 	... 58 more
2023-09-08T05:11:59.7765560Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Shutdown hook called
2023-09-08T05:11:59.7766126Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b81218a3-32e6-4851-9b25-b15373acd05b
2023-09-08T05:11:59.7766507Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9b58a267-201d-4404-baeb-49e617b23ad1
2023-09-08T05:11:59.7766647Z 
2023-09-08T05:11:59.7766919Z Sep 08, 2023 5:11:59 AM org.glassfish.jersey.internal.Errors logErrors
2023-09-08T05:11:59.7767534Z WARNING: The following warnings have been detected: WARNING: Cannot create new registration for component type class com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider: Existing previous registration found for the type.
2023-09-08T05:11:59.7767548Z 
2023-09-08T05:11:59.7768090Z [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 96.75 s <<< FAILURE! - in org.apache.hudi.integ.command.ITTestHoodieSyncCommand

Zouxxyy avatar Sep 09 '23 03:09 Zouxxyy

You can take a look at the readme: https://github.com/apache/hudi/tree/master/hudi-integ-test

danny0405 avatar Sep 10 '23 01:09 danny0405

@Zouxxyy you need to run the exact same commands as shown in the logs in the docker environment to debug the failed integration test. It looks like the HiveSyncTool spark job fails due to a class not being found. Likely the new class, SelfDescribingInputFormatInterface, is not included in the bundle.
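One way to test the "not included in the bundle" hypothesis is to try loading the class with only the suspect jars on the classpath. The sketch below is an illustrative helper, not part of Hudi; the class name `ClasspathCheck` and method `isOnClasspath` are hypothetical, and the default class name is taken from the stack trace above.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be loaded by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class reported missing in the integration-test log.
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.hive.ql.io.SelfDescribingInputFormatInterface";
        System.out.println(name + " loadable: " + isOnClasspath(name));
    }
}
```

Running it with `java -cp` pointing at the bundle jar under test (plus the directory containing the compiled helper) would show whether that jar alone provides the class.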

Also wondering, how does SelfDescribingInputFormatInterface automatically fix the schema evolution (I don't see any API implemented)?

yihua avatar Sep 14 '23 22:09 yihua

@xushiyan @bvaradar Can someone help me understand why hudi-spark-common cannot automatically depend on hive-exec in hudi-hadoop-mr?

 mvn dependency:tree -pl hudi-spark-datasource/hudi-spark-common -Dspark2

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-spark-common_2.12 ---
[INFO] org.apache.hudi:hudi-spark-common_2.12:jar:0.15.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-hive-sync:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-sync-common:jar:0.15.0-SNAPSHOT:compile

I think this dependency is already in place?

yihua avatar Sep 14 '23 23:09 yihua

@yihua see https://github.com/apache/hudi/pull/7129. It turns out that this question has already been raised; I think he had the same problem as me. The hive dependency is not passed.

Zouxxyy avatar Sep 15 '23 01:09 Zouxxyy

@Zouxxyy have you figured out why integration tests failed in the GH actions?

yihua avatar Sep 22 '23 23:09 yihua

@Zouxxyy: Can you rebase and resolve the merge conflict? We can take a look at the test failure after that.

bvaradar avatar Dec 15 '23 19:12 bvaradar

Fixed conflicts and rebased. Made minor changes to align with latest code.

bvaradar avatar Dec 20 '23 21:12 bvaradar

CI report:

  • db65ca67c67d44d19cd7a56b734f846466feead6 Azure: SUCCESS
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Dec 21 '23 02:12 hudi-bot

@bvaradar Sorry for the delay, and thanks for your help. It seems that the CI is not stable.

Zouxxyy avatar Dec 21 '23 02:12 Zouxxyy

Rerunning failed jobs

bvaradar avatar Jan 10 '24 04:01 bvaradar