[HUDI-6760] Add SelfDescribingInputFormatInterface for hive FileInputFormat
Change Logs
Currently, after doing schema evolution using Spark SQL, querying with Hive fails:
```sql
-- spark-sql
set hoodie.schema.on.read.enable=true;
create table hudi_mor_test_tbl (
  id bigint,
  name string,
  ts int,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
)
partitioned by (dt, hh);
insert into hudi_mor_test_tbl values (1, 'a1', 1001, '2021-12-09', '10');
ALTER TABLE hudi_mor_test_tbl ALTER COLUMN ts TYPE bigint;

-- hive
select * from hudi_mor_test_tbl_rt;
```

```
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
```
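The failure mode can be illustrated in plain Java without Hadoop on the classpath. The two box classes below are hypothetical stand-ins for `org.apache.hadoop.io.IntWritable` and `org.apache.hadoop.io.LongWritable`; this is an illustrative sketch, not Hudi or Hive code:

```java
// Stand-ins for Hadoop's IntWritable / LongWritable box types.
class IntWritableLike { final int value; IntWritableLike(int v) { value = v; } }
class LongWritableLike { final long value; LongWritableLike(long v) { value = v; } }

public class CastFailureDemo {
    // The data file still stores `ts` as int, so the record reader hands back
    // an int-typed box; the object inspector built from the evolved bigint
    // schema downcasts the same object to the long-typed box and fails.
    public static String readEvolvedColumn() {
        Object fromReader = new IntWritableLike(1001); // file schema: int
        try {
            LongWritableLike l = (LongWritableLike) fromReader; // table schema: bigint
            return "value=" + l.value;
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(readEvolvedColumn()); // prints "ClassCastException"
    }
}
```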
The root cause is that FileInputFormat does not implement SelfDescribingInputFormatInterface; see:

```java
/**
 * Marker interface to indicate a given input format is self-describing and
 * can perform schema evolution itself.
 */
public interface SelfDescribingInputFormatInterface {
}
```
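Because a marker interface declares no methods, opting an input format into it is a one-line change. Below is a minimal, self-contained sketch of the pattern; the implementing class is a hypothetical stand-in, not Hudi's actual input format:

```java
// Marker interface: no methods, just a capability declaration.
interface SelfDescribingInputFormatInterface {
}

// Hypothetical stand-in for an input format that already applies the evolved
// schema while reading, so the engine should not convert types a second time.
class SelfDescribingFormat implements SelfDescribingInputFormatInterface {
}

public class MarkerDemo {
    // Consumers only ever test for the marker via instanceof.
    public static boolean isSelfDescribing(Object inputFormat) {
        return inputFormat instanceof SelfDescribingInputFormatInterface;
    }

    public static void main(String[] args) {
        System.out.println(isSelfDescribing(new SelfDescribingFormat())); // true
        System.out.println(isSelfDescribing(new Object()));               // false
    }
}
```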
Impact
After doing schema evolution using Spark SQL, querying with Hive will succeed.
Risk level (write none, low medium or high below)
none
Documentation Update
none
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
@Zouxxyy Can you elaborate a little more on the purpose of this change? Does it have any risk of breaking compatibility with lower Hive versions?
@danny0405
Can you elaborate a little more on the purpose of this change?
See updated Change Logs.
Does it have any risk of breaking compatibility with lower Hive versions?
This interface (SelfDescribingInputFormatInterface) has existed since Hive 2.0, so there is no compatibility problem.
@xushiyan @bvaradar Can someone help me understand why hudi-spark-common cannot automatically depend on hive-exec in hudi-hadoop-mr?
```
mvn dependency:tree -pl hudi-spark-datasource/hudi-spark-common -Dspark2
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-spark-common_2.12 ---
[INFO] org.apache.hudi:hudi-spark-common_2.12:jar:0.15.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-hive-sync:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-sync-common:jar:0.15.0-SNAPSHOT:compile
```
Here is the error in the integration tests. I don't know much about the integration test environment; can anyone help?
```
2023-09-08T05:11:59.7764700Z Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.SelfDescribingInputFormatInterface
2023-09-08T05:11:59.7764906Z 	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
2023-09-08T05:11:59.7765092Z 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
2023-09-08T05:11:59.7765284Z 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
2023-09-08T05:11:59.7765373Z 	... 58 more
2023-09-08T05:11:59.7765560Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Shutdown hook called
2023-09-08T05:11:59.7766126Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b81218a3-32e6-4851-9b25-b15373acd05b
2023-09-08T05:11:59.7766507Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9b58a267-201d-4404-baeb-49e617b23ad1
2023-09-08T05:11:59.7766919Z Sep 08, 2023 5:11:59 AM org.glassfish.jersey.internal.Errors logErrors
2023-09-08T05:11:59.7767534Z WARNING: The following warnings have been detected: WARNING: Cannot create new registration for component type class com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider: Existing previous registration found for the type.
2023-09-08T05:11:59.7768090Z [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 96.75 s <<< FAILURE! - in org.apache.hudi.integ.command.ITTestHoodieSyncCommand
```
You can take a look at the readme: https://github.com/apache/hudi/tree/master/hudi-integ-test
@Zouxxyy you need to run the exact same commands as shown in the logs in the docker environment to debug the failed integration test. It looks like the HiveSyncTool spark job fails due to a class-not-found error. Likely the new class, SelfDescribingInputFormatInterface, is not included in the bundle.
Also, I am wondering: how does SelfDescribingInputFormatInterface automatically fix schema evolution (I don't see any API implemented)?
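For what it's worth, a marker interface carries no behavior of its own; the effect comes from the consumer branching on it. A hedged sketch of that consumer-side dispatch follows (all names are illustrative, not Hive's actual internals):

```java
interface SelfDescribingMarker {
}

class SelfDescribingFormat implements SelfDescribingMarker {
}

class PlainFormat {
}

public class DispatchDemo {
    // Illustrative dispatch: skip engine-side type conversion when the input
    // format declares that it already resolves the schema itself.
    public static String plan(Object inputFormat) {
        return (inputFormat instanceof SelfDescribingMarker)
                ? "use reader output as-is"
                : "apply engine-side type conversion";
    }

    public static void main(String[] args) {
        System.out.println(plan(new SelfDescribingFormat())); // use reader output as-is
        System.out.println(plan(new PlainFormat()));          // apply engine-side type conversion
    }
}
```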
@xushiyan @bvaradar Can someone help to understand why hudi-spark-common cannot automatically depend on hive-exec in hudi-hadoop-mr?

```
mvn dependency:tree -pl hudi-spark-datasource/hudi-spark-common -Dspark2
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-spark-common_2.12 ---
[INFO] org.apache.hudi:hudi-spark-common_2.12:jar:0.15.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-hive-sync:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-sync-common:jar:0.15.0-SNAPSHOT:compile
```

I think this dependency is already the case?
@yihua see https://github.com/apache/hudi/pull/7129 , it turns out that this question has already been raised
I think he had the same problem as me: the Hive dependency is not passed through transitively.
@Zouxxyy have you figured out why integration tests failed in the GH actions?
@Zouxxyy: Can you rebase and resolve the merge conflict? We can take a look at the test failure after that.
Fixed conflicts and rebased. Made minor changes to align with latest code.
CI report:
- db65ca67c67d44d19cd7a56b734f846466feead6 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
@bvaradar Sorry for the delay, and thanks for your help. It seems that the CI is not stable.
Rerunning failed jobs