[Bug] "Hive DDL and Paimon schema mismatched" exception thrown when showing the table's schema with the timestamp type
Purpose
Fixes issue #5450.
This PR addresses three issues related to Hive3 compatibility with Paimon tables:
- Type Validation Failure for Timestamp LTZ: Hive3 stores the timestamp with local time zone type as timestamp in its metastore, while Paimon explicitly uses the type name timestamp with local time zone during type conversion. Although Hive3 natively supports this type, the metastore's type-name mismatch causes validation failures. We resolve this by allowing type compatibility validation to bypass the naming inconsistency.
- -8-Hour Timezone Offset in Query Results: After fixing the first issue, Hive3 successfully executes DDL operations but returns timestamp values with an incorrect -8-hour offset for the Beijing timezone. The root cause lies in the PaimonTimestampLocalTZObjectInspector.getPrimitiveJavaObject method (in the paimon-hive-connector-3.1 module), which fails to apply the session timezone during deserialization. According to the Parquet specification, timestamp with local time zone fields store UTC timestamps and require conversion to the local timezone during processing. We fix this by ensuring proper UTC-to-session-timezone conversion in the deserialization logic (see the sketch after this list).
- +8-Hour Offset in SparkSQL Queries After INSERT: When inserting Beijing local time into a Hive3-created table with a timestamp with local time zone field, querying via SparkSQL reveals an 8-hour offset. This is caused by Hive incorrectly treating timestamp data as local time during serialization, violating the timestamp with local time zone semantics. Specifically, the PaimonTimestampLocalTZObjectInspector.convert method erroneously converts Hive time data to a local Timestamp type instead of adhering to UTC. The fix converts the Hive time data to a UTC timestamp before storing it in Paimon's timestamp type, ensuring alignment with the expected UTC-based storage semantics.
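To make the two offsets concrete, here is a minimal, self-contained sketch using plain java.time (illustrative only; the class name LtzSemanticsDemo and the hard-coded Asia/Shanghai session zone are assumptions for the example, not connector code):

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class LtzSemanticsDemo {
    public static void main(String[] args) {
        ZoneId session = ZoneId.of("Asia/Shanghai"); // Beijing, UTC+8

        // Read path (issue 2): Parquet stores a UTC instant; rendering it
        // without applying the session zone is what produced the -8h offset.
        Instant stored = Instant.parse("2025-04-13T08:00:00Z");
        System.out.println(stored.atZone(session).toLocalDateTime()); // 2025-04-13T16:00

        // Write path (issue 3): treating Beijing wall-clock time as if it were
        // UTC is what produced the +8h offset; the fix normalizes to UTC first.
        LocalDateTime wall = LocalDateTime.parse("2023-01-12T20:00:01.12345");
        Instant utc = wall.atZone(session).toInstant();
        System.out.println(utc); // 2023-01-12T12:00:01.123450Z
    }
}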
References:
- https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc
- https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/timezone/#timestamp_ltz-type
- https://spark.apache.org/docs/latest/sql-ref-datatypes.html#TimestampType
- https://hive.apache.org/docs/latest/different-timestamp-types_103091503/
Linked issue: #5450
Tests
-- issue 1: show the schema of the table created by spark is ok
-- show create table is ok
hive> show create table test.data_type_table_ts;
OK
CREATE TABLE `test.data_type_table_ts`(
`id` int COMMENT 'from deserializer',
`ts` timestamp with local time zone COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.paimon.hive.PaimonSerDe'
STORED BY
'org.apache.paimon.hive.PaimonStorageHandler'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/test.db/data_type_table_ts'
TBLPROPERTIES (
'table_type'='PAIMON')
-- issue 2: select rows inserted by spark in hive is ok.
-- select is ok
hive> select * from test.data_type_table_ts;
OK
1 2025-04-13 16:00:00.0 PRC
2 2025-04-13 17:05:38.11 PRC
3 2025-04-15 13:46:30.0 PRC
-- select with functions is ok
hive> select id,from_unixtime(unix_timestamp(ts), 'yyyy-MM-dd HH:mm:ss') from test.data_type_table_ts;
OK
1 2025-04-13 16:00:00
2 2025-04-13 17:05:38
3 2025-04-15 13:46:30
-- issue 3: insert rows in hive and then select rows in hive and spark is ok.
-- create table using hive3 and insert rows
hive> show create table test.hive_test_table;
OK
CREATE TABLE `test.hive_test_table`(
`a` int COMMENT 'The a field',
`ts` timestamp with local time zone COMMENT 'The ts field')
ROW FORMAT SERDE
'org.apache.paimon.hive.PaimonSerDe'
STORED BY
'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1')
LOCATION
'hdfs://nameservice1/user/hive/warehouse/test.db/hive_test_table'
TBLPROPERTIES (
'bucketing_version'='2',
'last_modified_by'='hive',
'last_modified_time'='1745292917',
'transient_lastDdlTime'='1745292917')
hive> insert into test.hive_test_table values (9, '2023-01-12 20:00:01.12345');
-- query with spark and hive
spark-sql (default)> select * from test.hive_test_table where a = 9;
9 2023-01-12 20:00:01.12345
hive> select * from test.hive_test_table where a = 9;
OK
9 2023-01-12 20:00:01.12345 PRC
API and Format
no
Documentation
no
The UT case HiveWriteITCase#testInsertLocalZonedTimestamp failed.
Fixed, please review again. Thanks.
@zhuangchong could you help take a look? Thanks.
For the type validation failure for timestamp LTZ, I think the key problem is perhaps not the validation failure itself. Summarizing the main steps of creating a table whose metastore is HMS with Spark SQL:
1. Transform the Spark type to a Paimon type (org.apache.paimon.spark.SparkTypeUtils). `timestamp` in Spark will be transformed to `TIMESTAMP(6) WITH LOCAL TIME ZONE`.
2. Create the Paimon table in HiveCatalog (org.apache.paimon.hive.HiveCatalog#createTableImpl). When transforming the Paimon type to a Hive type, `TIMESTAMP(6) WITH LOCAL TIME ZONE` is transformed to `timestamp` in Hive, causing the subsequent validation to fail.
// org.apache.paimon.hive.HiveTypeUtils.PaimonToHiveTypeVisitor
@Override
public TypeInfo visit(LocalZonedTimestampType localZonedTimestampType) {
return LocalZonedTimestampTypeUtils.hiveLocalZonedTimestampType();
}
// org.apache.paimon.hive.LocalZonedTimestampTypeUtils
public static TypeInfo hiveLocalZonedTimestampType() {
try {
Class<?> typeInfoFactoryClass =
Class.forName("org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory");
Field field = typeInfoFactoryClass.getField("timestampLocalTZTypeInfo");
return (TypeInfo) field.get(null);
} catch (Exception e) {
return TypeInfoFactory.timestampTypeInfo;
}
}
In PR https://github.com/apache/paimon/pull/4571, the timestamp LTZ type is determined based on the Hive runtime version. Therefore, I think the key problem is to find out why TIMESTAMP(6) WITH LOCAL TIME ZONE is transformed to timestamp in Hive, not just to skip the validation.
Maybe you can test this in org.apache.paimon.spark.sql.DDLWithHiveCatalogTest in paimon-spark-3.5.
First, thank you for your previous response. The two-step process you mentioned about Spark table creation is correct. The first step of converting Spark types to Paimon types works properly. However, an issue occurs during the second step when using HiveCatalog (specifically org.apache.paimon.hive.HiveCatalog#createTableImpl) to create the Hive table. The error happens during the convertToFieldSchema conversion to metastore types, particularly at the LocalZonedTimestampTypeUtils#hiveLocalZonedTimestampType method.
As shown in the code snippet you referenced:
public static TypeInfo hiveLocalZonedTimestampType() {
try {
Class<?> typeInfoFactoryClass =
Class.forName("org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory");
Field field = typeInfoFactoryClass.getField("timestampLocalTZTypeInfo");
return (TypeInfo) field.get(null);
} catch (Exception e) {
// Exception occurs: java.lang.NoSuchFieldException: timestampLocalTZTypeInfo
return TypeInfoFactory.timestampTypeInfo;
}
}
The code itself is correct. It attempts to retrieve the timestampLocalTZTypeInfo property from Hive's TypeInfoFactory class in the hive-serde package. In Hive 3, this property exists to support the timestamp with local time zone type, which is a new feature compared to Hive 2. However, when executing the table creation via Spark SQL (Spark 3.5), the issue arises because Spark 3.5 internally bundles Hive 2.3.9. The Hive 2.x hive-serde dependency lacks the timestampLocalTZTypeInfo field, causing the code to fall back to returning timestampTypeInfo. Consequently, the metastore stores the field type as regular timestamp instead of timestamp with local time zone.
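To make the version difference easy to check, here is a minimal reflection probe (illustrative; the class name TimestampLtzProbe is an assumption for the example) that reports which behavior the hive-serde jar on the classpath will produce:

import java.lang.reflect.Field;

public class TimestampLtzProbe {
    public static void main(String[] args) throws Exception {
        Class<?> factory =
                Class.forName("org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory");
        try {
            Field field = factory.getField("timestampLocalTZTypeInfo");
            // Hive 3.x serde: the LTZ type info exists.
            System.out.println("Hive 3 serde, LTZ supported: " + field.get(null));
        } catch (NoSuchFieldException e) {
            // Hive 2.x serde (e.g. the 2.3.9 bundled with Spark 3.5): the
            // field is missing, so the code falls back to plain timestamp.
            System.out.println("Hive 2 serde, falling back to timestamp");
        }
    }
}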
When using a native Hive 3 client to create Paimon tables, the timestamp with local time zone type is correctly preserved. To fundamentally resolve this Spark-related issue, one would need to upgrade Spark's built-in Hive dependency to Hive 3. However, this is highly challenging due to Spark's deep integration with Hive 2.x at the code level. Modifying Spark's internal dependencies would be non-trivial and falls outside Paimon's scope of responsibility.
Therefore, a more practical solution for Paimon users, as implemented in my PR, is to relax the type validation checks in Paimon's code. This adjustment allows the system to gracefully handle Spark's Hive 2.x limitations while maintaining compatibility with Hive 3 environments.
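For illustration, the relaxation can be thought of as a compatibility predicate like the following (a hypothetical helper sketching the idea, not the actual patch; the real fix lives in Paimon's schema validation path):

import org.apache.paimon.types.DataType;
import org.apache.paimon.types.LocalZonedTimestampType;

public class TypeCompatibilitySketch {
    // Accept Hive's plain "timestamp" for a Paimon LTZ column, because a
    // Hive 2 metastore (e.g. Spark 3.5's bundled 2.3.9) cannot record the
    // "timestamp with local time zone" type name.
    static boolean isCompatible(String hiveTypeName, DataType paimonType) {
        if (paimonType instanceof LocalZonedTimestampType) {
            return hiveTypeName.equals("timestamp")
                    || hiveTypeName.equals("timestamp with local time zone");
        }
        // Simplified: all other types keep strict name comparison here.
        return hiveTypeName.equalsIgnoreCase(paimonType.toString());
    }
}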
@Jack1007 Thanks for your explanation! It seems like a compatibility problem between Hive 2 and Hive 3. Could you please separate this PR into two PRs, one for type validation and one for the time offset? And create a new issue to describe the time offset problem in detail, as the handling of time offsets in Hive is complex.
OK. This PR will focus on fixing the timestamp type validation issue. I will create a new PR specifically to address and explain the timezone offset problems along with the proposed resolution steps.
I have split this PR into two separate ones. This PR retains only the code fixes for Hive timestamp type validation, while the other PR specifically addresses and explains the time zone offset issue with the timestamp with local time zone data type in Hive.
The other PR can be found here: #5571
@LsomeYeah Please review this PR again. Thank you!