[SPARK-38639][HIVE] Ignore corrupted rows that fail to deserialize in Hive sequence tables
What changes were proposed in this pull request?
Original pr: https://github.com/apache/spark/pull/35963
When reading a Hive sequence table, users can now choose whether to skip corrupt records that fail to deserialize.
This PR adds a new configuration, spark.sql.hive.ignoreCorruptRows, which defaults to false. A sketch of how the flag could be declared is shown below.
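A minimal sketch, assuming the flag is declared alongside the other SQL configs via SQLConf.buildConf; only the key name and default value come from this PR, while the declaration site and doc text are assumptions:

```scala
// Hypothetical declaration sketch: the key and default are from this PR;
// everything else (placement, doc wording) is an assumption.
import org.apache.spark.sql.internal.SQLConf.buildConf

val HIVE_IGNORE_CORRUPT_ROWS = buildConf("spark.sql.hive.ignoreCorruptRows")
  .doc("When true, rows of a Hive sequence table that fail to deserialize " +
    "are skipped instead of failing the task.")
  .booleanConf
  .createWithDefault(false)
```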
Why are the changes needed?
If skipping corrupt records is not supported, a task that reads a file containing corrupt records fails the whole job, so even the normal data files cannot be read.
The error stack trace:
```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 8.0 failed 4 times, most recent failure: Lost task 30.3 in stage 8.0 (TID 5380) (zjy-hadoop-prc-st2338.bj executor 425): org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.protocol.TProtocolException: don't know what type: 15
	at org.apache.hadoop.hive.serde2.thrift.ThriftByteStreamTypedSerDe.deserialize(ThriftByteStreamTypedSerDe.java:80)
	at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.deserialize(ThriftDeserializer.java:74)
	at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:485)
Caused by: org.apache.thrift.protocol.TProtocolException: don't know what type: 15
	at org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:898)
	at org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:562)
	at com.xiaomi.data.spec.log.misearch.AiMusicSearchLog.read(AiMusicSearchLog.java:2418)
	at org.apache.hadoop.hive.serde2.thrift.ThriftByteStreamTypedSerDe.deserialize(ThriftByteStreamTypedSerDe.java:78)
```
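For illustration, here is a minimal sketch of the skipping logic around the SerDe call in HadoopTableReader.fillObject (TableReader.scala); `safeDeserialize` is a hypothetical helper name, the actual PR diff may differ, and logging is elided:

```scala
import org.apache.hadoop.hive.serde2.{Deserializer, SerDeException}
import org.apache.hadoop.io.Writable

// Returns None for a corrupt row when the flag is on, instead of
// letting the SerDeException fail the whole task.
def safeDeserialize(
    deserializer: Deserializer,
    value: Writable,
    ignoreCorruptRows: Boolean): Option[AnyRef] = {
  try {
    Option(deserializer.deserialize(value))
  } catch {
    case _: SerDeException if ignoreCorruptRows => None
  }
}

// Corrupt rows are then dropped inside the record iterator:
// iter.flatMap(value => safeDeserialize(deserializer, value, ignoreCorruptRows))
```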
Does this PR introduce any user-facing change?
Yes. It adds a new configuration, spark.sql.hive.ignoreCorruptRows, with a default value of false; see the usage example below.
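A short usage example (the table name is a placeholder):

```scala
// Enable the new flag before reading a Hive sequence table whose files
// may contain corrupt rows; my_seq_table is a hypothetical table name.
spark.conf.set("spark.sql.hive.ignoreCorruptRows", "true")
spark.sql("SELECT * FROM my_seq_table").show()
```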
How was this patch tested?
Manually tested.
Gently ping @cloud-fan.
Does Hive have this feature?
Can one of the admins verify this patch?
> Does Hive have this feature?

@cloud-fan Thank you for your reply. I tested this with Hive as well and hit the same deserialization exception, so I think Hive currently does not support skipping corrupted rows either.
However, corrupted rows make the entire file unreadable, which can cause very serious problems: the whole file is effectively lost and cannot be recovered. In addition, the industry is gradually migrating Hive SQL to Spark SQL, so this feature may be very necessary for Spark.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!