[SPARK-38639][HIVE] Ignore corrupted rows that fail to deserialize in Hive sequence tables
What changes were proposed in this pull request?
Original pr: https://github.com/apache/spark/pull/35963
When reading a Hive sequence table, users can now choose whether to skip corrupt records that fail to deserialize.
This PR adds a new configuration, spark.sql.hive.ignoreCorruptRows, which defaults to false. A sketch of how the flag could be declared is shown below.
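A minimal sketch, assuming the flag is declared alongside the other SQL configs via SQLConf.buildConf; only the key name and default value come from this PR, while the declaration site and doc text are assumptions:

```scala
// Hypothetical declaration sketch: the key and default are from this PR;
// everything else (placement, doc wording) is an assumption.
import org.apache.spark.sql.internal.SQLConf.buildConf

val HIVE_IGNORE_CORRUPT_ROWS = buildConf("spark.sql.hive.ignoreCorruptRows")
  .doc("When true, rows of a Hive sequence table that fail to deserialize " +
    "are skipped instead of failing the task.")
  .booleanConf
  .createWithDefault(false)
```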
Why are the changes needed?
If skipping corrupt records is not supported, a task that reads a file containing corrupt records fails the whole job, so even the normal data files cannot be read.
The error stack trace:
```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 8.0 failed 4 times, most recent failure: Lost task 30.3 in stage 8.0 (TID 5380) (zjy-hadoop-prc-st2338.bj executor 425): org.apache.hadoop.hive.serde2.SerDeException: org.apache.thrift.protocol.TProtocolException: don't know what type: 15
	at org.apache.hadoop.hive.serde2.thrift.ThriftByteStreamTypedSerDe.deserialize(ThriftByteStreamTypedSerDe.java:80)
	at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.deserialize(ThriftDeserializer.java:74)
	at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:485)
Caused by: org.apache.thrift.protocol.TProtocolException: don't know what type: 15
	at org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:898)
	at org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:562)
	at com.xiaomi.data.spec.log.misearch.AiMusicSearchLog.read(AiMusicSearchLog.java:2418)
	at org.apache.hadoop.hive.serde2.thrift.ThriftByteStreamTypedSerDe.deserialize(ThriftByteStreamTypedSerDe.java:78)
```
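For illustration, here is a minimal sketch of the skipping logic around the SerDe call in HadoopTableReader.fillObject (TableReader.scala); `safeDeserialize` is a hypothetical helper name, the actual PR diff may differ, and logging is elided:

```scala
import org.apache.hadoop.hive.serde2.{Deserializer, SerDeException}
import org.apache.hadoop.io.Writable

// Returns None for a corrupt row when the flag is on, instead of
// letting the SerDeException fail the whole task.
def safeDeserialize(
    deserializer: Deserializer,
    value: Writable,
    ignoreCorruptRows: Boolean): Option[AnyRef] = {
  try {
    Option(deserializer.deserialize(value))
  } catch {
    case _: SerDeException if ignoreCorruptRows => None
  }
}

// Corrupt rows are then dropped inside the record iterator:
// iter.flatMap(value => safeDeserialize(deserializer, value, ignoreCorruptRows))
```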
Does this PR introduce any user-facing change?
Yes. It adds a new configuration, spark.sql.hive.ignoreCorruptRows, with a default value of false; see the usage example below.
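A short usage example (the table name is a placeholder):

```scala
// Enable the new flag before reading a Hive sequence table whose files
// may contain corrupt rows; my_seq_table is a hypothetical table name.
spark.conf.set("spark.sql.hive.ignoreCorruptRows", "true")
spark.sql("SELECT * FROM my_seq_table").show()
```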
How was this patch tested?
Manually tested.
Gently ping @cloud-fan.
Does Hive have this feature?
Can one of the admins verify this patch?
> Does Hive have this feature?

@cloud-fan Thank you for your reply. I tested this with Hive as well and hit the same deserialization exception, so I think Hive currently does not support skipping corrupted rows either.
However, corrupted rows make the entire file unreadable, which can cause very serious problems: the whole file is effectively lost and cannot be recovered. In addition, the industry is gradually migrating Hive SQL to Spark SQL, so this feature may be very necessary for Spark.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!