[FEA] Support HiveTableScanExec to scan a Hive text table
I wish we could support `HiveTableScanExec` scanning a Hive text table on the GPU.
Say you already created a Hive text table using the code in https://github.com/NVIDIA/spark-rapids/issues/6419, then query it:
```scala
spark.sql("select * from test_hive_text_table").collect
```
Not-supported message:
```
! <HiveTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.hive.execution.HiveTableScanExec
```
A couple of questions for clarification:
- What is the priority of data types to be supported for Hive's text file-format? Specifically, are non-primitive types (e.g. `ARRAY`, `STRUCT`, `MAP`) an immediate priority?
- Are custom field / row delimiters important for the first cut? Hive (and likely Spark) support overriding the delimiter specs, such that CSV might well be a subset of the default Hive text serialization format.
(I'm having a little trouble locating the official documentation for the following.)
The default Hive text serialization format uses LazySimpleSerde, with `\n` as the record delimiter, and a series of single-byte field delimiters, beginning with `^A`:
- First-level field delimiter: `^A`, i.e. `\u0001`.
- Second-level field delimiter: `^B`, i.e. `\u0002`. (E.g. the element separator for top-level arrays.)
- Third-level field delimiter: `^C`, i.e. `\u0003`. (E.g. the element separator for `ARRAY<ARRAY>` elements, or the key-value separator in top-level maps.)
- Et cetera.
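To make the level-based delimiter scheme concrete, here is a minimal Python sketch. It is a toy stand-in, not Hive's actual LazySimpleSerde (escape and `NULL` handling are omitted), and the function names are illustrative:

```python
# Toy sketch of LazySimpleSerde-style serialization: each nesting level
# uses the next control character as its delimiter (^A, ^B, ^C, ...).
# Not Hive's real implementation; escapes and NULLs are not handled.
DELIMS = ["\x01", "\x02", "\x03", "\x04"]

def serialize(value, level=0):
    """Join nested lists/dicts using the delimiter for the current level."""
    if isinstance(value, dict):
        # Map entries use this level's delimiter; each key-value pair is
        # separated by the delimiter one level deeper.
        kv_delim = DELIMS[level + 1]
        return DELIMS[level].join(
            f"{serialize(k, level + 2)}{kv_delim}{serialize(v, level + 2)}"
            for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return DELIMS[level].join(serialize(v, level + 1) for v in value)
    return str(value)

def serialize_row(fields):
    # Top-level fields are separated by ^A; the record ends with \n.
    return DELIMS[0].join(serialize(f, 1) for f in fields) + "\n"

row = serialize_row(["Mercedes",
                     ["Hamilton", "Russell"],
                     {"Spa": 4, "Monza": 3, "MarinaBay": 9}])
# Render the control characters visibly, caret-notation style.
print(row.replace("\x01", "^A").replace("\x02", "^B").replace("\x03", "^C"))
```

Running this on the example record below reproduces the `^A`/`^B`/`^C` layout shown in the serialized output.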
Consider the following table schema:
```sql
CREATE TABLE f1_team(
  name STRING,
  drivers ARRAY<STRING>,
  top_positions MAP<STRING, INT>)
STORED AS TEXTFILE...
```
Consider the following records in said table:
```
{ "Mercedes", [ "Hamilton", "Russell" ], { "Spa":4, "Monza":3, "MarinaBay":9 } },
{ "RedBull", [ "Verstappen", "Perez" ], { "Spa":1, "Monza":1, "MarinaBay":1 } }
```
By default, the Hive text format will serialize these records (in Hive and Spark) as follows:
```
Mercedes^AHamilton^BRussell^ASpa^C4^BMonza^C3^BMarinaBay^C9\n
RedBull^AVerstappen^BPerez^ASpa^C1^BMonza^C1^BMarinaBay^C1\n
```
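Going the other direction, deserialization splits on the same control characters, level by level. A hypothetical sketch (`parse_record` and its toy `schema` argument are illustrative, not Hive code; values are left as strings rather than cast to `INT`):

```python
# Toy sketch of splitting a default-format Hive text record back into
# fields. 'schema' is a simplified stand-in for the real table schema:
# one of "str", "array", or "map" per top-level column.
def parse_record(line, schema):
    fields = line.rstrip("\n").split("\x01")   # ^A: top-level fields
    out = []
    for raw, kind in zip(fields, schema):
        if kind == "array":
            out.append(raw.split("\x02"))      # ^B: array elements
        elif kind == "map":
            out.append(dict(item.split("\x03", 1)   # ^C: key-value split
                            for item in raw.split("\x02")))
        else:
            out.append(raw)
    return out

line = "Mercedes\x01Hamilton\x02Russell\x01Spa\x034\x02Monza\x033\x02MarinaBay\x039\n"
print(parse_record(line, ["str", "array", "map"]))
```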
A couple of notes regarding the CSV reader in libcudf as it currently stands:
- There is no support for nested types: no `LIST` (`ARRAY`), no `STRUCT`, and certainly no `MAP`.
- The field delimiter is configurable via `csv_reader_options::set_delimiter()`. Setting this to `\u0001` might provide some semblance of the Hive default text serialization format.
- `read_csv()` does column type inference via heuristics. This may be overridden with an explicit type schema (via `dtypes`), but nested types aren't supported.
Using the existing CSV reader to support Hive's text serialization format will prove limited in scope. I'm still investigating.
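The delimiter-override idea can be illustrated with Python's stdlib `csv` module standing in for libcudf (this sketch does not call `csv_reader_options::set_delimiter()` itself): reading default Hive text as CSV with `\u0001` as the separator recovers the top-level columns, but nested fields remain opaque strings:

```python
import csv
import io

# Treat default Hive text as "CSV" with \u0001 as the field delimiter,
# mirroring the effect of csv_reader_options::set_delimiter('\x01').
data = io.StringIO(
    "Mercedes\x01Hamilton\x02Russell\x01Spa\x034\x02Monza\x033\n"
    "RedBull\x01Verstappen\x02Perez\x01Spa\x031\x02Monza\x031\n")
rows = list(csv.reader(data, delimiter="\x01"))
for name, drivers, positions in rows:
    # Only the top level is split; ^B and ^C stay embedded in the strings,
    # which is exactly the limitation described above for nested types.
    print(name, repr(drivers), repr(positions))
```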
A couple of questions for clarification:
1. What is the priority of data types to be supported for Hive's text file-format? Specifically, are non-primitive types (e.g. `ARRAY`, `STRUCT`, `MAP`) an immediate priority?
Complex types are lower priority than primitive types.
2. Are custom field / row delimiters important for the first cut? Hive (and likely Spark) support overriding the delimiter specs, such that CSV might well be a subset of the default Hive text serialization format.
Custom field delimiters are not a priority for the first cut.
This issue is partially addressed by the following PR: https://github.com/NVIDIA/spark-rapids/pull/7013
In the 22.12 release, there is support for default Hive text tables (i.e. `^A`-separated fields, newline-separated records, `\` for escapes).
This also has support for partitioned Hive text tables.
The following are not supported:
- Custom delimiters (i.e. other than `^A` with newlines).
- Complex data types like `STRUCT`, `ARRAY`, `MAP`, `BINARY`.
Additionally, there are some deviations from the behaviour of Hive's LazySimpleSerDe:
- NVIDIA/spark-rapids/issues/7068
- NVIDIA/spark-rapids/issues/7069
- NVIDIA/spark-rapids/issues/7085
- NVIDIA/spark-rapids/issues/7086
The GPU reader should still be safe to use, if the conditions described above do not occur in the input data.
This issue should be safe to close now. The pending items are as follows:
- Support for complex types (`ARRAY`, `STRUCT`, `MAP`, `BINARY`).
- Support for escape characters.
- Support for custom row/column delimiters, custom escape characters, etc.
This issue captures reading primitive types, with default table settings. We can pursue the remainder in separate issues, whenever they are prioritized.