
[FEA] Support HiveTableScanExec to scan a Hive text table

Open viadea opened this issue 3 years ago • 3 comments

I wish we could support `HiveTableScanExec` to scan a Hive text table.

Say you have already created a Hive text table using the code in https://github.com/NVIDIA/spark-rapids/issues/6419, then query it:

spark.sql("select * from test_hive_text_table").collect

Not-supported-messages:

! <HiveTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.hive.execution.HiveTableScanExec

viadea avatar Aug 26 '22 02:08 viadea

A couple of questions for clarification:

  1. What is the priority of data types to be supported for Hive's text file-format? Specifically, are non-primitive types (e.g. ARRAY, STRUCT, MAP) an immediate priority?
  2. Are custom field / row delimiters important for the first cut? Hive (and likely Spark) support overriding the delimiter specs, such that CSV might well be a subset of the default Hive text serialization format.

mythrocks avatar Oct 11 '22 20:10 mythrocks

(I'm having a little trouble locating the official documentation for the following.)

The default Hive text serialization format uses LazySimpleSerDe, with \n as the record delimiter and a series of single-byte field delimiters, beginning with ^A:

  1. First level field delimiter: ^A, i.e. \u0001.
  2. Second level field delimiter: ^B, i.e. \u0002. (E.g. Array element separator for top-level arrays.)
  3. Third level field delimiter: ^C, i.e. \u0003. (E.g. Array element separator for ARRAY<ARRAY> elements, or key-value separators in top-level maps.)
  4. Et cetera.

Consider the following table schema:

CREATE TABLE f1_team(
  name STRING,
  drivers ARRAY<STRING>,
  top_positions MAP<STRING, INT>) 
STORED AS TEXTFILE...

Consider the following records in said table:

{ "Mercedes", [ "Hamilton", "Russell" ], { "Spa":4, "Monza":3, "MarinaBay":9 } },
{ "RedBull",  [ "Verstappen", "Perez" ], { "Spa":1, "Monza":1, "MarinaBay":1 } }

By default, the Hive text format will serialize these records (in Hive and Spark) as follows:

Mercedes^AHamilton^BRussell^ASpa^C4^BMonza^C3^BMarinaBay^C9\n
RedBull^AVerstappen^BPerez^ASpa^C1^BMonza^C1^BMarinaBay^C1\n
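The delimiter hierarchy above can be sketched in Python. This is an illustration of the nesting scheme, not the actual LazySimpleSerDe code, and the helper names (`serialize`, `serialize_row`) are hypothetical:

```python
# Illustrative sketch of Hive's default text serialization, assuming one
# single-byte separator per nesting level (^A = \x01, ^B = \x02, ^C = \x03, ...).
# Hypothetical helpers; not the actual LazySimpleSerDe code.

def serialize(value, level):
    """Serialize a value, using chr(level) as this level's separator."""
    if isinstance(value, list):
        # Array elements are joined at this level; elements go one level deeper.
        return chr(level).join(serialize(v, level + 1) for v in value)
    if isinstance(value, dict):
        # Map entries join at this level; keys/values split one level deeper.
        return chr(level).join(
            serialize(k, level + 2) + chr(level + 1) + serialize(v, level + 2)
            for k, v in value.items())
    return str(value)

def serialize_row(fields):
    # Top-level fields are joined by ^A; nested collections start at ^B.
    return chr(1).join(serialize(f, 2) for f in fields) + "\n"

row = ["Mercedes", ["Hamilton", "Russell"],
       {"Spa": 4, "Monza": 3, "MarinaBay": 9}]
line = serialize_row(row)
# Render the control characters as ^A/^B/^C for display.
visible = "".join("^" + chr(ord(c) + 64) if 0 < ord(c) < 32 else c
                  for c in line.rstrip("\n"))
print(visible)  # Mercedes^AHamilton^BRussell^ASpa^C4^BMonza^C3^BMarinaBay^C9
```

Running this on the `f1_team` rows reproduces the serialized records shown above.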

A couple of notes regarding the CSV reader in libcudf as it currently stands:

  1. There is no support for nested types. No LIST (ARRAY), no STRUCT. Certainly no MAP.
  2. The field delimiter is configurable via csv_reader_options::set_delimiter(). Setting this to \u0001 might provide some semblance of the Hive default text serialization format.
  3. read_csv() does column type inference via heuristics. This may be overridden with an explicit type schema (via csv_reader_options::set_dtypes()), but nested types aren't supported.
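As a rough illustration of point 2, a default-format Hive text file with only primitive columns parses cleanly once the field delimiter is overridden to \u0001. Here pandas' CSV reader stands in for libcudf's; the data and column names are invented for the example:

```python
import io
import pandas as pd

# A default-format Hive text fragment with primitive columns only:
# fields separated by ^A (\x01), records separated by newline.
# (Made-up example data; a real table would be read from storage.)
data = "Mercedes\x018\x01Germany\nRedBull\x016\x01Austria\n"

# Override the field delimiter, much as csv_reader_options::set_delimiter()
# would in libcudf, and supply an explicit schema to bypass type inference.
df = pd.read_csv(
    io.StringIO(data),
    sep="\x01",
    header=None,
    names=["team", "titles", "country"],
    dtype={"team": str, "titles": int, "country": str},
)
print(df.to_dict("list"))
```

Nested columns have no equivalent here, which is exactly the limitation noted in point 1.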

Using the existing CSV reader to support Hive's text serialization format will prove limited in scope. I'm still investigating.

mythrocks avatar Oct 11 '22 22:10 mythrocks

> A couple of questions for clarification:
>
> 1. What is the priority of data types to be supported for Hive's text file-format? Specifically, are non-primitive types (e.g. `ARRAY`, `STRUCT`, `MAP`) an immediate priority?

Complex types are lower priority than primitive types.

> 2. Are custom field / row delimiters important for the first cut? Hive (and likely Spark) support overriding the delimiter specs, such that CSV might well be a subset of the default Hive text serialization format.

Custom field delimiters are not a priority for the first cut.

sameerz avatar Oct 13 '22 20:10 sameerz

This issue is partially addressed by the following PR: https://github.com/NVIDIA/spark-rapids/pull/7013

In the 22.12 release, there is support for default-format Hive text tables (i.e. ^A-separated fields, newline-separated records, \ for escapes). This also includes support for partitioned Hive text tables.

The following are not supported:

  1. Custom delimiters (i.e. other than ^A with newlines).
  2. Complex data types such as STRUCT, ARRAY, MAP, and BINARY.

Additionally, there are some deviations from the behaviour of Hive's LazySimpleSerDe:

  1. https://github.com/NVIDIA/spark-rapids/issues/7068
  2. https://github.com/NVIDIA/spark-rapids/issues/7069
  3. https://github.com/NVIDIA/spark-rapids/issues/7085
  4. https://github.com/NVIDIA/spark-rapids/issues/7086

The GPU reader should still be safe to use, provided the conditions described in those issues do not occur in the input data.

mythrocks avatar Nov 30 '22 18:11 mythrocks

This issue should be safe to close now. The pending items are as follows:

  1. Support for complex types (ARRAY, STRUCT, MAP, BINARY).
  2. Support for escape characters.
  3. Support for custom row/column delimiters, custom escape characters, etc.

This issue captures reading primitive types, with default table settings. We can pursue the remainder in separate issues, whenever they are prioritized.

mythrocks avatar Feb 14 '23 18:02 mythrocks