
geoparquet table used by Hive or other components

Open MyqueWooMiddo opened this issue 2 years ago • 4 comments

Expected behavior

We use Sedona's GIS functions to generate a geometry column, then write the table with df.write.format("geoparquet").saveAsTable(xxx) or spark.sql("create table xx as "). We expect the actual schema to be shown in Spark, Hive, and Impala alike.

Actual behavior

The actual schema is shown only in Spark. The Hive metastore (HMS) records the table as SequenceFile, and Hive / Impala show the column as an Array type.

Settings

Sedona version = 1.4.0

Apache Spark version = 3.2.2

API type = Pure SQL & spark-shell

Scala version = 2.12

JRE version = 1.8

Environment = Ambari+apache community components

MyqueWooMiddo avatar May 06 '23 09:05 MyqueWooMiddo

As far as I know, Hive and Impala do not natively support geometry types or the GeoParquet specification, so the best thing they can do is to read the geometry column in the GeoParquet file as its physical type, which is BINARY type. Users have to decode the binary value as WKB manually. I think Apache Sedona can hardly help improve this situation.
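To illustrate what decoding the binary value as WKB involves, here is a minimal Python sketch for the simplest case, a 2D point. The function name `decode_wkb_point` and the sample coordinates are made up for this example; real GeoParquet data will contain other geometry types too, and a WKB-aware library would normally do this work.

```python
import struct


def decode_wkb_point(wkb: bytes) -> tuple:
    """Decode a 2D WKB Point into (x, y).

    Layout: 1 byte for byte order, a 4-byte unsigned geometry type,
    then two 8-byte doubles for x and y (21 bytes in total).
    """
    if len(wkb) != 21:
        raise ValueError("a 2D WKB point is exactly 21 bytes")
    byte_order = wkb[0]
    if byte_order not in (0, 1):
        raise ValueError("unknown WKB byte order flag")
    # 1 means little-endian (NDR), 0 means big-endian (XDR).
    endian = "<" if byte_order == 1 else ">"
    (geom_type,) = struct.unpack(endian + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError("this sketch only handles geometry type 1 (Point)")
    x, y = struct.unpack(endian + "dd", wkb[5:21])
    return (x, y)


# Hypothetical sample value: a little-endian point, as it might come back
# from a BINARY column.
sample = struct.pack("<BIdd", 1, 1, 30.0, 10.0)
point = decode_wkb_point(sample)
```

The same idea extends to linestrings and polygons, but those carry point counts and nested rings, so the parsing is more involved than shown here.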

Kontinuation avatar May 06 '23 14:05 Kontinuation

@Kontinuation

Even though Hive / Impala do not support the geometry type, they should still support the other types in a GeoParquet file.

In fact, when we run "show create table xxx" in Hive / Impala, the file format is SequenceFile rather than Parquet, and the only column is an ARRAY, which cannot be accessed by any query.

MyqueWooMiddo avatar May 11 '23 08:05 MyqueWooMiddo

It is strange that Hive did not treat GeoParquet files as Parquet files. I'll set up a Hive metastore and try reproducing this problem.

Kontinuation avatar May 11 '23 08:05 Kontinuation

Spark datasource tables cannot be accessed by other query engines. This is not specific to GeoParquet; CSV tables behave the same way.
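One possible workaround, not confirmed anywhere in this thread, is to register a separate Hive-compatible external table over the same files, declaring the geometry column as BINARY. The sketch below only builds the DDL string; the helper name `hive_parquet_ddl`, the table name, the columns, and the location are all hypothetical, and whether Hive / Impala can actually read the files this way would need to be verified.

```python
def hive_parquet_ddl(table, columns, location):
    """Build a Hive CREATE EXTERNAL TABLE statement over existing Parquet files.

    columns is a list of (name, hive_type) pairs. A geometry column would be
    declared as BINARY, since Hive / Impala see only the physical type.
    """
    column_lines = ",\n".join(
        "  `{}` {}".format(name, hive_type) for name, hive_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE {} (\n{}\n)\n"
        "STORED AS PARQUET\n"
        "LOCATION '{}'"
    ).format(table, column_lines, location)


# Hypothetical table, columns, and path, for illustration only.
ddl = hive_parquet_ddl(
    "geo_points_hive",
    [("id", "BIGINT"), ("geom", "BINARY")],
    "/warehouse/geo_points",
)
print(ddl)
```

The resulting statement would be run in Hive rather than through the Spark datasource writer, so the metastore entry describes the files as Parquet instead of relying on Spark-specific table properties.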

MyqueWooMiddo avatar Aug 30 '23 03:08 MyqueWooMiddo