
Allow preserving column type parameters

Open · kposborne opened this issue 3 years ago

I would like to be able to load Parquet data with this connector into BigQuery tables whose columns have parameterized data types (e.g. NUMERIC(10, 2), STRING(20)), without wiping the parameters out. I tried SaveMode.Append, which resulted in the WRITE_APPEND writeDisposition, but the type parameters were still removed after loading.
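
For reference, a minimal sketch of the kind of write that hits this (the dataset, table, bucket, and input path below are placeholders, and I'm assuming the indirect Parquet write path):

import org.apache.spark.sql.SaveMode

// Placeholder input; any Parquet source will do.
val df = spark.read.parquet("gs://my-bucket/input/")

df.write
  .format("bigquery")
  .option("temporaryGcsBucket", "my-temp-bucket") // indirect (Parquet) write path
  .mode(SaveMode.Append)                          // becomes writeDisposition=WRITE_APPEND
  .save("my_dataset.my_table")                    // destination has NUMERIC(10,2)/STRING(20) columns

The append itself succeeds; it's the destination table's NUMERIC(p, s) / STRING(n) parameters that come back stripped afterwards.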

kposborne · Feb 18 '22

Bump! This issue still applies; would love some help from the maintainers!

sixdimensionalarray · Jun 02 '23

Fixed for numeric types in version 0.31.1
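
For context, Spark's DecimalType carries its precision and scale in the DataFrame schema itself, which is presumably what makes the numeric case fixable. A quick sketch, assuming any Spark 3.x shell:

import org.apache.spark.sql.types._

// DecimalType keeps its parameters in the Spark schema, so the connector can
// read them back at write time; char/varchar, by contrast, are erased to plain
// StringType on read (see the session below).
val schema = StructType(Seq(StructField("amount", DecimalType(10, 2))))
schema("amount").dataType  // DecimalType(10,2), visible to the connector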

davidrabinowitz · Jul 10 '23

Reading into a Spark DataFrame loses the type parameters for string and bytes columns, so this cannot be supported for those types.

Tried this:

scala> spark.sql("create table temp(name varchar(20)) using parquet")
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
24/05/22 08:21:58 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("desc formatted temp").show(100, truncate=false)
+----------------------------+--------------------------------------------------------------+-------+
|col_name                    |data_type                                                     |comment|
+----------------------------+--------------------------------------------------------------+-------+
|name                        |varchar(20)                                                   |NULL   |
|                            |                                                              |       |
|# Detailed Table Information|                                                              |       |
|Catalog                     |spark_catalog                                                 |       |
|Database                    |default                                                       |       |
|Table                       |temp                                                          |       |
|Owner                       |vkarve_google_com                                             |       |
|Created Time                |Wed May 22 08:21:58 UTC 2024                                  |       |
|Last Access                 |UNKNOWN                                                       |       |
|Created By                  |Spark 3.5.0                                                   |       |
|Type                        |MANAGED                                                       |       |
|Provider                    |parquet                                                       |       |
|Location                    |hdfs://vkarve-iceberg-22-m/user/hive/warehouse/temp           |       |
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
+----------------------------+--------------------------------------------------------------+-------+


scala> spark.sql("select * from temp").printSchema
root
 |-- name: string (nullable = true)


scala> spark.sql("select * from temp").schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true))

vishalkarve15 · May 22 '24

Trying to force Spark to create such a DataFrame doesn't work either.

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val mytypes = StructType(Seq(StructField("name", VarcharType(20), true)))
mytypes: org.apache.spark.sql.types.StructType = StructType(StructField(name,VarcharType(20),true))

scala> spark.createDataFrame(df.rdd, mytypes)
java.lang.IllegalStateException: [BUG] logical plan should not have output of char/varchar type: LogicalRDD [name#9], false

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:187)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
.....
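
For what it's worth, the assertion is Spark's own doing: since char/varchar support landed in Spark 3.1, the analyzer erases these types to StringType before any plan is analyzed, so the length parameter never reaches a connector. A sketch of that erasure using CharVarcharUtils (an internal Spark API, shown for illustration only):

// Illustrative only: CharVarcharUtils is internal to Spark and may change.
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types._

val declared = StructType(Seq(StructField("name", VarcharType(20), true)))

// Spark applies this replacement to every schema before analysis, which is why
// a LogicalRDD whose output still carries VarcharType trips the [BUG] check above.
CharVarcharUtils.replaceCharVarcharWithString(declared)
// expected: StructType(StructField(name,StringType,true))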

vishalkarve15 · May 22 '24