spark-bigquery-connector
Allow preserving column type parameters
I would like the ability to load parquet data through this connector into BigQuery tables whose columns use parameterized data types (STRING(n), BYTES(n), NUMERIC(p, s)), without wiping the parameters out. I tried SaveMode.Append, which resulted in the WRITE_APPEND writeDisposition, but the type parameters were still removed after loading.
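For reference, here is a minimal sketch of the write I mean (the DataFrame, dataset, table, and bucket names are placeholders):

import org.apache.spark.sql.SaveMode

// Indirect write: stage the DataFrame as parquet in GCS, then load it into
// an existing BigQuery table whose columns have parameterized types.
df.write
  .format("bigquery")
  .option("writeMethod", "indirect")
  .option("intermediateFormat", "parquet")
  .option("temporaryGcsBucket", "my-temp-bucket") // placeholder bucket
  .mode(SaveMode.Append)                          // maps to WRITE_APPEND
  .save("my_dataset.my_table")                    // placeholder table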
Bump, this issue still applies, would love some help from the maintainers!
Fixed for numeric types in version 0.31.1
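Numeric works because Spark's DecimalType carries its parameters in the schema itself, so the connector has the information it needs. A rough sketch:

import org.apache.spark.sql.types._

// DecimalType keeps precision and scale in the Spark schema, so the
// connector can preserve them when mapping to BigQuery NUMERIC.
val schema = StructType(Seq(StructField("amount", DecimalType(10, 2), nullable = true)))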
Reading into a Spark DataFrame loses the type parameters for string and bytes, because Spark's StringType and BinaryType carry no length parameter in a DataFrame schema. This cannot be supported.
Tried this:
scala> spark.sql("create table temp(name varchar(20)) using parquet")
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
24/05/22 08:21:58 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc formatted temp").show(100, truncate=false)
+----------------------------+--------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------+-------+
|name |varchar(20) |NULL |
| | | |
|# Detailed Table Information| | |
|Catalog |spark_catalog | |
|Database |default | |
|Table |temp | |
|Owner |vkarve_google_com | |
|Created Time |Wed May 22 08:21:58 UTC 2024 | |
|Last Access |UNKNOWN | |
|Created By |Spark 3.5.0 | |
|Type |MANAGED | |
|Provider |parquet | |
|Location |hdfs://vkarve-iceberg-22-m/user/hive/warehouse/temp | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
+----------------------------+--------------------------------------------------------------+-------+
scala> spark.sql("select * from temp").printSchema
root
|-- name: string (nullable = true)
scala> spark.sql("select * from temp").schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true))
Trying to force Spark to create such a DataFrame doesn't work either.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val mytypes = StructType(Seq(StructField("name", VarcharType(20), true)))
mytypes: org.apache.spark.sql.types.StructType = StructType(StructField(name,VarcharType(20),true))
scala> val df = spark.sql("select * from temp")
df: org.apache.spark.sql.DataFrame = [name: string]

scala> spark.createDataFrame(df.rdd, mytypes)
java.lang.IllegalStateException: [BUG] logical plan should not have output of char/varchar type: LogicalRDD [name#9], false
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:187)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
.....
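The error above is Spark's analyzer rejecting char/varchar in a plan's output. Internally, Spark 3.x represents such columns as plain StringType and records the original type under a metadata key; a rough sketch of that representation (the __CHAR_VARCHAR_TYPE_STRING key is an internal implementation detail, not a public API):

import org.apache.spark.sql.types._

// Rough sketch: the varchar(20) only survives as field metadata; the data
// type the connector sees is plain StringType, with no length parameter.
val meta = new MetadataBuilder()
  .putString("__CHAR_VARCHAR_TYPE_STRING", "varchar(20)")
  .build()
val mytypes = StructType(Seq(StructField("name", StringType, nullable = true, meta)))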