spark-bigquery-connector
SchemaConverters.toBigQueryType should return BigQuery TimestampType for a given spark TimestampType instead of IntType
Hi,
I can't figure out why we get INTEGER instead of LegacySQLTypeName.TIMESTAMP when converting a Spark schema into a BigQuery schema: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/spark/bigquery/SchemaConverters.java#L418
Observed behavior while using the DataSources from spark-bigquery-connector (sketched below):
- Writing a DataFrame with a java.sql.Timestamp column correctly produces a TIMESTAMP column in the BigQuery table.
- Reading a BigQuery table with a TIMESTAMP column correctly yields a java.sql.Timestamp column in the DataFrame.
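For context, a minimal sketch of those two paths (the table and bucket names are hypothetical):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Write path: the java.sql.Timestamp column lands as TIMESTAMP in BigQuery.
val df = Seq(Timestamp.valueOf("2020-01-01 00:00:00")).toDF("timestamp")
df.write.format("bigquery")
  .option("table", "my_dataset.my_table")        // hypothetical table
  .option("temporaryGcsBucket", "my-tmp-bucket") // hypothetical bucket
  .save()

// Read path: the TIMESTAMP column comes back as Spark TimestampType.
spark.read.format("bigquery")
  .option("table", "my_dataset.my_table")
  .load()
  .printSchema() // |-- timestamp: timestamp (nullable = true)
```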
The DataSourceWriter seems to convert the Spark schema into an Avro schema, which turns java.sql.Timestamp into the logical type timestamp-micros. This logical type is then correctly handled by BigQuery and converted into a BigQuery TIMESTAMP.
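For reference, a minimal sketch of that Avro mapping using the standard Avro Java API (nothing here is connector-specific):

```scala
import org.apache.avro.{LogicalTypes, Schema => AvroSchema}

// A long field annotated with the timestamp-micros logical type; BigQuery
// interprets such a field as TIMESTAMP when loading Avro data.
val timestampMicros: AvroSchema =
  LogicalTypes.timestampMicros().addToSchema(AvroSchema.create(AvroSchema.Type.LONG))

println(timestampMicros) // {"type":"long","logicalType":"timestamp-micros"}
```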
The DataSourceReader seems to use SchemaConverters.toSpark, which converts a BigQuery TIMESTAMP into Spark's TimestampType.
SchemaConverters.toBigQueryType doesn't behave symmetrically: it converts Spark's TimestampType into INTEGER.
Would it be possible to remove the comment at SchemaConverters.toBigQueryType#L418 and return LegacySQLTypeName.TIMESTAMP instead of LegacySQLTypeName.INTEGER, or am I missing something?
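For illustration, a Scala analogue of the proposed mapping (the actual code in SchemaConverters is Java; this sketch only shows the branch in question plus a couple of surrounding cases):

```scala
import com.google.cloud.bigquery.LegacySQLTypeName
import org.apache.spark.sql.types._

// Sketch of the proposed mapping: TimestampType -> TIMESTAMP instead of INTEGER.
def toBigQueryTypeSketch(dataType: DataType): LegacySQLTypeName = dataType match {
  case TimestampType => LegacySQLTypeName.TIMESTAMP // proposed change
  case LongType      => LegacySQLTypeName.INTEGER
  case StringType    => LegacySQLTypeName.STRING
  case other => throw new IllegalArgumentException(s"Unsupported type: $other")
}
```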
Below is a simple example illustrating that the two conversions are not symmetric.
```scala
import com.google.cloud.bigquery.{Field, LegacySQLTypeName, Schema, StandardSQLTypeName}
import com.google.cloud.spark.bigquery.SchemaConverters
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Spark -> BigQuery: TimestampType ends up as INTEGER
val structType: StructType = StructType.apply(Seq(StructField("timestamp", TimestampType)))
SchemaConverters.toBigQuerySchema(structType)

// BigQuery -> Spark: TIMESTAMP correctly maps to TimestampType
val bqSchemaLegacy = Schema.of(Field.of("timestamp", LegacySQLTypeName.TIMESTAMP))
SchemaConverters.toSpark(bqSchemaLegacy)
```
----- REPL Output -----
```
import com.google.cloud.bigquery.{Field, LegacySQLTypeName, Schema, StandardSQLTypeName}
import com.google.cloud.spark.bigquery.SchemaConverters
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}
structType: org.apache.spark.sql.types.StructType = StructType(StructField(timestamp,TimestampType,true))
res0: com.google.cloud.bigquery.Schema = Schema{fields=[Field{name=timestamp, type=INTEGER, mode=NULLABLE, description=null, policyTags=null}]}
bqSchemaLegacy: com.google.cloud.bigquery.Schema = Schema{fields=[Field{name=timestamp, type=TIMESTAMP, mode=null, description=null, policyTags=null}]}
res1: org.apache.spark.sql.types.StructType = StructType(StructField(timestamp,TimestampType,true))
```
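For completeness, a minimal round-trip check (reusing the imports above) that one would expect to pass if the two conversions were symmetric; today it fails because the field type comes back as INTEGER:

```scala
// Round-trip: Spark TimestampType -> BigQuery schema -> expect TIMESTAMP.
// Currently fails because SchemaConverters.toBigQueryType returns INTEGER.
val roundTrip = SchemaConverters.toBigQuerySchema(
  StructType(Seq(StructField("timestamp", TimestampType))))
assert(roundTrip.getFields.get(0).getType == LegacySQLTypeName.TIMESTAMP)
```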
Thanks.
Hi,
I've tried returning TIMESTAMP and it seems to work fine, but as said, maybe I'm missing a case or something else?