spark-redshift
java.sql.Types.REAL always cast to java.lang.Double
I have a table in Redshift:

CREATE TABLE ww_mana_account (
    grind_value FLOAT4,
    prize_value FLOAT4,
    buy_value FLOAT4
);
All versions of the Redshift JDBC driver (JDBC 4, 4.1, and 4.2) return getMetaData().getColumnType(..) = 7 (java.sql.Types.REAL), and getMetaData().getColumnTypeName(..) returns "float4".
But spark-redshift casts it to java.lang.Double.
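For illustration, this is roughly how I load the table and observe the wrong type (a sketch; jdbcURL and the tempdir bucket are placeholders for my actual settings):

df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcURL) \
    .option("dbtable", "ww_mana_account") \
    .option("tempdir", "s3n://my-bucket/tmp") \
    .load()
df.printSchema()
# grind_value, prize_value and buy_value are all reported as double instead of float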
According to http://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html:
Name                               | Storage | Range
REAL or FLOAT4                     | 4 bytes | 6 significant digits of precision
DOUBLE PRECISION, FLOAT8, or FLOAT | 8 bytes | 15 significant digits of precision
To convert correctly to the Parquet format, we need a float (4 bytes).
I temporarily fixed this problem in my PySpark script with the following code:
from pyspark.sql.types import StructType, StructField, FloatType

def modifySchema(df):
    # Read the column types straight from the JDBC driver's ResultSetMetaData,
    # bypassing the REAL -> DoubleType mapping applied by spark-redshift.
    props = sc._jvm.java.util.Properties()
    jdbcUtils = sc._jvm.org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
    rsmd = jdbcUtils.createConnectionFactory(jdbcURL, props) \
        .apply() \
        .prepareStatement("select * from ww_mana_account where 1 = 1") \
        .executeQuery() \
        .getMetaData()
    oldSchema = df.schema
    modifiedSchema = StructType()
    for i in range(1, rsmd.getColumnCount() + 1):
        curField = oldSchema[i - 1]
        if rsmd.getColumnTypeName(i) == 'float4':
            # The driver reports float4, so replace the inferred DoubleType with FloatType
            modifiedSchema.add(StructField(curField.name, FloatType(), curField.nullable))
        else:
            modifiedSchema.add(curField)
    return modifiedSchema
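For reference, this is roughly how I apply it before writing Parquet (a sketch; the output path is a placeholder):

fixed_df = spark.createDataFrame(df.rdd, modifySchema(df))  # rebuild the DataFrame with FloatType columns
fixed_df.write.parquet("s3n://my-bucket/ww_mana_account")   # Parquet now stores 4-byte floats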
I face the same issue with Spark version 3.1.3, but with a different database. Float4 (REAL) is always cast to DoubleType, which causes issues. Any chance this gets fixed?
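A simpler post-load workaround that may help on newer Spark versions is to cast the affected columns explicitly (a minimal sketch, assuming you already know which columns are float4 in the source table; the column list here is hypothetical):

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

float4_columns = ["grind_value", "prize_value", "buy_value"]  # hypothetical list of float4 columns
for name in float4_columns:
    # Downcast the DoubleType column produced by the connector back to FloatType
    df = df.withColumn(name, col(name).cast(FloatType()))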