
java.sql.Types.REAL always cast to java.lang.Double

Open ebirukov opened this issue 7 years ago • 2 comments

I have a table in Redshift:

CREATE TABLE ww_mana_account (
    grind_value FLOAT4,
    prize_value FLOAT4,
    buy_value FLOAT4
);

All versions of the Redshift JDBC driver (JDBC 4, 4.1, 4.2) return getMetaData().getColumnType(..) = 7 (REAL), and getMetaData().getColumnTypeName(..) returns "float4".

But spark-redshift casts it to java.lang.Double.

According to http://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html:

Name | Storage | Range
REAL or FLOAT4 | 4 bytes | 6 significant digits of precision
DOUBLE PRECISION, FLOAT8, or FLOAT | 8 bytes | 15 significant digits of precision

To convert to Parquet format correctly, we need a float (4 bytes), not a double (8 bytes).
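The difference is easy to see in plain Python (a minimal illustration of the storage and precision gap described above, independent of Spark):

```python
import struct

def as_float4(x):
    # Round-trip a Python float (a double) through a 4-byte IEEE 754
    # single, which is how FLOAT4/REAL stores the value.
    return struct.unpack('<f', struct.pack('<f', x))[0]

v = 0.1234567890123456
print(len(struct.pack('<f', v)))  # 4 bytes, as FLOAT4
print(len(struct.pack('<d', v)))  # 8 bytes, as DOUBLE PRECISION
print(as_float4(v))               # agrees with v only to ~7 significant digits
```

So widening float4 to double does not lose data, but it doubles the storage per value in the Parquet output and misrepresents the column's actual precision.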

ebirukov avatar Aug 14 '17 14:08 ebirukov

I temporarily fixed this problem in my PySpark script by writing the following code:

from pyspark.sql.types import StructType, StructField, FloatType

def modifySchema(df):
    # sc and jdbcURL are assumed to be defined earlier in the script.
    props = sc._jvm.java.util.Properties()
    jdbcUtils = sc._jvm.org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
    # Ask the JDBC driver directly for the column metadata, since
    # spark-redshift has already widened the float4 columns to double.
    rsmd = jdbcUtils.createConnectionFactory(jdbcURL, props) \
                    .apply() \
                    .prepareStatement("select * from ww_mana_account where 1 = 1") \
                    .executeQuery() \
                    .getMetaData()
    oldSchema = df.schema
    modifiedSchema = StructType()
    # JDBC metadata columns are 1-indexed
    for i in range(1, rsmd.getColumnCount() + 1):
        curField = oldSchema[i - 1]
        if rsmd.getColumnTypeName(i) == 'float4':
            # Restore the true 4-byte type for float4 columns
            modifiedSchema.add(StructField(curField.name, FloatType(), curField.nullable))
        else:
            modifiedSchema.add(curField)
    return modifiedSchema

ebirukov avatar Aug 16 '17 05:08 ebirukov

I face the same issue with Spark version 3.1.3, but a different database. Float4 (REAL) is always cast to DoubleType, which causes issues. Any chance that this gets fixed?

fgropengieser avatar Jun 20 '22 12:06 fgropengieser