Error in creating ArrayType cols
Expected Behavior
When creating a column of a composite type such as an array of integers (whether from an existing schema or defined anew), the expected behaviour is that the column is generated in the same manner as a combination of many integer columns, without throwing an error.
Current Behavior
Error thrown: AnalysisException: cannot resolve '(id + CAST(0 AS BIGINT))' due to data type mismatch: cannot cast bigint to array;
Steps to Reproduce (for bugs)
import dbldatagen as dg
from pyspark.sql.types import ArrayType, FloatType, IntegerType, StringType
column_count = 10
data_rows = 1000 * 1000
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
partitions=4)
.withIdOutput()
.withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
numColumns=column_count)
.withColumn("code1", IntegerType(), minValue=100, maxValue=200)
.withColumn("code2", IntegerType(), minValue=0, maxValue=10)
.withColumn("code3", StringType(), values=['a', 'b', 'c'])
.withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
.withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
.withColumn("a", ArrayType(StringType()))
)
df = df_spec.build()
display(df)
Context
Your Environment
- dbldatagen version used:
- Databricks Runtime version:
- Cloud environment used:
Thanks for your feedback - I'll review the above and look into it.
I'll add a fix for this - for now, there are several workarounds:
import dbldatagen as dg
from pyspark.sql.types import ArrayType, FloatType, StringType

df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=spark.sparkContext.defaultParallelism)
           # generate an array with the same data definition for each element
           .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count, structType="array")
           # alternatively, assemble the array contents manually
           .withColumn("a", ArrayType(StringType()), expr="array('one', 'two', 'three')")
           # alternatively, use intermediate columns for the elements
           .withColumn("code1", StringType(), values=['a', 'b', 'c'])
           .withColumn("code2", StringType(), values=['a', 'b', 'c'], random=True)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
           .withColumn("a2", ArrayType(StringType()), expr="array(code1, code2, code3)")
           )
df = df_spec.build()
Fixed - but array-valued columns must have an expr attribute in order to get a value.