
Error in creating ArrayType cols


Expected Behavior

When creating a column of a composite type such as an array of integers, whether from an existing schema or defined directly, the expected behaviour is that the column is generated in the same manner as a combination of several integer columns, rather than an error being thrown.
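
For context, a minimal sketch of the kind of call this expectation describes (the column name scores and the value bounds are illustrative, not from this report; spark is the ambient Spark session, as in the reproduction below):

import dbldatagen as dg
from pyspark.sql.types import ArrayType, IntegerType

# hypothetical sketch: an array-of-integers column is expected to generate
# element values just as separate IntegerType columns would
df_expected = (dg.DataGenerator(spark, name="array_expectation", rows=1000)
               .withColumn("scores", ArrayType(IntegerType()), minValue=0, maxValue=10)
               .build())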

Current Behavior

Error thrown: AnalysisException: cannot resolve '(id + CAST(0 AS BIGINT))' due to data type mismatch: cannot cast bigint to array;

Steps to Reproduce (for bugs)

import dbldatagen as dg
from pyspark.sql.types import IntegerType, FloatType, StringType, ArrayType

column_count = 10
data_rows = 1000 * 1000

df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=4)
           .withIdOutput()
           .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
           .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
           # this ArrayType column triggers the AnalysisException above
           .withColumn("a", ArrayType(StringType()))
           )

df = df_spec.build()
display(df)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

danielm-db avatar Jun 28 '22 10:06 danielm-db

Thanks for your feedback - I'll review the above and look into it.

ronanstokes-db avatar Jul 14 '22 16:07 ronanstokes-db

I'll add a fix for this - for now, there are several workarounds:

df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=spark.sparkContext.defaultParallelism)
           # generate an array with the same data definition for each element
           .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count, structType="array")

           # alternatively, manually assemble the array contents
           .withColumn("a", ArrayType(StringType()), expr="array('one', 'two', 'three')")

           # alternatively, use intermediate columns for the elements
           .withColumn("code1", StringType(), values=['a', 'b', 'c'])
           .withColumn("code2", StringType(), values=['a', 'b', 'c'], random=True)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
           .withColumn("a2", ArrayType(StringType()), expr="array(code1, code2, code3)")
           )
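
For completeness, a minimal sketch of building and inspecting the result of the spec above (printSchema and show are standard Spark DataFrame calls; the expected array types are an assumption based on the workarounds):

# build the DataFrame from the workaround spec and inspect the schema;
# assumption: "r" comes out as array<float>, "a" and "a2" as array<string>
df = df_spec.build()
df.printSchema()
df.show(5, truncate=False)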

ronanstokes-db avatar Oct 14 '22 21:10 ronanstokes-db

Fixed - but array-valued columns must specify an expr attribute in order to generate values.
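
As an illustration of the fixed behaviour, a minimal sketch (the column name tags and the literal values are illustrative, not from this thread):

# after the fix, an ArrayType column still needs an explicit expr to produce values
df = (dg.DataGenerator(spark, name="array_example", rows=1000)
      .withColumn("tags", ArrayType(StringType()), expr="array('x', 'y', 'z')")
      .build())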

ronanstokes-db avatar Feb 28 '23 06:02 ronanstokes-db