How to set template and min/max values for a nested schema attribute
Expected Behavior
I have a nested schema for the data set and want to set value template patterns for the attributes bankAcctId, bankProduct, storeGroup, association, merchantId, and terminalId using withColumnSpec to generate synthetic data.
from pyspark.sql.types import StringType, StructField, StructType

my_schema = StructType(
    [
        StructField(
            "bank",
            StructType(
                [
                    StructField("bankAcctId", StringType()),
                    StructField("bankProduct", StringType()),
                ]
            ),
        ),
        StructField(
            "merchDetails",
            StructType(
                [
                    StructField("storeGroup", StringType()),
                    StructField("association", StringType()),
                    StructField("merchantId", StringType()),
                    StructField(
                        "terminal",
                        StructType(
                            [
                                StructField("terminalId", StringType()),
                                StructField("cardholderActivatedTerm", StringType()),
                                StructField(
                                    "posInteractionTerminalEntryMode", StringType()
                                ),
                            ]
                        ),
                    ),
                ]
            ),
        ),
    ]
)
I tried the following code snippet to build the synthetic data:
import dbldatagen as dg

# Start from the nested schema defined above
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4)
    .withIdOutput()
    .withSchema(my_schema)
)
# Attempt to attach templates to nested fields by dotted path (this is what fails)
testDataSpec = (
    testDataSpec.withColumnSpec("bank.bankAcctId", template=r"\\n-\\n")
    .withColumnSpec("merchDetails.storeGroup", template=r"\\n-\\n")
)
dfTestData = testDataSpec.build()
The code failed with the following error:
dbldatagen.utils.DataGenError: DataGenError(msg=' column `bank.bankAcctId` must refer to defined column', baseException=None)
I am looking for some direction or an example of how to do this.
Your Environment
Running it on a Mac M1 Pro (macOS Ventura 13.5)
dbldatagen version used: 0.3.5
Hi
The way to specify how data is generated for nested structures is to create temporary fields, generate values for them, and then combine the generated fields into the desired structure. You can't refer to a nested field in the data generation rules at present.
See the following documentation page for more information: https://databrickslabs.github.io/dbldatagen/public_docs/generating_json_data.html#generating-complex-column-data
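As a rough illustration of the approach described above, here is a minimal sketch. It assumes that each leaf value is generated as a temporary column with omit=True and that each struct is then assembled with a named_struct expression, following the pattern on the linked documentation page. The names LAYOUT, TEMPLATES, DEFAULT_TEMPLATE, named_struct_expr, leaf_columns and build_nested_spec are hypothetical helpers made up for this example, and the digit templates are placeholders rather than the patterns you actually need. Note that it does not call withSchema; the struct types are only looked up from the schema you pass in.

```python
# Hypothetical layout: struct field name -> temporary column name (or a nested dict).
LAYOUT = {
    "bank": {"bankAcctId": "bankAcctId", "bankProduct": "bankProduct"},
    "merchDetails": {
        "storeGroup": "storeGroup",
        "association": "association",
        "merchantId": "merchantId",
        "terminal": {
            "terminalId": "terminalId",
            "cardholderActivatedTerm": "cardholderActivatedTerm",
            "posInteractionTerminalEntryMode": "posInteractionTerminalEntryMode",
        },
    },
}

# Placeholder templates; 'd' stands for a random digit in dbldatagen templates.
DEFAULT_TEMPLATE = r"dddd-dddd"
TEMPLATES = {"merchantId": r"dddddd", "terminalId": r"ddd-ddd"}


def named_struct_expr(fields):
    """Build a Spark SQL named_struct(...) expression from a (nested) layout."""
    parts = []
    for name, source in fields.items():
        value = named_struct_expr(source) if isinstance(source, dict) else source
        parts.append(f"'{name}', {value}")
    return "named_struct(" + ", ".join(parts) + ")"


def leaf_columns(fields):
    """Yield the temporary column names used by a (nested) layout."""
    for source in fields.values():
        if isinstance(source, dict):
            yield from leaf_columns(source)
        else:
            yield source


def build_nested_spec(spark, schema, rows):
    """Sketch only: temporary leaf columns first, then one struct column per top-level field."""
    import dbldatagen as dg  # imported here so the helpers above work without Spark

    spec = dg.DataGenerator(spark, name="test_data_set1", rows=rows, partitions=4)
    for struct_name, fields in LAYOUT.items():
        leaves = list(leaf_columns(fields))
        for leaf in leaves:
            # omit=True keeps the temporary column out of the final output
            spec = spec.withColumn(
                leaf, "string", template=TEMPLATES.get(leaf, DEFAULT_TEMPLATE), omit=True
            )
        # Combine the temporary columns into the struct declared in the schema
        spec = spec.withColumn(
            struct_name,
            schema[struct_name].dataType,
            expr=named_struct_expr(fields),
            baseColumn=leaves,
        )
    return spec


print(named_struct_expr(LAYOUT["bank"]))
# named_struct('bankAcctId', bankAcctId, 'bankProduct', bankProduct)
```

With the schema and variables from the question, this would be used as dfTestData = build_nested_spec(spark, my_schema, row_count).build(). The temporary columns share their names with the struct fields only for readability; any unique column names would work.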
I'll update the documentation to provide some clearer examples of creating the data from an existing schema.