How to set a template and min/max values for a nested schema attribute

Open galaxy79 opened this issue 2 years ago • 1 comments

Expected Behavior

I have a nested schema for the data set and want to use withColumnSpec to set value template patterns for the attributes bankAcctId, bankProduct, storeGroup, association, merchantId and terminalId when generating the synthetic data.

from pyspark.sql.types import StringType, StructField, StructType

my_schema = StructType(
    [
        StructField(
            "bank",
            StructType(
                [
                    StructField("bankAcctId", StringType()),
                    StructField("bankProduct", StringType()),
                ]
            ),
        ),
        StructField(
            "merchDetails",
            StructType(
                [
                    StructField("storeGroup", StringType()),
                    StructField("association", StringType()),
                    StructField("merchantId", StringType()),
                    StructField(
                        "terminal",
                        StructType(
                            [
                                StructField("terminalId", StringType()),
                                StructField("cardholderActivatedTerm", StringType()),
                                StructField(
                                    "posInteractionTerminalEntryMode", StringType()
                                ),
                            ]
                        ),
                    ),
                ]
            ),
        ),
    ]
)

I tried the code snippet below to build the synthetic data:

import dbldatagen as dg

testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4)
    .withIdOutput()
    .withSchema(my_schema)
)

testDataSpec = (
    testDataSpec.withColumnSpec("bank.bankAcctId", template=r"\\n-\\n")
    .withColumnSpec("merchDetails.storeGroup", template=r"\\n-\\n")
)
dfTestData = testDataSpec.build()

The code failed with the following error:

dbldatagen.utils.DataGenError: DataGenError(msg=' column `bank.bankAcctId` must refer to defined column', baseException=None)

I am looking for some direction or an example of how to use it.

Your Environment

Running it on a Mac M1 Pro (macOS Ventura 13.5)

  • dbldatagen version used: 0.3.5

galaxy79 avatar Aug 25 '23 23:08 galaxy79

Hi

Nested fields can't be referenced directly in the data generation rules at present. To control how the data is generated for a nested structure, create temporary fields, generate the values for them, and then combine the generated fields into the desired structure.

See the following documentation page for more information: https://databrickslabs.github.io/dbldatagen/public_docs/generating_json_data.html#generating-complex-column-data

I'll update the documentation to provide some clearer examples of creating the data using an existing schema.
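In the meantime, a rough sketch of that approach for the schema in the question might look like the following. It is untested against 0.3.5 and makes several assumptions: `named_struct_expr` and `build_nested_spec` are hypothetical helper names, the template string is copied unchanged from the question, every leaf field reuses that same template purely for illustration, and the struct columns are added with `withColumn` (not `withSchema` plus `withColumnSpec`) using Spark SQL's `named_struct` in the `expr` option.

```python
def named_struct_expr(fields):
    """Return a Spark SQL named_struct(...) expression.

    `fields` maps each struct field name to the SQL expression (typically a
    flat helper column) that supplies its value.
    """
    pairs = ", ".join(f"'{name}', {source}" for name, source in fields.items())
    return f"named_struct({pairs})"


# Template copied unchanged from the question; replace with the pattern you need.
TEMPLATE = r"\\n-\\n"

BANK_FIELDS = ["bankAcctId", "bankProduct"]
MERCH_FIELDS = ["storeGroup", "association", "merchantId"]
TERMINAL_FIELDS = [
    "terminalId",
    "cardholderActivatedTerm",
    "posInteractionTerminalEntryMode",
]


def build_nested_spec(spark, my_schema, row_count):
    """Hypothetical helper: flat omitted columns first, then the two structs."""
    # Imported lazily so named_struct_expr stays usable without Spark installed.
    import dbldatagen as dg

    spec = dg.DataGenerator(
        spark, name="test_data_set1", rows=row_count, partitions=4
    ).withIdOutput()

    # Temporary flat columns; omit=True keeps them out of the final output.
    for name in BANK_FIELDS + MERCH_FIELDS + TERMINAL_FIELDS:
        spec = spec.withColumn(name, "string", template=TEMPLATE, omit=True)

    # Assemble the nested structures from the flat columns.
    bank = named_struct_expr({f: f for f in BANK_FIELDS})
    terminal = named_struct_expr({f: f for f in TERMINAL_FIELDS})
    merch = named_struct_expr({**{f: f for f in MERCH_FIELDS}, "terminal": terminal})

    return (
        spec
        .withColumn("bank", my_schema["bank"].dataType,
                    expr=bank, baseColumn=BANK_FIELDS)
        .withColumn("merchDetails", my_schema["merchDetails"].dataType,
                    expr=merch, baseColumn=MERCH_FIELDS + TERMINAL_FIELDS)
    )
```

With a Spark session and the schema from the question in scope, `dfTestData = build_nested_spec(spark, my_schema, row_count).build()` would then produce the data frame. The per-field options (templates, min/max values and so on) go on the flat helper columns, since those are the only columns the generation rules can refer to.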

ronanstokes-db avatar Sep 08 '23 17:09 ronanstokes-db