overwatch
[Bug] - VerifyMinimumSchema
Unfortunately, there's a bug in verifyMinimumSchema when verifying struct<array<struct<string, struct>>>. An example field is jobs_snapshot_bronze.settings.job_clusters. Below is a test case.
The issue is that the fields at the bottom layer of the array are each being wrapped in an array: job_cluster_key and new_cluster should be string and struct respectively, but both come back as arrays. Furthermore, logic is missing to parse the prefix properly. The fields of an array's elements cannot be addressed as scalars via settings.job_clusters.job_cluster_key (that path yields an array of the field's values); the array would first need to be exploded and re-wrapped.
Some changes are needed to the nested-array logic to fully support these cases.
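// Assumes a spark-shell or notebook session where `spark` is in scope and
// Overwatch's verifyMinimumSchema implicit is imported.
import org.apache.spark.sql.functions.{array, lit, struct}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// minimumJobClustersSchema and minimumNewClusterSchema are assumed to be
// defined elsewhere in Overwatch's schema definitions (not shown here).
val jobSnapMinimumSchema: StructType = StructType(Seq(
  StructField("settings", StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("existing_cluster_id", StringType, nullable = true),
    StructField("job_clusters", minimumJobClustersSchema, nullable = true),
    StructField("new_cluster", minimumNewClusterSchema, nullable = true)
  )), nullable = true),
  StructField("organization_id", StringType, nullable = false)
))

// Input: settings.job_clusters is array<struct<job_cluster_key, new_cluster>>
val df = Seq("123").toDF("organization_id")
  .withColumn("settings", struct(
    array(
      struct(
        lit("my_cluster_name").alias("job_cluster_key"),
        struct(
          lit("my_new_cluster_name").cast("string").alias("cluster_name")
        ).alias("new_cluster")
      )
    ).alias("job_clusters")
  ))

// Bug: job_cluster_key and new_cluster each come back wrapped in an array
val validatedDF = df.verifyMinimumSchema(jobSnapMinimumSchema)
validatedDF.printSchema()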
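To illustrate the "re-wrap" idea described above, here is a minimal sketch, not the Overwatch fix. It assumes Spark 3.0+ and swaps the explode-and-collect approach for the `transform` higher-order function, which rebuilds each array element in place. The object name `NestedArrayFieldSketch` and the helper `rewrapJobClusters` are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct, transform}

object NestedArrayFieldSketch {

  // Hypothetical helper (not part of Overwatch): rebuilds settings.job_clusters
  // element by element so each element field keeps its own type.
  // Note: this replaces `settings` wholesale, so any other fields under
  // `settings` would need to be carried over explicitly.
  def rewrapJobClusters(df: DataFrame): DataFrame =
    df.withColumn("settings", struct(
      transform(col("settings.job_clusters"), jc => struct(
        jc.getField("job_cluster_key").cast("string").alias("job_cluster_key"),
        jc.getField("new_cluster").alias("new_cluster")
      )).alias("job_clusters")
    ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-array-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the test case in this issue
    val df = Seq("123").toDF("organization_id")
      .withColumn("settings", struct(
        array(
          struct(
            lit("my_cluster_name").alias("job_cluster_key"),
            struct(lit("my_new_cluster_name").alias("cluster_name")).alias("new_cluster")
          )
        ).alias("job_clusters")
      ))

    // A dot path through the array returns an array of the field's values,
    // which is how element fields end up wrapped in arrays.
    df.select(col("settings.job_clusters.job_cluster_key")).printSchema()

    // Rebuilding per element keeps job_cluster_key a string and
    // new_cluster a struct inside the array.
    rewrapJobClusters(df).printSchema()

    spark.stop()
  }
}
```

A schema validator would need to apply the same pattern recursively, walking the required StructType and switching to an element-wise rebuild whenever it meets an ArrayType of structs, instead of building column paths that pass straight through the array.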
To get around this for the next release, OW has removed the validation requirements for job_clusters. If no jobs with job_clusters exist in a customer workspace (i.e. every job uses an existing cluster), this will cause a pipeline failure. At least one job should be created with a job_cluster to avoid this issue while this bug is being resolved.
@souravbaner-da, what does the comment here imply now that this issue is resolved?
https://github.com/databrickslabs/overwatch/blob/9eb4a3c2c0f3e81ae2d1b1307f9d553562f17a16/src/main/scala/com/databricks/labs/overwatch/pipeline/WorkflowsTransforms.scala#L1055
There are other comments in the 0810_release branch that say "TODO . . . after 503" or similar in these files:
WorkflowsTransforms.scala
SilverTransforms.scala
GoldTransforms.scala