source_path_schema for silver layer?
Hi team & @ravi-databricks ,
I'm currently implementing the DLT-Meta framework at one of our customers, and I'm pretty happy with the process so far. However, I have been wondering why DLT-META enforces schemas already at the bronze layer via source_path_schema. As it stands, we have no possibility to infer the schema in the silver layer.
In my opinion, we should just store columns as strings or similar and start inferring in the silver layer.
In the databricks documentation "What is the medallion lakehouse architecture?" the following is documented:
Minimal data validation is performed in the bronze layer. To ensure against dropped data, Databricks recommends storing most fields as string, VARIANT, or binary to protect against unexpected schema changes. Metadata columns might be added, such as the provenance or source of the data (for example, _metadata.file_name ).
Why are we not aligned with this? Could it be an idea to add schema support to the silver specs? I'm happy to contribute if this is considered an improvement.
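For what it's worth, the "everything as string" idea would not even require changing the DDL files by hand. A small helper (purely hypothetical, not part of DLT-Meta) could rewrite an existing DDL column list so that every field lands in Bronze as STRING:

```python
def to_all_strings_ddl(ddl: str) -> str:
    """Rewrite a DDL column list so every column is typed STRING.

    Splits on top-level commas only, so parameterized types such as
    DECIMAL(10,2) are handled correctly.
    """
    cols, depth, start = [], 0, 0
    for i, ch in enumerate(ddl):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            cols.append(ddl[start:i])
            start = i + 1
    cols.append(ddl[start:])
    # The column name is the first whitespace-separated token of each entry.
    names = [c.strip().split()[0] for c in cols]
    return ", ".join(f"{name} STRING" for name in names)


print(to_all_strings_ddl("PRODUCT STRING, PRICE DECIMAL(10,2)"))
# PRODUCT STRING, PRICE STRING
```

The all-string DDL would feed Bronze, while the original typed DDL stays available for Silver.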
BR Peer
@PeerW The source_path_schema attribute is optional; check the demos and their onboarding config files. E.g., in the silver fan-out demo's onboarding file there is no schema provided. Please check pipeline_reader.py, where we do not enforce a schema if none is provided in the onboarding file.
@ravi-databricks
Thanks for the quick reply.
I'm aware that source_path_schema is optional in Bronze. My ask is that Silver should work the same way, i.e. allow an optional schema (or source_schema_path), but today it doesn't. Here's what the code shows:
- **No `schema` in Silver specs** – `SilverDataflowSpec` in `dataflow_spec.py` has no `schema` or `source_schema_path` field; it's only available for the append flows.
- **`self.schema_json` is only set for Bronze** in `dataflow_pipeline.py`:

  ```python
  def __initialize_dataflow_pipeline(
      self, spark, dataflow_spec, view_name, view_name_quarantine,
      custom_transform_func, next_snapshot_and_version
  ):
      # …
      if isinstance(dataflow_spec, BronzeDataflowSpec):
          # Bronze sets schema_json from its spec
          if dataflow_spec.schema is not None:
              self.schema_json = json.loads(dataflow_spec.schema)
          else:
              self.schema_json = None
      else:
          # Silver always ends up here—schema_json is never set
          self.schema_json = None
          self.next_snapshot_and_version = None
          self.appy_changes_from_snapshot = None
          self.silver_schema = None
  ```

  This shows that in the `else` branch for Silver, `self.schema_json` is set to `None`.
- **`read_silver()` never passes a schema** in `dataflow_pipeline.py`. It always does:

  ```python
  spark.readStream.table(...).selectExpr(...)
  # or
  spark.readStream.load(...).selectExpr(...)
  ```

  without invoking `PipelineReaders(..., schema_json=...)`.
- **`PipelineReaders` can apply a schema, but only if provided** – its `read_dlt_cloud_files()` and `read_kafka()` methods use `self.schema_json`, but since Silver never supplies one, the schema is always inferred from the existing tables.
Proposal:

- Add a `schema: str` field to `SilverDataflowSpec`.
- In `DataflowPipeline.__initialize_dataflow_pipeline`, load it into `self.schema_json` for Silver.
- In `read_silver()`, pass `self.schema_json` into `PipelineReaders` (just like Bronze's `read_bronze`), so Silver readers can optionally enforce an external DDL/JSON schema.
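To make the proposal concrete, here is a minimal sketch of how Silver could mirror the existing Bronze behaviour. The dataclass below is a trimmed-down stand-in, not the real `SilverDataflowSpec` (which has many more fields and no `schema` field today):

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SilverDataflowSpec:
    """Trimmed-down stand-in for the real spec in dataflow_spec.py."""
    schema: Optional[str] = None  # proposed: JSON-encoded schema, like Bronze


def initialize_schema_json(dataflow_spec: SilverDataflowSpec) -> Optional[dict]:
    """Mirror of Bronze's __initialize_dataflow_pipeline logic:
    decode the spec's schema when one is provided, else leave it None."""
    if dataflow_spec.schema is not None:
        return json.loads(dataflow_spec.schema)
    return None


# With a schema the JSON is decoded; without one, behaviour is unchanged.
typed = SilverDataflowSpec(schema='{"type": "struct", "fields": []}')
print(initialize_schema_json(typed))                 # {'type': 'struct', 'fields': []}
print(initialize_schema_json(SilverDataflowSpec()))  # None
```

Since the field defaults to `None`, existing onboarding files would keep working unchanged.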
I'm happy to try to submit a PR for these changes (I'm not experienced in contributing to open source). This would let Silver layer specs optionally define and enforce a schema, aligning with the flexibility Bronze already has. IMO it would allow us to better follow the referenced medallion architecture: read Bronze as strings and infer or provide a DDL schema in the Silver layer.
What are your thoughts?
@PeerW For silver, the schema is derived from your silver_transformations and the transformation functions you provide. In dlt-meta, the bronze and silver layers are tied together (one entry in the onboarding JSON), so once we have created the bronze table we already know the schema and all the source details; in order to read the bronze Delta table, why would you provide a schema? If you want to read everything as strings, you can do so via your sql_transformations.sql file.
If you just provide a source schema without any SQL transformations, and write to the silver table directly via apply_changes, that scenario would logically be considered a bronze layer.
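To illustrate the point that typing can live in the silver transformations: a hypothetical helper (not part of DLT-Meta) could turn the typed DDL column list into the CAST select expressions one would put into the silver transformations file:

```python
def casts_from_ddl(ddl: str) -> list[str]:
    """Build Silver SELECT expressions that cast all-string Bronze
    columns to their target types, given a DDL column list.

    Splits on top-level commas only, so DECIMAL(10,2) stays intact.
    """
    cols, depth, start = [], 0, 0
    for i, ch in enumerate(ddl):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            cols.append(ddl[start:i])
            start = i + 1
    cols.append(ddl[start:])
    exprs = []
    for col in cols:
        name, _, typ = col.strip().partition(" ")
        exprs.append(f"CAST({name} AS {typ.strip()}) AS {name}")
    return exprs


print(casts_from_ddl("PRODUCT STRING, PRICE DECIMAL(10,2)"))
# ['CAST(PRODUCT AS STRING) AS PRODUCT', 'CAST(PRICE AS DECIMAL(10,2)) AS PRICE']
```

This keeps the typing logic in one generated artifact rather than hand-maintained SQL.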
You can provide PR with unit tests + integration tests and we will take a look if it make sense.
@ravi-databricks Thanks again for the quick reply. Maybe I'm on the wrong path here, so let me try to explain my use case in more detail.
In my onboarding JSON (example below), I already supply a source_schema_path pointing to a DDL file. I would like the Bronze layer to load every column as STRING, i.e. treat the incoming files as raw as possible and not drop or cast anything based on the DDL. Then, in the Silver layer, I would like to run SELECT * and apply a second DDL schema file (with the actual data types) to the new Silver Delta table.
For me this gives:

- Better traceability between layers: I never recast columns or drop rows in Bronze; everything that lands there is stored as strings.
- Schema evolution: schema changes (new fields, type promotions) don't break Bronze ingestion; they're caught and handled downstream in Silver.
- Easier debugging: I can always go back to Bronze to see the raw form of the table as strings.
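The "handled downstream in Silver" part boils down to casts that degrade gracefully instead of failing ingestion. A toy illustration in plain Python (not DLT-Meta code):

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def safe_decimal(raw: Optional[str]) -> Optional[Decimal]:
    """Silver-side cast of an all-string Bronze value: a malformed
    value becomes None (and could be quarantined by a DQ rule)
    instead of breaking Bronze ingestion, which keeps the raw string."""
    if raw is None:
        return None
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None


print(safe_decimal("12.50"))  # 12.50
print(safe_decimal("oops"))   # None
```

In Spark SQL, `TRY_CAST` gives the same null-on-failure behaviour.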
I hope this clarifies my use case and thinking. My main issue with the current flow is that I NEED to provide the source_path_schema already in Bronze and cannot provide it for the first time in Silver.
Below is one of my onboarding JSON files (silver selectExp=["*"]):
```json
{
  "data_flow_id": "169",
  "data_flow_group": "sap4hana",
  "source_system": "sap4hana",
  "source_format": "cloudFiles",
  "source_details": {
    "source_database": "sap4hana",
    "source_table": "zc_bide_product",
    "source_path_dev": "/Volumes/dataplatform01_dev/raw_sap4hana/rawdata/masterdata/ZC_BIDE_PRODUCT/",
    "source_metadata": {
      "include_autoloader_metadata_column": "True",
      "autoloader_metadata_col_name": "source_metadata",
      "select_metadata_cols": {
        "input_file_name": "_metadata.file_name",
        "input_file_path": "_metadata.file_path"
      }
    },
    "source_schema_path": "abfss://[email protected]/dlt_meta/source_systems/sap4hana/schema/schema_ZC_BIDE_PRODUCT.ddl"
  },
  "bronze_database_dev": "dataplatform01_dev.raw_sap4hana",
  "bronze_table": "zc_bide_product",
  "bronze_cluster_by": [
    "PRODUCT"
  ],
  "bronze_reader_options": {
    "cloudFiles.format": "parquet",
    "cloudFiles.rescuedDataColumn": "_rescued_data",
    "cloudFiles.schemaEvolutionMode": "rescue",
    "readerCaseSensitive": "false"
  },
  "bronze_table_properties": {
    "pipelines.autoOptimize.managed": "true",
    "pipelines.reset.allowed": "false",
    "delta.dataSkippingNumIndexedCols": "-1"
  },
  "bronze_data_quality_expectations_json_dev": "dbfs:/dlt_meta/dltmeta_conf/dqe/sap4hana/bronze_data_quality_expectations_zc_bide_product.json",
  "bronze_database_quarantine_dev": "dataplatform01_dev.dlt_bronze_quarantine",
  "bronze_quarantine_table": "zc_bide_product_quarantine",
  "bronze_quarantine_table_properties": {
    "pipelines.autoOptimize.managed": "true",
    "pipelines.reset.allowed": "false",
    "delta.dataSkippingNumIndexedCols": "-1"
  },
  "silver_database_dev": "dataplatform01_dev.raw_sap4hana",
  "silver_table": "zc_bide_product",
  "silver_cluster_by": [
    "PRODUCT"
  ],
  "silver_cdc_apply_changes": {
    "keys": [
      "PRODUCT"
    ],
    "sequence_by": "__timestamp",
    "scd_type": "1",
    "apply_as_deletes": "false",
    "except_column_list": [],
    "flow_name": "silver_sap4hana_zc_bide_product_cdc_apply_changes"
  },
  "silver_transformation_json_dev": "dbfs:/dlt_meta/dltmeta_conf/silver_transformations.json",
  "silver_table_properties": {
    "pipelines.autoOptimize.managed": "true",
    "pipelines.reset.allowed": "true",
    "delta.dataSkippingNumIndexedCols": "-1"
  },
  "silver_data_quality_expectations_json_dev": "dbfs:/dlt_meta/dltmeta_conf/dqe/sap4hana/silver_data_quality_expectations_zc_bide_product.json"
}
```
We developed dlt-meta based on working with many customers, which helped shape the core requirements. While the bronze layer is typically considered raw, most customers prefer to apply basic schema parsing and data quality (DQ) rules at that stage.
Silver, by definition, handles enrichment—this is expressed as SQL SELECT expressions, meaning the silver schema is essentially a set of enrichments.
The pattern you mentioned isn't directly supported, but it can be achieved by defining two bronze layers:

1. Raw Bronze – purely raw ingestion with no schema.
2. Processed Bronze – applies the schema and uses apply_changes.
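Very roughly, and purely as a hypothetical illustration (the exact fields for chaining a processed bronze off a raw bronze should be checked against the DLT-Meta onboarding reference), the two-layer pattern could look like two onboarding entries:

```json
[
  {
    "data_flow_id": "169_raw",
    "source_format": "cloudFiles",
    "source_details": { "source_path_dev": "/Volumes/.../ZC_BIDE_PRODUCT/" },
    "bronze_table": "zc_bide_product_raw"
  },
  {
    "data_flow_id": "169_typed",
    "source_format": "delta",
    "source_details": { "source_schema_path": "abfss://.../schema_ZC_BIDE_PRODUCT.ddl" },
    "bronze_table": "zc_bide_product"
  }
]
```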
Feel free to submit a PR if you’d like to see this scenario supported more explicitly—we’d be happy to review it.
Thanks Ravi, I will try to submit something as soon as I find the time.