feat: `Dataset.write()`
Users of `dlt.Dataset` want a simple way to write data back to the dataset.
Use cases:
- manually reviewing data and pushing corrected records
- a simple way to add records if you don't have access to the original `dlt.Pipeline` used to create the dataset
Other motivations
This interface will simplify data-centric operations involved in:
- storing data quality check results on the destination
- creating a graph of datasets where the "internal pipeline" of a dataset is used
- integrating with orchestration frameworks
Specs
- Look at `WritableDataset.save()` from `dlt-plus`
- Add `Dataset.write()` in `dlt` (this aligns with the `pipeline.run()` operation)
  - Alternatives: `.write_to()`, `.load_into()`, `.load_table()`
- Create an internal `dlt.Pipeline` named `_dlt_dataset_{dataset_name}`
- Find a way for the internal pipeline to use the `dlt.Schema` from the `dlt.Dataset` instance; this way, the schema should evolve when `Dataset.write()` is used
- Potential API (see the usage sketch after this list):

```py
def write(
    self: dlt.Dataset,
    data: TDataItems,
    *,
    table_name: str,
    write_disposition: TWriteDisposition = "append",
    normalize: bool = False,
) -> LoadInfo: ...
```

  `write_disposition` is useful to determine whether we should append or modify existing records; `normalize` lets the user decide whether to enable normalization (which might create more tables)
- Can accept a `dlt.Relation` as input
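A minimal usage sketch of the proposed API. The pipeline setup and the `pipeline.dataset()` accessor are existing dlt features; the `dataset.write()` call itself is the proposal and does not exist yet:

```py
import dlt

# existing dlt API: create a dataset with one table to write back into
pipeline = dlt.pipeline("demo", destination="duckdb", dataset_name="demo_data")
pipeline.run([{"id": 1, "value": "raw"}], table_name="records")
dataset = pipeline.dataset()

# proposed API (hypothetical): push manually corrected records back;
# "merge" is illustrative -- it would update matching records rather
# than append duplicates
info = dataset.write(
    [{"id": 1, "value": "reviewed"}],
    table_name="records",
    write_disposition="merge",
)
print(info)
```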
Out of scope
`Dataset.write()` doesn't have to support the `dlt.Pipeline.run()` method 1-to-1; if a user needs the full range of configuration, they should create a pipeline.
My take would be to make the internal pipeline used in `write` as invisible as possible, possibly using `pipelines_dir` to hide it from the command line and the dashboard. In essence, we pretend that this pipeline does not exist.
I changed the internal pipeline to be a context manager that uses a temporary directory as `pipelines_dir`.
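A minimal sketch of that approach, assuming the internal pipeline name from the spec (`_internal_pipeline` is a hypothetical helper; `pipelines_dir` is an existing `dlt.pipeline()` argument):

```py
import tempfile
from contextlib import contextmanager
from typing import Iterator

import dlt

@contextmanager
def _internal_pipeline(dataset_name: str, destination: str) -> Iterator[dlt.Pipeline]:
    # hypothetical helper: keep the pipeline's working files in a throwaway
    # location so it never shows up in `dlt pipeline list` or the dashboard
    with tempfile.TemporaryDirectory() as tmp_dir:
        yield dlt.pipeline(
            pipeline_name=f"_dlt_dataset_{dataset_name}",
            destination=destination,
            dataset_name=dataset_name,
            pipelines_dir=tmp_dir,  # isolates it from the default pipelines directory
        )
```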
Disable destination sync, state sync, and schema evolution (a total freeze on a table via contract).
I don't know exactly what I need to change / configure for destination and state sync (it doesn't seem to be in the kwargs for `dlt.pipeline()` and `pipeline.run()`).
For schema evolution, users should be able to modify the schema; for example, someone may want to add a column or cast types. Though, I would make a frozen schema the default and require users to explicitly change it (see the contract sketch below).
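For reference, dlt's existing schema contract mechanism can already express that default; a minimal sketch using the documented `schema_contract` argument (the surrounding `pipeline` and `data` are assumed):

```py
# existing dlt API: reject new tables, new columns, and data type changes
# unless the user explicitly relaxes the contract
info = pipeline.run(
    data,
    table_name="records",
    schema_contract={"tables": "freeze", "columns": "freeze", "data_type": "freeze"},
)
```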
Just had a meeting with Marcin. The plan is now (a sketch of these semantics follows the list):
- let's focus on `Dataset.write()` as a function to create standalone datasets (vs. using the dataset from the pipeline)
- the signature is `Dataset.write(data, table_name: str, overwrite: bool)`
- `overwrite=False` respects the write disposition of the underlying data
- `overwrite=True` does `refresh="drop_resources"` and writes the new schema into `_dlt_version`
- document in the docs that if users take the `Dataset` from the pipeline and introduce changes to the schema (a new table (append) or an overwrite), they need to call `pipeline.sync_destination()` to pull the changes into the pipeline
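A hedged sketch of what those semantics would look like at the call site (the `overwrite` flag is the meeting proposal, not shipped API):

```py
# proposed semantics (hypothetical; names taken from the meeting notes)
dataset.write(rows, table_name="records", overwrite=False)
# -> respects the write disposition of the underlying data

dataset.write(rows, table_name="records", overwrite=True)
# -> drops and recreates the table (refresh="drop_resources") and
#    writes the new schema version into _dlt_version
```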
## Follow-up
- Pipeline sync should be aware of this: schema sync needs to be decoupled from state sync. At the moment, new schemas won't automatically be picked up by a pipeline, because the pipeline compares local state to remote state, and the state doesn't reference the schema (the `_dlt_version` table).
- Also, `pipeline.sync_destination()` only syncs if the remote state version has changed (which it hasn't after `Dataset.write()`); in that case it calls `get_schema_from_destination(always_download=False)`, which just looks up the schema in local storage (where it hasn't changed).
- → Maybe we should introduce a `force_download` flag for the `sync_destination()` call :thinking: otherwise the schema sync won't happen.
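A sketch of that proposal (`sync_destination()` exists today, but the `force_download` argument is hypothetical):

```py
# proposed (not existing dlt API): force schemas to be re-downloaded even
# when the remote state version is unchanged
pipeline.sync_destination(force_download=True)
```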