
feat: `Dataset.write()`

Open zilto opened this issue 2 months ago • 3 comments

Users of dlt.Dataset want a simple way to write data back to the dataset.

Use cases:

  • manually review data and push corrected records
  • simple way to add records if you don't have access to the original dlt.Pipeline used to create the dataset

Other motivations

This interface will simplify data-centric operations involved in:

  • storing data quality check results on the destination
  • creating a graph of datasets where the dataset's "internal pipeline" is used
  • integrating with orchestration frameworks

Specs

  • Look at WritableDataset.save() from dlt-plus
  • Add Dataset.write() in dlt (this aligns with the pipeline.run() operation)
    • Alternatives: .write_to(), .load_into(), .load_table()
  • create an internal dlt.Pipeline named _dlt_dataset_{dataset_name}
  • find a way for the internal pipeline to use the dlt.Schema from the dlt.Dataset instance, so that the schema evolves when Dataset.load() is used
  • potential API
    def write(
        self: dlt.Dataset,
        data: TDataItems,
        *,
        table_name: str,
        write_disposition: TWriteDisposition = "append",
        normalize: bool = False,
    ) -> LoadInfo: ...
    
    • write_disposition determines whether we append new records or modify existing ones
    • normalize lets the user enable normalization (which might create additional tables)
  • can accept a dlt.Relation as input (see the sketch after this list)
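
A minimal sketch of how write() could be wired to an internal pipeline. The dlt.pipeline() and pipeline.run() arguments are real; the _destination and _dataset_name accessors are assumptions, and normalize=False is only approximated here with a frozen table contract:

    import dlt
    from dlt.common.pipeline import LoadInfo
    from dlt.common.typing import TDataItems
    from dlt.common.schema.typing import TWriteDisposition

    def write(
        self,  # dlt.Dataset
        data: TDataItems,
        *,
        table_name: str,
        write_disposition: TWriteDisposition = "append",
        normalize: bool = False,
    ) -> LoadInfo:
        # Internal pipeline named after the dataset, as specified above.
        # `_destination` and `_dataset_name` are hypothetical accessors.
        pipeline = dlt.pipeline(
            pipeline_name=f"_dlt_dataset_{self._dataset_name}",
            destination=self._destination,
            dataset_name=self._dataset_name,
        )
        # Approximate normalize=False by freezing the table contract so the
        # normalizer cannot create new (child) tables. Note: this would also
        # block the very first load into a brand-new table.
        schema_contract = None if normalize else {"tables": "freeze"}
        return pipeline.run(
            data,
            table_name=table_name,
            write_disposition=write_disposition,
            schema_contract=schema_contract,
        )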

Out of scope

  • Dataset.load() doesn't have to support the dlt.Pipeline.run() method 1-to-1; if a user needs the full range of configuration, they should create a pipeline

zilto commented Sep 16 '25 21:09



My take would be to make the internal pipeline used in write() as invisible as possible, possibly using pipelines_dir to hide it from the command line and the dashboard. In essence, we pretend that this pipeline does not exist.

I changed the internal pipeline to be a context manager that uses a temporary directory as pipelines_dir
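
A minimal sketch of that context manager; pipelines_dir is a real dlt.pipeline() argument, while the helper name and the internal-pipeline naming follow the spec above:

    import tempfile
    from contextlib import contextmanager

    import dlt

    @contextmanager
    def _internal_pipeline(dataset_name: str, destination):
        # Keep the pipeline's working state in a throwaway directory so it
        # never shows up in `dlt pipeline` listings or the dashboard.
        with tempfile.TemporaryDirectory() as tmp_dir:
            yield dlt.pipeline(
                pipeline_name=f"_dlt_dataset_{dataset_name}",
                pipelines_dir=tmp_dir,
                destination=destination,
                dataset_name=dataset_name,
            )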

disable destination sync, state sync and schema evolution (a total freeze on a table via contract)

I don't know exactly what I need to change or configure for destination and state sync (neither seems to be among the kwargs for dlt.pipeline() or pipeline.run()).

For schema evolution, users should be able to modify the schema, for example to add a column or cast types. That said, I would make a frozen schema the default and require users to change it explicitly.
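
A sketch of that default using dlt's existing schema contracts on the internal pipeline's run() call; the surrounding write() plumbing (pipeline, data) is assumed:

    # Frozen by default: new tables, new columns, and type changes are rejected.
    info = pipeline.run(data, table_name="reviews", schema_contract="freeze")

    # Explicit opt-in to evolution, e.g. when the user adds a column or casts a type.
    info = pipeline.run(data, table_name="reviews", schema_contract="evolve")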

zilto commented Sep 20 '25 01:09

Just had a meeting with Marcin. The plan is now:

  • let's focus on Dataset.write() as a function to create standalone Datasets (vs. using the dataset from the pipeline)
  • the signature is Dataset.write(data, table_name: str, overwrite: bool)
  • overwrite=False respects the write disposition of the underlying data
  • overwrite=True does refresh="drop_resources" and writes the new schema into _dlt_version
  • document in the docs that if users take the Dataset from the pipeline and introduce schema changes (a new table (append) or an overwrite), they need to call pipeline.sync_destination() to pull the changes into the pipeline (see the sketch after this list)
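
A sketch of how overwrite could map onto existing pipeline.run() options; refresh="drop_resources" is a real run() argument, while the _internal_pipeline helper (sketched earlier) and the attribute names are assumptions:

    def write(self, data, table_name: str, overwrite: bool = False):
        with _internal_pipeline(self._dataset_name, self._destination) as pipeline:
            if overwrite:
                # Drop the target table(s) and their resource state, load
                # fresh, and record the new schema version in _dlt_version.
                return pipeline.run(data, table_name=table_name, refresh="drop_resources")
            # Respect the write disposition carried by the data (append by default).
            return pipeline.run(data, table_name=table_name)

A pipeline that originally produced the dataset would then call pipeline.sync_destination() to pull these schema changes, per the documentation note above.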

Follow-up:

  • pipeline sync should be aware that schema sync needs to be decoupled from state sync: at the moment, new schemas won't automatically be picked up by a pipeline, because the pipeline compares local state to remote state, and the state doesn't reference the schema (the _dlt_version table)

djudjuu commented Nov 27 '25 13:11

Also: pipeline.sync_destination() only syncs if the remote state version has changed, which it hasn't after dataset.write(). In that case it calls get_schema_from_destination(always_download=False), which just looks up the schema in local storage (where it hasn't changed). Maybe we should introduce a force_download flag on the sync_destination call :thinking: otherwise the schema sync won't happen.
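
A sketch of the proposed call; the force_download keyword is hypothetical and does not exist today:

    # Today: effectively a no-op for the schema, because the remote state
    # version is unchanged and the locally cached schema is returned.
    pipeline.sync_destination()

    # Proposed: bypass the version check and re-download the schema from the
    # destination's _dlt_version table.
    pipeline.sync_destination(force_download=True)  # hypothetical flag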

djudjuu commented Dec 01 '25 14:12