dvc icon indicating copy to clipboard operation
dvc copied to clipboard

Allow import-db within stages

Open fabiannagel opened this issue 1 year ago • 2 comments

I'm missing the possibility to run import-db as part of my pipeline. Consider the following scenario:

stages:
  
  convert:
    cmd: python convert.py
    deps:
      - data/raw_input_data
    outs:
      - data/converted_input_data

  ingest:
    cmd: python ingest.py && dvc import-db --table ingested_data --conn pgsql -o data/database
    deps:
      - data/coverted_input_data
    outs:
      - data/database

  run:
    cmd: python app.py
    deps:
      - data/database

ingest consumes my converted input data, applies some transformations and populates the application database. The state of the database is persisted via import-db, which is the data dependency for running the application.

Right now, dvc repro throws the following error with this config:

Running stage 'ingest':
> dvc import-db --table ingested_data --conn pgsql -o data/database
ERROR: output 'data/database' is already specified in stage: 'ingest'.
Use `dvc remove ingest` to stop tracking the overlapping output.
ERROR: failed to reproduce 'ingest': failed to run: dvc import-db --table ingested_data --conn pgsql -o data/database, exited with 255

It would be great to have a flag telling dvc import-db that it is part of a pipeline such that overlapping outputs are not an issue.

fabiannagel avatar Oct 01 '24 08:10 fabiannagel

Make sense to expand pipeline stages to be DB import (or other imports?), wdyt @skshetry ?

For now I would recommend to run the query directly via Python script. You can take a look into DbDependency implementation and get some SQL wrapper code from it (it should not be very complicated I think).

shcheklein avatar Oct 01 '24 18:10 shcheklein

There are two issues here with the pipelines:

  1. The ingest stage is modifying another stage/dependency during repro, and,
  2. We don't support overlapping outputs.
  1. can be worked-around by adding a cp command and adding a different output. stage supports multiple output after all.

But 1) is not something that we can support. Modifying another stage in another stage cannot be supported by dvc, at least not without reducing reproducibility guarantees.

Reading your dvc.yaml code and naming conventions, I assume you are doing ETL and/or data-engineering which dvc import-db is not meant to solve. It is meant to work downstream of those ETL and data engineering pipelines, with end result from those pipelines. You may be better served by other tools like dbt here. Or, you can export it yourself, which is not that complicated.

DVC does something like follows:

import pandas as pd

idfs = pd.read_sql("ingested_data", "postgresql://username:password@hostname/db_name", chunksize=10_000)
with open(file, mode="w") as f:
  for i, df in enumerate(idfs):
      df.to_csv(f, header=i == 0, index=False)

skshetry avatar Oct 09 '24 07:10 skshetry