Add many-to-many relationship between dataset versions and jobs

Open ilongin opened this issue 3 weeks ago • 2 comments

When a dataset is created in a job that fails, re-running the script creates a new job, but the dataset still references the old (failed) job. This makes the dataset invisible in subsequent job runs in the Studio UI. This PR implements a many-to-many relationship between dataset versions and jobs using a dataset_version_jobs junction table with an is_creator flag. It also fixes an existing issue in the checkpoint code: when a checkpoint is found, which dataset version should be returned? Currently we return the latest version, but that is wrong because it might not be the one created in this job chain (it may have been created by some unrelated job). In this PR we instead try to find the version that was created in this specific job chain / hierarchy.

Changes:

  • Added dataset_version_jobs table to metastore schemas
  • Added link_dataset_version_to_job(), get_ancestor_job_ids(), and get_dataset_version_for_job_ancestry() metastore methods
  • Updated DataChain._resolve_checkpoint() to use job ancestry for finding dataset versions
  • Updated DatasetQuery.save() to link datasets to jobs on creation
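To illustrate the ancestry-based lookup, here is a minimal in-memory sketch. It is not the code from this PR: the function names are borrowed from the list above, but their signatures, the JOB_PARENTS and DATASET_VERSION_JOBS structures, and the rule of matching only creator links are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical stand-in for the jobs table: job_id -> parent job_id
# (None for the first run in a chain).
JOB_PARENTS = {"job-1": None, "job-2": "job-1", "job-3": "job-2", "job-x": None}

# Hypothetical stand-in for the junction table rows:
# (dataset_name, version, job_id, is_creator)
DATASET_VERSION_JOBS = [
    ("cats", "1.0.1", "job-1", True),   # created by the first run in the chain
    ("cats", "1.0.1", "job-2", False),  # reused via checkpoint in a re-run
    ("cats", "1.0.2", "job-x", True),   # latest version, from an unrelated job
]


def get_ancestor_job_ids(job_id: str) -> list[str]:
    """Walk parent links upward, nearest ancestor first."""
    ancestors: list[str] = []
    seen = {job_id}  # guards against accidental cycles in the parent links
    parent = JOB_PARENTS.get(job_id)
    while parent is not None and parent not in seen:
        ancestors.append(parent)
        seen.add(parent)
        parent = JOB_PARENTS.get(parent)
    return ancestors


def get_dataset_version_for_job_ancestry(name: str, job_id: str) -> Optional[str]:
    """Return the version created by this job or its nearest ancestor."""
    for jid in [job_id, *get_ancestor_job_ids(job_id)]:
        for ds_name, version, linked_job, is_creator in DATASET_VERSION_JOBS:
            if ds_name == name and linked_job == jid and is_creator:
                return version
    return None
```

In this sketch, resolving "cats" for job-3 yields 1.0.1 (created by its ancestor job-1) rather than 1.0.2, the latest version, which belongs to the unrelated job-x.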

The existing job_id column in dataset_versions table remains unchanged for backward compatibility and can be deprecated in a future release. Jobs that create datasets are linked with is_creator=True, while jobs that reuse datasets via checkpoints are linked with is_creator=False.
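The junction table and the is_creator flag can be sketched with the standard-library sqlite3 module. The column names other than is_creator, the column types, the composite primary key, and the signature of link_dataset_version_to_job() below are assumptions for illustration, not the schema added by this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed shape of the junction table: one row per (dataset version, job) pair.
conn.execute(
    """
    CREATE TABLE dataset_version_jobs (
        dataset_version_id INTEGER NOT NULL,
        job_id TEXT NOT NULL,
        is_creator INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (dataset_version_id, job_id)
    )
    """
)


def link_dataset_version_to_job(conn, dataset_version_id, job_id, is_creator=False):
    """Record a link; linking the same pair twice is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO dataset_version_jobs VALUES (?, ?, ?)",
        (dataset_version_id, job_id, int(is_creator)),
    )


# The failed job created the version; the re-run reuses it via a checkpoint.
link_dataset_version_to_job(conn, 1, "job-failed", is_creator=True)
link_dataset_version_to_job(conn, 1, "job-rerun")

rows = conn.execute(
    "SELECT job_id, is_creator FROM dataset_version_jobs "
    "WHERE dataset_version_id = ? ORDER BY job_id",
    (1,),
).fetchall()
```

With both links in place, the same dataset version can be looked up from either job, which is the visibility problem described at the top of this PR.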

ilongin avatar Nov 26 '25 10:11 ilongin

Codecov Report

:x: Patch coverage is 91.17647% with 9 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/datachain/data_storage/metastore.py | 85.00% | 6 Missing and 3 partials :warning:


codecov[bot] avatar Nov 26 '25 10:11 codecov[bot]

Deploying with Cloudflare Workers


Status | Name | Latest Commit | Updated (UTC)
❌ Deployment failed | datachain-docs | 6dc37398 | Dec 03 2025, 02:20 AM

Deploying datachain with Cloudflare Pages

Latest commit: dd9070b
Status: ✅  Deploy successful!
Preview URL: https://d7ad99c3.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-1477-fix-dataset-job.datachain-2g6.pages.dev
