Add many-to-many relationship between dataset versions and jobs
When a dataset is created in a job that fails, re-running the script creates a new job, but the dataset still references the old (failed) job. This makes the dataset invisible in subsequent job runs in the Studio UI. This PR implements a many-to-many relationship between dataset versions and jobs using a `dataset_version_jobs` junction table with an `is_creator` flag.

It also fixes an existing issue in the checkpoint code: when a checkpoint is found, which dataset version should be returned? Currently we return the latest version, but that is wrong because it might not be the one created in this job chain (it may have been created by some unrelated job). In this PR we try to find the correct version created in this specific job chain / hierarchy.
Changes:
- Added `dataset_version_jobs` table to metastore schemas
- Added `link_dataset_version_to_job()`, `get_ancestor_job_ids()`, and `get_dataset_version_for_job_ancestry()` metastore methods
- Updated `DataChain._resolve_checkpoint()` to use job ancestry for finding dataset versions
- Updated `DatasetQuery.save()` to link datasets to jobs on creation
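To illustrate the ancestry-based lookup, here is a minimal in-memory sketch. It is not the metastore implementation: the `PARENTS` and `CREATED` structures, the helper names `ancestor_job_ids` and `version_for_ancestry`, the assumption that each job stores a single parent job id, and the choice of the highest matching version are all hypothetical and used only for illustration.

```python
from typing import Optional

# Hypothetical data: job id -> parent job id (None for the first run in a chain).
PARENTS = {"job-1": None, "job-2": "job-1", "job-3": "job-2", "other": None}

# Hypothetical creator links: (dataset name, version, id of the job that created it).
CREATED = [("cats", 1, "job-1"), ("cats", 2, "other")]


def ancestor_job_ids(job_id: str) -> list[str]:
    """Walk parent links upwards, nearest ancestor first."""
    chain: list[str] = []
    seen: set[str] = set()
    current = PARENTS.get(job_id)
    # The `seen` set guards against accidental cycles in the parent links.
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = PARENTS.get(current)
    return chain


def version_for_ancestry(name: str, job_id: str) -> Optional[int]:
    """Return the highest version of `name` created by this job or an ancestor."""
    chain = {job_id, *ancestor_job_ids(job_id)}
    versions = [v for n, v, creator in CREATED if n == name and creator in chain]
    return max(versions) if versions else None
```

In this sketch, `version_for_ancestry("cats", "job-2")` picks version 1, created by the ancestor `job-1`, and ignores version 2, which is the latest overall but was created by the unrelated job `other`.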
The existing `job_id` column in the `dataset_versions` table remains unchanged for backward compatibility and can be deprecated in a future release.
Jobs that create datasets are linked with `is_creator=True`, while jobs that reuse datasets via checkpoints are linked with `is_creator=False`.
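The linking rule can be sketched with a small SQLite example. Only the table name `dataset_version_jobs` and the `is_creator` flag come from this PR; the other column names, their types, the primary key, and the helpers `link` and `jobs_for_version` are assumptions made for illustration and may differ from the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed junction table layout: one row per (dataset version, job) pair.
conn.execute(
    """
    CREATE TABLE dataset_version_jobs (
        dataset_version_id INTEGER NOT NULL,
        job_id TEXT NOT NULL,
        is_creator INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (dataset_version_id, job_id)
    )
    """
)


def link(conn, dataset_version_id, job_id, is_creator=False):
    """Link a dataset version to a job; repeated links for the same pair are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO dataset_version_jobs VALUES (?, ?, ?)",
        (dataset_version_id, job_id, int(is_creator)),
    )


def jobs_for_version(conn, dataset_version_id):
    """Return (job_id, is_creator) pairs for one dataset version, ordered by job id."""
    rows = conn.execute(
        "SELECT job_id, is_creator FROM dataset_version_jobs "
        "WHERE dataset_version_id = ? ORDER BY job_id",
        (dataset_version_id,),
    ).fetchall()
    return [(job_id, bool(flag)) for job_id, flag in rows]


# The failed job created the version; the re-run job reuses it via a checkpoint.
link(conn, 1, "job-a", is_creator=True)
link(conn, 1, "job-b")
link(conn, 1, "job-b")  # linking the same pair again adds no new row
```

With these rows in place, the dataset version is reachable from both the failed job and its re-run, which is the visibility problem described at the top of this PR.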
Codecov Report
:x: Patch coverage is 91.17647% with 9 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/data_storage/metastore.py | 85.00% | 6 Missing and 3 partials :warning: |