Fix union breaking schema order

Open ilongin opened this issue 2 months ago • 8 comments

Union was breaking the schema order by mixing signals from multiple objects (e.g. File). This PR fixes that. Related Studio issue: https://github.com/iterative/studio/issues/12188

ilongin avatar Oct 13 '25 12:10 ilongin

Reviewer's Guide

Refactors the internal column validation logic to preserve original column ordering during dataset union operations and adds a unit test to ensure the schema order remains consistent after union.

Class diagram for updated _validate_columns function

classDiagram
class _validate_columns {
  +left_columns: Iterable[ColumnElement]
  +right_columns: Iterable[ColumnElement]
  +return: list[str]
}
class ColumnElement {
  +name: str
}
_validate_columns --> ColumnElement: uses

Flow diagram for column validation and schema order preservation

flowchart TD
    A["left_columns (Iterable[ColumnElement])"] --> B["Extract left_names (list)"]
    C["right_columns (Iterable[ColumnElement])"] --> D["Extract right_names (list)"]
    B --> E["Sort left_names"]
    D --> F["Sort right_names"]
    E --> G["Compare sorted left_names and right_names"]
    F --> G
    G -- "If equal" --> H["Return left_names"]
    G -- "If not equal" --> I["Compute missing columns"]
    I --> J["Prepare error message"]
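
The flow above can be sketched in plain Python. This is a minimal illustration, not the actual code in src/datachain/query/dataset.py: the `Column` dataclass is a hypothetical stand-in for SQLAlchemy's ColumnElement, and the function name, the error type and the message wording are assumptions.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a ColumnElement; only `name` matters here."""

    name: str


def validate_columns(left: Iterable[Column], right: Iterable[Column]) -> list[str]:
    # Collect names as lists so the original (left) order is kept.
    left_names = [c.name for c in left]
    right_names = [c.name for c in right]

    # Sorting makes the comparison order-insensitive without losing
    # the original order of left_names.
    if sorted(left_names) == sorted(right_names):
        return left_names

    # Sets are used only to compute the difference for the error message.
    missing_right = sorted(set(left_names) - set(right_names))
    missing_left = sorted(set(right_names) - set(left_names))
    parts = []
    if missing_right:
        parts.append(f"missing in right: {', '.join(missing_right)}")
    if missing_left:
        parts.append(f"missing in left: {', '.join(missing_left)}")
    details = "; ".join(parts) or "duplicate column names differ"
    raise ValueError(f"Cannot union: {details}")
```

The returned list follows the left-hand order, so whatever builds the union's select from it sees the columns in the order of the original chain.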

File-Level Changes

Change: Add test to verify schema order is preserved during union
Details:
  • Introduce test_union_does_not_break_schema_order in test_datachain.py
  • Define a Meta model and helper functions add_file and add_meta for test setup
  • Build two identical datasets, union them, save, and assert the final schema key order
Files: tests/unit/lib/test_datachain.py

Change: Refactor _validate_columns to maintain column order
Details:
  • Change return type from set[str] to list[str] for ordered output
  • Collect left and right column names as lists instead of sets
  • Compare sorted name lists for equality to detect matching schemas
  • Use sets derived from the lists to compute missing columns when schemas differ
Files: src/datachain/query/dataset.py

sourcery-ai[bot] avatar Oct 13 '25 12:10 sourcery-ai[bot]

Deploying datachain-documentation with Cloudflare Pages

Latest commit: a8eaa6a
Status: ✅  Deploy successful!
Preview URL: https://fd7c22b9.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-12188-union-schema-c.datachain-documentation.pages.dev

Codecov Report

Patch coverage is 42.85714%, with 4 lines in your changes missing coverage. Please review.

File with missing lines: src/datachain/query/dataset.py (patch coverage 42.85%; 3 lines missing and 1 partial)

codecov[bot] avatar Oct 13 '25 12:10 codecov[bot]

Why does it fix the issue, could you explain it please?

Specifically, I don't quite understand why the columns we have in a select / subquery would affect the signal schema attached to the chain / query. Is it the schema that defines the order, the signals, etc.? Or is that done in some other way?

Schema is derived directly from selected columns from built SQLAlchemy query. Note that this is the flattened schema (e.g. it has keys like file__path, file__size, etc.). We also have feature_schema, which defines higher-level objects such as file: File, but that one is not relevant to this issue.
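
To make "flattened" concrete, here is a small sketch of how nested signals map to keys like file__path. It is an illustration only: `flatten_schema`, the nested dict layout and the `meta` fields are hypothetical and not datachain's API. The only thing taken from the discussion is the `__` separator.

```python
def flatten_schema(
    nested: dict[str, dict[str, type]], sep: str = "__"
) -> dict[str, type]:
    """Flatten {signal: {field: type}} into {"signal__field": type}."""
    flat: dict[str, type] = {}
    # dicts keep insertion order, so fields of one signal stay together
    # and signals stay in the order they were declared.
    for signal, fields in nested.items():
        for field, typ in fields.items():
            flat[f"{signal}{sep}{field}"] = typ
    return flat


nested = {
    "file": {"path": str, "size": int},
    "meta": {"name": str, "count": int},
}
flat = flatten_schema(nested)
print(list(flat))
```

Because the flattened keys carry the grouping only through their order and prefix, reordering them (for example by passing them through a set) can interleave fields of different objects.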

At some point in the SQLUnion logic we validated and constructed the columns for the union, and the validation step used a set, which broke the original order.
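
A rough sketch of why an ordered list of names matters when the two sides of a union are lined up. This assumes the validated names are then used to select columns on both sides in the same order. That is a guess at the general shape, not a description of SQLUnion itself, and `align_to_left` is a hypothetical helper.

```python
def align_to_left(left_names: list[str], right_names: list[str]) -> list[int]:
    """For each left name, return the position of that name on the right side."""
    index = {name: i for i, name in enumerate(right_names)}
    return [index[name] for name in left_names]


left = ["file__path", "file__size", "meta__name", "meta__count"]
right = ["meta__count", "file__size", "file__path", "meta__name"]

# A row from the right side, in the right side's own column order.
right_row = (3, 100, "a.txt", "x")

positions = align_to_left(left, right)
aligned = tuple(right_row[i] for i in positions)
print(aligned)
```

If `left` were a set instead of a list, the iteration order would not be guaranteed, so the resulting column order (and the schema derived from it) could differ from the original chain.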

ilongin avatar Oct 13 '25 22:10 ilongin

Schema is derived directly from selected columns from built SQLAlchemy query.

could you point me to it please?

shcheklein avatar Oct 13 '25 22:10 shcheklein

Schema is derived directly from selected columns from built SQLAlchemy query.

could you point me to it please?

  1. Columns are created from the query and passed to create_dataset() -> https://github.com/iterative/datachain/blob/main/src/datachain/query/dataset.py#L1908-L1928
  2. Those columns are used to build the schema in create_dataset() -> https://github.com/iterative/datachain/blob/main/src/datachain/catalog/catalog.py#L834-L836

ilongin avatar Oct 13 '25 22:10 ilongin

that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...

shcheklein avatar Oct 13 '25 22:10 shcheklein

that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...

I've opened a separate PR where I'm experimenting with using the actual signal schema to calculate this. It doesn't seem to be super simple (I need more time to fix it and make sure nothing breaks): https://github.com/iterative/datachain/pull/1404

ilongin avatar Oct 14 '25 10:10 ilongin

Deploying with Cloudflare Workers

Name: datachain-docs
Status: ✅ Deployment successful!
Latest commit: 7355c001
Updated (UTC): Dec 01 2025, 10:09 PM