Fix union breaking schema order
Union was breaking schema order by mixing signals from multiple objects (e.g. File). This PR fixes the issue.
Related Studio issue: https://github.com/iterative/studio/issues/12188
Reviewer's Guide
Refactors the internal column validation logic to preserve original column ordering during dataset union operations and adds a unit test to ensure the schema order remains consistent after union.
Class diagram for updated _validate_columns function
classDiagram
class _validate_columns {
+left_columns: Iterable[ColumnElement]
+right_columns: Iterable[ColumnElement]
+return: list[str]
}
class ColumnElement {
+name: str
}
_validate_columns --> ColumnElement: uses
Flow diagram for column validation and schema order preservation
flowchart TD
A["left_columns (Iterable[ColumnElement])"] --> B["Extract left_names (list)"]
C["right_columns (Iterable[ColumnElement])"] --> D["Extract right_names (list)"]
B --> E["Sort left_names"]
D --> F["Sort right_names"]
E --> G["Compare sorted left_names and right_names"]
F --> G
G -- "If equal" --> H["Return left_names"]
G -- "If not equal" --> I["Compute missing columns"]
I --> J["Prepare error message"]
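The flow above can be sketched roughly as follows. This is a minimal illustration, not the code from src/datachain/query/dataset.py: it works on plain column names instead of SQLAlchemy ColumnElement objects, and the function name validate_column_names and the error wording are hypothetical.

```python
from collections.abc import Iterable


def validate_column_names(left: Iterable[str], right: Iterable[str]) -> list[str]:
    """Hypothetical stand-in for _validate_columns, using plain column names."""
    left_names = list(left)
    right_names = list(right)

    # Sorting is used only for the comparison; the returned list keeps
    # the original order of the left side.
    if sorted(left_names) == sorted(right_names):
        return left_names

    # Compute what each side lacks, keeping source order in the message.
    missing_right = [n for n in left_names if n not in right_names]
    missing_left = [n for n in right_names if n not in left_names]

    parts = []
    if missing_right:
        parts.append(f"missing in right: {', '.join(missing_right)}")
    if missing_left:
        parts.append(f"missing in left: {', '.join(missing_left)}")
    if not parts:
        # Same names on both sides but different multiplicity (duplicates).
        parts.append("column counts differ")
    raise ValueError("Cannot perform union: " + "; ".join(parts))
```

Sorting copies for the equality check keeps the validation order-insensitive while leaving the left-hand order untouched in the return value.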
File-Level Changes
| Change | Details | Files |
|---|---|---|
| Add test to verify schema order is preserved during union | | tests/unit/lib/test_datachain.py |
| Refactor _validate_columns to maintain column order | | src/datachain/query/dataset.py |
Codecov Report
Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/query/dataset.py | 42.85% | 3 Missing and 1 partial |
Why does it fix the issue? Could you explain, please?
Specifically, I don't quite understand why the columns we have in select/subquery would affect the signal schema attached to the chain/query. Is it the schema that defines the order, signals, etc.? Or is that done in some other way?
The schema is derived directly from the columns selected in the built SQLAlchemy query. Note that this is the flattened schema (e.g. it has keys like file__path, file__size, etc.). We also have feature_schema, which defines higher-level objects such as file: File, but that one is not relevant to this issue.
At some point in the SQLUnion logic we validate and construct the columns for the union, and during validation we were using a set, which broke the original order.
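To illustrate why going through a set is a problem, here is a small standalone sketch. The column names are made up for the example and none of this is the actual SQLUnion code; it only shows that a set keeps membership but gives no guarantee about iteration order, while a list-based pass keeps the left-hand order.

```python
# Hypothetical flattened column names from two chains being unioned.
left = ["sys__id", "file__source", "file__path", "file__size", "score"]
right = ["score", "file__size", "file__path", "file__source", "sys__id"]

# Set-based approach: the names are all there, but the iteration order
# of a set is arbitrary and can differ from the original column order.
via_set = list(set(left) & set(right))

# Order-preserving approach: use the set only for membership checks
# and walk the left-hand list to build the result.
right_lookup = set(right)
ordered = [name for name in left if name in right_lookup]

print(ordered)
```

Because the schema is derived from the selected columns, any step that reorders those columns reorders the schema as well.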
Schema is derived directly from selected columns from built SQLAlchemy query.
could you point me to it please?
- Columns created from the query and sent to create_dataset(): https://github.com/iterative/datachain/blob/main/src/datachain/query/dataset.py#L1908-L1928
- Columns used to create the schema in create_dataset(): https://github.com/iterative/datachain/blob/main/src/datachain/catalog/catalog.py#L834-L836
that's really weird, why aren't we using signal schema? it feels it can be tricky to preserve and guarantee order of columns in all these subqueries and selects ...
I've added a separate PR where I'm experimenting with using the actual signals schema to calculate this. It doesn't seem super simple (I need more time to fix it and make sure it doesn't break anything): https://github.com/iterative/datachain/pull/1404