PyAirbyte icon indicating copy to clipboard operation
PyAirbyte copied to clipboard

fix: Allow composite primary key overrides in PyAirbyte

Open aaronsteers opened this issue 4 months ago • 6 comments

fix: Allow composite primary keys in PyAirbyte

Summary

Fixed a bug in PyAirbyte's primary key validation that incorrectly rejected composite primary keys like ['id', 'category'] with the error "Nested primary keys are not supported. Each PK column should have exactly one node."

The validation logic in catalog_providers.py was checking if each primary key had exactly one column (len(pk_nodes) != 1), which blocked composite keys. The intent was to prevent nested field references like ['data.id'], not composite keys.

Changes made:

  • Modified get_primary_keys() validation in airbyte/shared/catalog_providers.py
  • Changed from checking column count to checking for dot notation (nested field indicators)
  • Composite keys like ['id', 'category'] now work correctly
  • Nested keys like ['data.id'] are still properly rejected

Review & Testing Checklist for Human

⚠️ Risk Level: Medium - Core validation logic change affecting all PyAirbyte users

  • [ ] Verify validation logic is correct: Confirm that any("." in node for node in pk_nodes) properly distinguishes between composite keys (['id', 'category']) and nested keys (['data.id'])
  • [ ] Test with real connectors: Try setting composite primary keys on actual source connectors to ensure no regressions in production workflows
  • [ ] Edge case testing: Test with keys containing special characters, numbers, or unusual naming patterns to ensure robustness
  • [ ] Integration test coverage: Check if integration tests need updates to cover composite primary key scenarios

Recommended test plan:

  1. Run the reproduction script with MotherDuck API key to verify the original issue is resolved
  2. Test composite primary keys with 2-3 different source connectors
  3. Verify single-column primary keys still work normally
  4. Test that truly nested keys (with dots) are still rejected appropriately

Diagram

%%{ init : { "theme" : "default" }}%%
graph TB
    repro["reproduce_null_pk_issue.py<br/>(test script)"]:::context
    source_base["airbyte/sources/base.py<br/>set_primary_keys()"]:::context
    catalog_providers["airbyte/shared/catalog_providers.py<br/>get_primary_keys()"]:::major-edit
    test_overrides["tests/unit_tests/sources/<br/>test_source_key_overrides.py"]:::context
    
    repro -->|"calls set_primary_keys(['id', 'category'])"| source_base
    source_base -->|"stores as _primary_key_overrides"| catalog_providers
    catalog_providers -->|"validates primary key format"| test_overrides
    
    catalog_providers -->|"OLD: len(pk_nodes) != 1<br/>NEW: any('.' in node)"| catalog_providers
    
    subgraph Legend
        L1[Major Edit]:::major-edit
        L3[Context/No Edit]:::context
    end

classDef major-edit fill:#90EE90
classDef context fill:#FFFFFF

Notes

This fix resolves the issue reported by AJ Steers where composite primary keys were incorrectly blocked by PyAirbyte's validation. The change is minimal (2 lines) but affects a core validation function used throughout the system.

Key validation change:

  • Before: if len(pk_nodes) != 1: (rejected any multi-column key)
  • After: if any("." in node for node in pk_nodes): (only rejects nested field references)

All existing unit tests pass, including the specific composite primary key test that was previously failing. The reproduction script now successfully processes composite primary keys with both DuckDB and Snowflake destinations.

Session details: Requested by AJ Steers (@aaronsteers) in Devin session: https://app.devin.ai/sessions/bc5912d0a95e499dab8e86e329f1face

Summary by CodeRabbit

  • Bug Fixes
    • Improved validation for primary keys to ensure only top-level fields are accepted, providing clearer error messages when nested fields are used.

aaronsteers avatar Aug 01 '25 17:08 aaronsteers