feat(import): validate primary keys early
NOTE: currently this is just checking primary keys, but the related issue also mentioned in #1083 references issues with number of args. I'm wondering if I should split that into two separate PRs, or if it's fine to do both - currently I only have the PK validation in place? Personally I would argue to separate them since they are slightly different, albeit similar. But happy to take either path.
Add early validation to check if specified primary keys exist in the import file's schema before processing rows. This prevents users from waiting for large files to be processed only to discover that their primary key column names are invalid.
Changes:
- Add validatePrimaryKeysAgainstSchema function to check primary key existence against file schema
- Integrate validation into newImportDataReader for create operations
- Provide helpful error messages listing available columns when primary keys are not found
- Add unit tests covering various validation scenarios
- Add BATS integration tests for CSV, PSV, and large file scenarios
The validation only runs for create operations when primary keys are explicitly specified and no schema file is provided. This ensures fast failure while maintaining backward compatibility.
Before: Users waited minutes for large files to process before seeing "provided primary key not found" errors
After: Users get immediate feedback with helpful column suggestions
Closes: #1083
A quick once over of the
This change has broken four existing bats tests.
not ok 954 import-create-tables: try to table import with nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 257)
# `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 955 import-create-tables: try to table import with one valid and one nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 264)
# `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
ok 956 import-create-tables: create a table with two primary keys from csv import
ok 957 import-create-tables: import data from psv and create the table
ok 958 import-create-tables: import table using --delim
not ok 959 import-create-tables: create a table with a name map
# (in test file ./import-create-tables.bats, line 313)
# `[ "$status" -eq 0 ]' failed
# Successfully initialized dolt data repository.
not ok 960 import-create-tables: use a name map with missing and extra entries
# (in test file ./import-create-tables.bats, line 331)
# `[ "$status" -eq 0 ]' failed
And your two new tests fail:
not ok 1040 import-tables: validate primary keys exist in CSV file (issue #1083)
# (in test file ./import-tables.bats, line 69)
# `[[ "$output" =~ "primary keys \[invalid1 invalid2\] not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 1041 import-tables: primary key validation with schema file should skip validation
# (in test file ./import-tables.bats, line 152)
# `[ "$status" -eq 0 ]' failed
These need to be fixed before merge.
Thanks @timsehn, I'll fix those. Should I add more validations in this PR or just stick with primary key validation for now and break up other checks (like the related issue referenced in the issue regarding number of args) into another PR?
I think this is fine as an isolated change. I would make new PRs for additional checks.
Hey just bumping this up just in case y'all missed my replies/comments
Thank you! I just pushed the commit resolving your suggestions. Let me know if I missed anything.
Thank you!