dolt feat(import): validate primary keys early

NOTE: currently this is just checking primary keys, but the related issue also mentioned in #1083 references issues with number of args. I'm wondering if I should split that into two separate PRs, or if it's fine to do both - currently I only have the PK validation in place? Personally I would argue to separate them since they are slightly different, albeit similar. But happy to take either path.

Add early validation to check if specified primary keys exist in the import file's schema before processing rows. This prevents users from waiting for large files to be processed only to discover that their primary key column names are invalid.

Changes:

Add validatePrimaryKeysAgainstSchema function to check primary key existence against file schema
Integrate validation into newImportDataReader for create operations
Provide helpful error messages listing available columns when primary keys are not found
Add unit tests covering various validation scenarios
Add BATS integration tests for CSV, PSV, and large file scenarios

The validation only runs for create operations when primary keys are explicitly specified and no schema file is provided. This ensures fast failure while maintaining backward compatibility.

Before: Users waited minutes for large files to process before seeing "provided primary key not found" errors

After: Users get immediate feedback with helpful column suggestions

Closes: #1083

Jun 13 '25 06:06 codeaucafe

A quick once over of the

This change has broken four existing bats tests.

not ok 954 import-create-tables: try to table import with nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 257)
#   `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 955 import-create-tables: try to table import with one valid and one nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 264)
#   `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
ok 956 import-create-tables: create a table with two primary keys from csv import
ok 957 import-create-tables: import data from psv and create the table
ok 958 import-create-tables: import table using --delim
not ok 959 import-create-tables: create a table with a name map
# (in test file ./import-create-tables.bats, line 313)
#   `[ "$status" -eq 0 ]' failed
# Successfully initialized dolt data repository.
not ok 960 import-create-tables: use a name map with missing and extra entries
# (in test file ./import-create-tables.bats, line 331)
#   `[ "$status" -eq 0 ]' failed

And your two new tests fail:

not ok 1040 import-tables: validate primary keys exist in CSV file (issue #1083)
# (in test file ./import-tables.bats, line 69)
#   `[[ "$output" =~ "primary keys \[invalid1 invalid2\] not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 1041 import-tables: primary key validation with schema file should skip validation
# (in test file ./import-tables.bats, line 152)
#   `[ "$status" -eq 0 ]' failed

These need to be fixed before merge.

Jun 13 '25 20:06 timsehn

Thanks @timsehn, I'll fix those. Should I add more validations in this PR or just stick with primary key validation for now and break up other checks (like the related issue referenced in the issue regarding number of args) into another PR?

Jun 13 '25 22:06 codeaucafe

I think this is fine as an isolated change. I would make new PRs for additional checks.

Jun 13 '25 22:06 timsehn

Hey just bumping this up just in case y'all missed my replies/comments

Jun 23 '25 22:06 codeaucafe

Thank you! I just pushed the commit resolving your suggestions. Let me know if I missed anything.

Jun 24 '25 06:06 codeaucafe

Thank you!

Jun 25 '25 00:06 codeaucafe