[Enhancement] Allow disable type auto-detection on CSV Files
Why I'm doing:
When importing CSV without sampling the whole file, it is possible to get type errors when inserting from `FILES()`. Workarounds include setting the conflicting data to null, which is not ideal.
What I'm doing:
This PR allows the user to disable type auto-detection and return all columns as strings, which can then be manipulated with SQL functions to obtain the desired data types.
Fixes #66473
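With the option disabled, every column comes back as a string and can be cast explicitly. A rough sketch of the intended usage (the S3 path is a placeholder, and the `$1`-style column names and exact casts are illustrative assumptions, not confirmed by this PR):

```sql
SELECT CAST($1 AS INT)       AS id,        -- cast back to the intended type
       CAST($2 AS DATETIME)  AS created_at,
       $3                    AS note       -- keep as string
FROM FILES(
    "path" = "s3://bucket/data/*.csv",     -- placeholder location
    "format" = "csv",
    "auto_detect_types" = "false"          -- all sampled columns become STRING
);
```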
What type of PR is this:
- [ ] BugFix
- [ ] Feature
- [x] Enhancement
- [ ] Refactor
- [ ] UT
- [ ] Doc
- [ ] Tool
Does this PR entail a change in behavior?
- [ ] Yes, this PR will result in a change in behavior.
- [x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
- [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
- [ ] Parameter changes: default values, similar parameters but with different default values
- [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
- [ ] Feature removed
- [ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
- [x] I have added test cases for my bug fix or my new feature
- [x] This PR needs user documentation (for new or modified features or behaviors)
- [x] I have added documentation for my new feature or new function
- [ ] This is a backport PR
Bugfix cherry-pick branch check:
- [x] I have checked the version labels which the pr will be auto-backported to the target branch
- [x] 4.0
- [x] 3.5
- [x] 3.4
- [x] 3.3
> [!NOTE]
> Adds an `auto_detect_types` CSV option to bypass type inference (treat sampled columns as STRING), wired FE→Thrift→BE with tests and docs.

- Schema detection (CSV):
  - Add an `auto_detect_types` property in `FILES()` to control type inference (default `true`).
  - If `false`, the BE returns `VARCHAR` for all sampled columns.
- Frontend (FE):
  - Parse and validate `auto_detect_types` in `TableFunctionTable` and pass it via `TBrokerScanRangeParams.schema_sample_types`.
- Thrift:
  - Add `TBrokerScanRangeParams.schema_sample_types` (default `true`).
- Backend (BE):
  - Update `CSVScanner` schema sampling to respect `schema_sample_types`.
- Tests:
  - FE: unit tests for `auto_detect_types` parsing.
  - BE: schema tests for both enabled and disabled cases; add a `type_sniff.csv` fixture.
- Docs:
  - EN/JA/ZH: document `auto_detect_types` and the behavior when disabled.

Written by Cursor Bugbot for commit 88e6cce0b605a7bde467c7aff08a7138aeb90c6e.
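The Thrift leg of the wiring described above could look roughly like the following sketch. The field id (`300`) and surrounding struct contents are illustrative; only the field name `schema_sample_types` and its `true` default come from this PR.

```thrift
// Sketch only: real field ids and neighbors live in the actual .thrift file.
struct TBrokerScanRangeParams {
  // ... existing fields elided ...

  // When false, the BE skips type inference during CSV schema sampling
  // and reports every sampled column as VARCHAR.
  300: optional bool schema_sample_types = true;
}
```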
🧪 CI Insights
Here's what we observed from your CI run for 88e6cce0.
🟢 All jobs passed!
But CI Insights is watching 👀
@cursor review
Quality Gate passed
Issues
2 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
[Java-Extensions Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[FE Incremental Coverage Report]
:white_check_mark: pass : 10 / 10 (100.00%)
file detail
| | path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|---|
| :large_blue_circle: | com/starrocks/catalog/TableFunctionTable.java | 10 | 10 | 100.00% | [] |
[BE Incremental Coverage Report]
:white_check_mark: pass : 7 / 7 (100.00%)
file detail
| | path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|---|
| :large_blue_circle: | be/src/exec/file_scanner/csv_scanner.cpp | 7 | 7 | 100.00% | [] |
@rad-pat Overall looks good.
One step further is needed: if type auto-detection is turned off, the BE should skip the schema sampling completely.
@kevincai We still need to know the number of columns though, right?
@rad-pat Yeah, you are right. Because schema auto-detection is disabled, the sampling can be as simple as one line from any file. `auto_detect_sample_rows` can be overwritten to 1, and `auto_detect_sample_files` can be overwritten to 1. This will reduce the overhead to a minimum when type detection is off.
@kevincai, having thought about this further, it probably only makes sense to force sampling of just one row. There could be a case where a user is loading CSV files that sometimes have fewer columns, and if such a file is sampled then not all columns would be available.
e.g. CSV1, four columns:

```
a,b,c,d
1,2,3,4
```

CSV2, six columns:

```
a,b,c,d,e,f
1,2,3,4,5,6
```
If only CSV1 is sampled, we get just 4 columns; if CSV2 is sampled, we get all six. If both files are sampled, we should also get all six columns. So it is probably best to let the user decide how many files should be sampled, but only sample one row from each. Does that make sense?
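The column-count concern above can be sketched outside StarRocks: one row per file is enough to count columns, but sampling only one file can undercount. A minimal Python illustration (the function name and in-memory files are hypothetical, not code from this PR):

```python
import csv
import io

def sample_column_count(files, sample_files=2):
    """Read one row from each sampled file and keep the widest column count.

    `files` is a list of file-like objects. Sampling only the first file
    would miss columns that appear only in wider files.
    """
    max_cols = 0
    for f in files[:sample_files]:
        row = next(csv.reader(f), [])  # one row is enough to count columns
        max_cols = max(max_cols, len(row))
    return max_cols

csv1 = io.StringIO("a,b,c,d\n1,2,3,4\n")
csv2 = io.StringIO("a,b,c,d,e,f\n1,2,3,4,5,6\n")

print(sample_column_count([csv1, csv2]))  # 6: the wider file wins
print(sample_column_count([io.StringIO("a,b,c,d\n")], sample_files=1))  # 4
```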
Actually, maybe both should be left for the user to set. Sampling more rows determines whether a column may be nullable; sampling more files yields the correct set of columns.
@kevincai @wyb Please share any further comments. I would like to get this merged if possible so we can start testing against StarRocks with this change included. My feeling is that the user should decide on `auto_detect_sample_rows` and `auto_detect_sample_files`; sampling just one row from one file may not be ideal for determining nullability or the correct number of columns.