[Enhancement] Allow disabling type auto-detection on CSV files

Open rad-pat opened this issue 2 weeks ago • 10 comments

Why I'm doing:

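When importing CSV, column types are inferred from a sample rather than the whole file, so rows outside the sample can hold values that do not fit the inferred type and cause type errors when inserting from FILES(). Workarounds include setting the conflicting data to null, which is not ideal.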

What I'm doing:

This PR allows the user to disable type auto-detection and return all columns as strings, which can then be converted to the desired data types with SQL functions.
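
To illustrate the problem and the proposed switch, here is a minimal Python sketch. It is not the StarRocks implementation: `infer_type`, `detect_schema`, and `row_fits` are hypothetical helpers, and the "all integers means BIGINT, otherwise VARCHAR" rule is an assumption made only to keep the example small.

```python
def infer_type(values):
    """Hypothetical rule: BIGINT if every sampled value parses as an int, else VARCHAR."""
    try:
        for value in values:
            int(value)
        return "BIGINT"
    except ValueError:
        return "VARCHAR"


def detect_schema(rows, sample_rows, auto_detect_types=True):
    """Return one type name per column, looking only at the first sample_rows rows."""
    columns = list(zip(*rows[:sample_rows]))
    if not auto_detect_types:
        # Detection disabled: every column is reported as a string.
        return ["VARCHAR"] * len(columns)
    return [infer_type(column) for column in columns]


def row_fits(schema, row):
    """Check whether every value in row can be loaded under the given schema."""
    for type_name, value in zip(schema, row):
        if type_name == "BIGINT":
            try:
                int(value)
            except ValueError:
                return False
    return True


# The third row breaks the integer pattern seen in the first two rows.
rows = [["1", "10"], ["2", "20"], ["3", "n/a"]]

inferred = detect_schema(rows, sample_rows=2)
as_strings = detect_schema(rows, sample_rows=2, auto_detect_types=False)

print(inferred, row_fits(inferred, rows[2]))      # ['BIGINT', 'BIGINT'] False
print(as_strings, row_fits(as_strings, rows[2]))  # ['VARCHAR', 'VARCHAR'] True
```

With detection off, the unsampled row loads as text, and the conversion to the final type is left to the query.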

Fixes #66473

What type of PR is this:

  • [ ] BugFix
  • [ ] Feature
  • [x] Enhancement
  • [ ] Refactor
  • [ ] UT
  • [ ] Doc
  • [ ] Tool

Does this PR entail a change in behavior?

  • [ ] Yes, this PR will result in a change in behavior.
  • [x] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • [ ] Parameter changes: default values, similar parameters but with different default values
  • [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
  • [ ] Feature removed
  • [ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • [x] I have added test cases for my bug fix or my new feature
  • [x] This pr needs user documentation (for new or modified features or behaviors)
    • [x] I have added documentation for my new feature or new function
  • [ ] This is a backport pr

Bugfix cherry-pick branch check:

  • [x] I have checked the version labels which the pr will be auto-backported to the target branch
    • [x] 4.0
    • [x] 3.5
    • [x] 3.4
    • [x] 3.3

[!NOTE] Adds an auto_detect_types CSV option to bypass type inference (treat sampled columns as STRING), wired FE→Thrift→BE with tests and docs.

  • Schema detection (CSV):
    • Add auto_detect_types property in FILES() to control type inference (default true).
    • If false, BE returns VARCHAR for all sampled columns.
  • Frontend (FE):
    • Parse and validate auto_detect_types in TableFunctionTable and pass via TBrokerScanRangeParams.schema_sample_types.
  • Thrift:
    • Add TBrokerScanRangeParams.schema_sample_types (default true).
  • Backend (BE):
    • Update CSVScanner schema sampling to respect schema_sample_types.
  • Tests:
    • FE: unit tests for auto_detect_types parsing.
    • BE: schema tests for both enabled/disabled cases; add type_sniff.csv fixture.
  • Docs:
    • EN/JA/ZH: document auto_detect_types and behavior when disabled.
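
The FE part of the summary above (parse, validate, pass on) can be sketched roughly as follows. This is Python for illustration only, not the Java code in TableFunctionTable. The function names are hypothetical, a plain dict stands in for TBrokerScanRangeParams, and accepting only "true"/"false" (case-insensitive) is an assumption, since the actual validation rules are not shown in this thread.

```python
def parse_auto_detect_types(properties):
    """Read the auto_detect_types property; default to True when it is absent."""
    raw = properties.get("auto_detect_types")
    if raw is None:
        return True
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    # Assumed behavior: reject anything that is not a boolean literal.
    raise ValueError(f"auto_detect_types must be 'true' or 'false', got {raw!r}")


def build_scan_params(properties):
    """Copy the parsed flag into a dict standing in for TBrokerScanRangeParams."""
    return {"schema_sample_types": parse_auto_detect_types(properties)}


print(build_scan_params({}))                              # {'schema_sample_types': True}
print(build_scan_params({"auto_detect_types": "false"}))  # {'schema_sample_types': False}
```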

Written by Cursor Bugbot for commit 88e6cce0b605a7bde467c7aff08a7138aeb90c6e.

rad-pat avatar Dec 09 '25 09:12 rad-pat

🧪 CI Insights

Here's what we observed from your CI run for 88e6cce0.

🟢 All jobs passed!

mergify[bot] avatar Dec 09 '25 09:12 mergify[bot]

@cursor review

alvin-celerdata avatar Dec 09 '25 17:12 alvin-celerdata

@cursor review

alvin-celerdata avatar Dec 10 '25 17:12 alvin-celerdata

@cursor review

alvin-celerdata avatar Dec 11 '25 00:12 alvin-celerdata

@cursor review

alvin-celerdata avatar Dec 11 '25 15:12 alvin-celerdata

[Java-Extensions Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] avatar Dec 11 '25 23:12 github-actions[bot]

[FE Incremental Coverage Report]

:white_check_mark: pass : 10 / 10 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| :large_blue_circle: com/starrocks/catalog/TableFunctionTable.java | 10 | 10 | 100.00% | [] |

github-actions[bot] avatar Dec 11 '25 23:12 github-actions[bot]

[BE Incremental Coverage Report]

:white_check_mark: pass : 7 / 7 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| :large_blue_circle: be/src/exec/file_scanner/csv_scanner.cpp | 7 | 7 | 100.00% | [] |

github-actions[bot] avatar Dec 11 '25 23:12 github-actions[bot]

@cursor review

alvin-celerdata avatar Dec 12 '25 04:12 alvin-celerdata

> @rad-pat Overall looks good.
>
> One more step is needed: if auto_detect_types is turned off, BE should skip schema sampling completely.

@kevincai We still need to know the number of columns though, right?

rad-pat avatar Dec 13 '25 07:12 rad-pat

> @rad-pat Overall looks good. One more step is needed: if auto_detect_types is turned off, BE should skip schema sampling completely.
>
> @kevincai We still need to know the number of columns though, right?

@rad-pat Yeah, you are right. Because type auto-detection is disabled, the sampling can be as simple as one line from any file: auto_detect_sample_rows and auto_detect_sample_files can both be overridden to 1. This reduces the overhead to a minimum when type detection is off.

kevincai avatar Dec 13 '25 11:12 kevincai

@kevincai, having thought about this further, it probably only makes sense to sample just one row. There could be a case where the user is loading CSV files that sometimes have fewer columns, and if such a file is the one sampled, not all columns would be available.

e.g. CSV1, four columns

a,b,c,d
1,2,3,4

CSV2, six columns

a,b,c,d,e,f
1,2,3,4,5,6

If only CSV1 is sampled, only four columns are available; if CSV2 is sampled, all six are. If both files are sampled, we should also get all six columns. So it is probably best to let the user decide how many files to sample, but sample only one row from each. Does that make sense?
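
A minimal Python sketch of the CSV1/CSV2 scenario, sampling one row per file. The helper names are hypothetical, and taking the widest sampled row as the column count is an assumption about how schemas from several files would be combined, not something confirmed by the BE code.

```python
import csv
import io


def first_row_width(csv_text):
    """Number of fields in the first row of one CSV file's text (0 if empty)."""
    first = next(csv.reader(io.StringIO(csv_text)), None)
    return 0 if first is None else len(first)


def sampled_column_count(files, sample_files):
    """Widest first row among the first sample_files files (assumed merge rule)."""
    return max((first_row_width(text) for text in files[:sample_files]), default=0)


csv1 = "a,b,c,d\n1,2,3,4\n"
csv2 = "a,b,c,d,e,f\n1,2,3,4,5,6\n"

print(sampled_column_count([csv1, csv2], sample_files=1))  # 4
print(sampled_column_count([csv2, csv1], sample_files=1))  # 6
print(sampled_column_count([csv1, csv2], sample_files=2))  # 6
```

With a single sampled file the result depends on which file happens to be picked, which is the argument for leaving the file count to the user.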

rad-pat avatar Dec 15 '25 10:12 rad-pat

> @kevincai We still need to know the number of columns though, right?
>
> @rad-pat Yeah, you are right. Because type auto-detection is disabled, the sampling can be as simple as one line from any file: auto_detect_sample_rows and auto_detect_sample_files can both be overridden to 1. This reduces the overhead to a minimum when type detection is off.

Actually, maybe both should be left for the user to set. Sampling more rows determines whether a column may be nullable; sampling more files gets the correct set of columns.
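
The effect of the row count on nullability can be sketched like this in Python. `column_nullable` is a hypothetical helper, and treating an empty or missing field as NULL is an assumption made for illustration; the thread does not say how the BE decides nullability.

```python
def column_nullable(rows, sample_rows):
    """Per column: True if any sampled row has an empty or missing field there."""
    sample = rows[:sample_rows]
    width = max((len(row) for row in sample), default=0)
    return [
        any(index >= len(row) or row[index] == "" for row in sample)
        for index in range(width)
    ]


# The second row has an empty value in the second column.
rows = [["1", "x"], ["2", ""], ["3", "z"]]

print(column_nullable(rows, sample_rows=1))  # [False, False]
print(column_nullable(rows, sample_rows=3))  # [False, True]
```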

rad-pat avatar Dec 15 '25 11:12 rad-pat

@kevincai @wyb Any further comments, please? I would like to get this merged if possible so that we can start testing against StarRocks with this change in. My feeling is that the user should decide on auto_detect_sample_rows and auto_detect_sample_files: sampling just one row from one file may not be ideal for determining nullability or the correct number of columns.

rad-pat avatar Dec 16 '25 10:12 rad-pat