
Add new features in schema inference

Open Avogar opened this issue 2 years ago • 2 comments

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add new settings to control schema inference from text formats:

  • input_format_try_infer_dates - try to infer dates from strings.
  • input_format_try_infer_datetimes - try to infer datetimes from strings.
  • input_format_try_infer_integers - try to infer Int64 instead of Float64.
  • input_format_json_try_infer_numbers_from_strings - try to infer numbers from string values in JSON formats.

All these settings are enabled by default.

Examples:

:) desc format(JSONEachRow, '{"date" : "2020-01-01"}') settings input_format_try_infer_dates=1;

┌─name─┬─type───────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(Date) │              │                    │         │                  │                │
└──────┴────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

:) desc format(JSONEachRow, '{"date" : "2020-01-01 19:00:00"}') settings input_format_try_infer_datetimes=1

┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

:) desc format(JSONEachRow, '{"int" : 42}') settings input_format_try_infer_integers=1

┌─name─┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ int  │ Nullable(Int64) │              │                    │         │                  │                │
└──────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

:) desc format(JSONEachRow, '{"int" : "42"}') settings input_format_json_try_infer_numbers_from_strings=1, input_format_try_infer_integers=1

┌─name─┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ int  │ Nullable(Int64) │              │                    │         │                  │                │
└──────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
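
For contrast, the same queries can be run with the new settings switched off. This is a sketch, not verified output: per the descriptions above, the integer case should fall back to Float64, and the two string cases are assumed to stay as String because no other inference would reinterpret them. Date and datetime inference are disabled together in the first query so that the date string is not picked up by the datetime check.

```sql
-- Date-like string with date and datetime inference disabled
-- (assumption: the column stays a String).
desc format(JSONEachRow, '{"date" : "2020-01-01"}') settings input_format_try_infer_dates=0, input_format_try_infer_datetimes=0;

-- Integer literal with integer inference disabled
-- (per the changelog entry, Float64 is inferred instead of Int64).
desc format(JSONEachRow, '{"int" : 42}') settings input_format_try_infer_integers=0;

-- Quoted number with string-to-number inference disabled
-- (assumption: the column stays a String).
desc format(JSONEachRow, '{"int" : "42"}') settings input_format_json_try_infer_numbers_from_strings=0;
```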

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Avogar avatar Jul 13 '22 16:07 Avogar

Maybe we can also try to turn on some settings by default. @alexey-milovidov what do you think?

Avogar avatar Jul 13 '22 16:07 Avogar

Yes, let's do it. Let's turn on every setting that is mostly safe to use by default.

alexey-milovidov avatar Jul 13 '22 23:07 alexey-milovidov

Test failures are unrelated.

CurtizJ avatar Aug 10 '22 22:08 CurtizJ