datachain icon indicating copy to clipboard operation
datachain copied to clipboard

Remove default value checks

Open ilongin opened this issue 10 months ago • 6 comments

Companion issue of https://github.com/iterative/studio/pull/11236 With making CH accept nullable columns, we no longer need special checks for default values in tests, as they will be None always.

ilongin avatar Jan 23 '25 00:01 ilongin

Codecov Report

:x: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review. :white_check_mark: Project coverage is 87.71%. Comparing base (86c2cb1) to head (8f01a7a). :warning: Report is 300 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/data_storage/db_engine.py 0.00% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #852      +/-   ##
==========================================
- Coverage   87.73%   87.71%   -0.02%     
==========================================
  Files         131      131              
  Lines       11804    11805       +1     
  Branches     1601     1601              
==========================================
- Hits        10356    10355       -1     
- Misses       1045     1047       +2     
  Partials      403      403              
Flag Coverage Δ
datachain 87.64% <66.66%> (-0.41%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jan 23 '25 00:01 codecov[bot]

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 8f01a7a
Status: ✅  Deploy successful!
Preview URL: https://13705299.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-11161-clickhouse-nul.datachain-documentation.pages.dev

View logs

Can we introduce something like strict mode in schema, similar to pydantic to support non-nullable types?

https://docs.pydantic.dev/latest/concepts/strict_mode/

skshetry avatar Jan 28 '25 10:01 skshetry

Can we introduce something like strict mode in schema, similar to pydantic to support non-nullable types?

https://docs.pydantic.dev/latest/concepts/strict_mode/

The thing here is actually this is not null vs not null constraint, but making sure we don't rely on Clickhouse defaults when we insert empty value (None).
Regarding nulls, Clickhouse is syntactically compatible with other relational databases but not semantically which means that if we define NOT NULL constraint on a column (default behavior before this PR) it will not reject data that has empty value, instead it will put default value for that column type, e.g int -> 0. This is because CH has emphasized for performance and analytical use cases and checking every single value for constraint would be slow. Because of this, we currently have 2 different behaviors in SaaS vs local:

  1. In SQLite we allow nulls everywhere which behaves as expected
  2. In Clickhouse we don't allow nulls anywhere, but CH behaves in a way to put default values instead

The simplest solution is to make all columns nullable in CH, which is what this (and Studio companion) PR is up to, as having those default values is not correct and expected from user perspective. The only drawback is performance implication as it could slow down queries by double, according to what people are saying.

Your suggestion goes one step further -> adding a new feature to the system for setting NULL constraints in user model fields. This is easy to do in SQLite, but in Clickhouse we need to add special constraints or maybe this setting although it seems like it's global and we need it for specific columns

Regarding the joins, as they were also mentioned multiple times in our calls, we have special global setting join_use_nulls set to 1 which means CH will convert all columns which get empty value in joins (e.g. after using outer join) to nullable (same as Studio companion PR is doing to ALL columns being used in CH).

ilongin avatar Feb 04 '25 14:02 ilongin

@ilongin what is the status of this?

shcheklein avatar Feb 25 '25 00:02 shcheklein

@ilongin what is the status of this?

@shcheklein It's ready for merge but I know we needed some more discussions / clarification in one of our meetings so I wrote this last comment (and put a link in slack). I would like to hear another opinion / confirmation before merging it.

ilongin avatar Feb 25 '25 09:02 ilongin