databend feat(query): new implementation of analyze table

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

The ANALYZE command now merges incremental blocks into the HyperLogLog state stored in the table_statistics file.
This PR also adds support for querying the incremental blocks of a fuse table (rows may be duplicated).

SELECT ...
FROM <fuse_table>
[ AT ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ] 
[ SINCE ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ];

eg:

databend-local:) insert into abc select * from abc_random limit 3;
3 rows written in 0.031 sec. Processed 3 rows, 3 B (95.5 rows/s, 3.54 KiB/s)

databend-local:) select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504');
┌────────────────────────────────────────────────────────┐
│      a     │      b      │       c      │       d      │
│ Int32 NULL │ String NULL │ Boolean NULL │ Float64 NULL │
├────────────┼─────────────┼──────────────┼──────────────┤
│ NULL       │ NULL        │ NULL         │ 0.0744611682 │
│ NULL       │ NULL        │ NULL         │ 0.5569072503 │
│ 1621677414 │ NULL        │ true         │ NULL         │
└────────────────────────────────────────────────────────┘
3 rows result in 0.041 sec. Processed 3 rows, 3 B (73.05 rows/s, 2.09 KiB/s)
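
One way to read the SINCE clause is as a block-level difference between two snapshots. The sketch below is only an illustration under that assumption: the dictionaries, block ids and the `blocks_since` helper are hypothetical and are not Databend's API. It shows why rows may be duplicated when a later snapshot contains a block that rewrites older rows.

```python
def blocks_since(base, current):
    """Return the blocks present in `current` but absent from `base`."""
    return {bid: rows for bid, rows in current.items() if bid not in base}

# Hypothetical base snapshot: two blocks holding three rows.
base = {"b1": [1, 2], "b2": [3]}

# Hypothetical later snapshot: b1 and b2 were rewritten into b3 together
# with one new row, and b4 was appended.
current = {"b3": [1, 2, 3, 4], "b4": [5]}

incremental = blocks_since(base, current)
rows = sorted(r for block in incremental.values() for r in block)

# Only rows 4 and 5 are new, but rows 1, 2 and 3 come back as well,
# because the difference is taken per block, not per row.
print(rows)
```

Under this reading, a consumer that needs exact row-level changes would have to deduplicate, while a consumer that only merges sketches can use the blocks directly.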

Todo in future:

To account for mutations, we will introduce a healthy ratio in the snapshot; if the ratio is too low, it is worth doing a full table analyze to override the stats.

Fixes #[Link the issue here]

Tests

[ ] Unit Test
[x] Logic Test
[ ] Benchmark Test
[ ] No Test - Explain why

Type of change

[ ] Bug Fix (non-breaking change which fixes an issue)
[x] New Feature (non-breaking change which adds functionality)
[ ] Breaking Change (fix or feature that could cause existing functionality not to work as expected)
[ ] Documentation Update
[x] Refactoring
[ ] Performance Improvement
[ ] Other (please describe):


Feb 23 '24 15:02 sundy-li

What are the semantics of this query:

SELECT ... FROM <fuse_table> [ AT ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ] [ SINCE ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ];

e.g.

Will the query select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504'); return the data updated since 045e8ed9233245e692b8782039a2a504?

Feb 25 '24 13:02 dantengsky

Will the query select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504'); return the data updated since 045e8ed9233245e692b8782039a2a504?

Yes, it's similar to a Stream table with a change type of ChangeType::Insert.

The result is not accurate (it does not consider the intersection between blocks).

But for merging HLL, duplicate data does not have many side effects.
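
The reason duplicates are harmless can be shown with a toy HyperLogLog: each register keeps a maximum, so adding the same value twice, or merging overlapping sketches, leaves the registers unchanged. This is a minimal sketch for illustration only; the hash function, register layout and helper names are assumptions and do not reflect Databend's implementation.

```python
import hashlib

P = 12            # precision mentioned later in this thread
M = 1 << P        # number of registers
REST_BITS = 64 - P

def _hash64(value):
    # Illustrative 64-bit hash taken from SHA-256.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def add(sketch, value):
    h = _hash64(value)
    idx = h >> REST_BITS                       # top P bits pick the register
    rest = h & ((1 << REST_BITS) - 1)          # remaining bits
    rank = REST_BITS - rest.bit_length() + 1   # position of the leftmost 1-bit
    sketch[idx] = max(sketch[idx], rank)

def sketch_of(values):
    sketch = [0] * M
    for v in values:
        add(sketch, v)
    return sketch

def merge(a, b):
    # Merging is an element-wise maximum, so it is idempotent.
    return [max(x, y) for x, y in zip(a, b)]

old_block = [f"row-{i}" for i in range(1000)]
new_block = [f"row-{i}" for i in range(500, 1500)]   # overlaps old_block

merged = merge(sketch_of(old_block), sketch_of(new_block))
exact = sketch_of(set(old_block) | set(new_block))

print(merged == exact)                                # overlap changes nothing
print(merge(merged, sketch_of(new_block)) == merged)  # re-merging is a no-op
```

Because the merged registers equal those built from the deduplicated union, the distinct-count estimate derived from them is the same whether or not the incremental blocks repeat rows.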

Feb 25 '24 13:02 sundy-li

Can you add some data about DISTINCT_ERROR_RATE and the increase in disk size when adding the HLL fields?

Feb 27 '24 01:02 lichuang

Can you add some data about DISTINCT_ERROR_RATE and the increase in disk size when adding the HLL fields?

Currently we are using rate = '0.01625'. With P = 12, the sketch has 2**12 = 4k registers, and the average compressed size could be about 1KB.

So it will take about 100KB for 100 columns in the statistics file. This file is only generated by the analyze statement, so it does not affect insert queries.
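
The numbers above can be checked with a few lines of arithmetic. The snippet assumes the standard HyperLogLog relative error of 1.04 / sqrt(m) for m registers; the 1KB average compressed size is the figure quoted in the comment, not something computed here.

```python
import math

P = 12                      # precision quoted in the comment
registers = 2 ** P          # 4096 registers ("4k")

# Standard HyperLogLog relative standard error: 1.04 / sqrt(m).
error_rate = 1.04 / math.sqrt(registers)

compressed_kb_per_column = 1   # average compressed size quoted above
columns = 100
total_kb = columns * compressed_kb_per_column

print(registers, error_rate, total_kb)
```

With 4096 registers the square root is 64, so the error rate is 1.04 / 64 = 0.01625, which matches the configured rate.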

Feb 27 '24 01:02 sundy-li
