databend
feat(query): new implementation of analyze table
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
- The analyze command now merges incremental blocks into the HyperLogLog state stored in the table_statistics file.
- Support querying the incremental blocks of a fuse table (rows may be duplicated).
SELECT ...
FROM <fuse_table>
[ AT ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ]
[ SINCE ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ];
eg:
databend-local:) insert into abc select * from abc_random limit 3;
3 rows written in 0.031 sec. Processed 3 rows, 3 B (95.5 rows/s, 3.54 KiB/s)
databend-local:) select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504');
┌────────────────────────────────────────────────────────┐
│      a     │      b      │       c      │       d      │
│ Int32 NULL │ String NULL │ Boolean NULL │ Float64 NULL │
├────────────┼─────────────┼──────────────┼──────────────┤
│       NULL │ NULL        │ NULL         │ 0.0744611682 │
│       NULL │ NULL        │ NULL         │ 0.5569072503 │
│ 1621677414 │ NULL        │ true         │         NULL │
└────────────────────────────────────────────────────────┘
3 rows result in 0.041 sec. Processed 3 rows, 3 B (73.05 rows/s, 2.09 KiB/s)
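To illustrate why rows returned by `SINCE` may be duplicated, here is a minimal Python sketch of a block-level diff between two snapshots. It is a toy model, not Databend's implementation: the `Block` type, `block_id` field, and both helper functions are hypothetical names, and the "rewritten block" scenario is only one assumed way duplicates could appear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    block_id: str   # hypothetical identifier of a data block
    rows: tuple     # rows stored in this block


def blocks_since(base_snapshot, current_snapshot):
    """Blocks listed in the current snapshot but not in the base snapshot."""
    base_ids = {b.block_id for b in base_snapshot}
    return [b for b in current_snapshot if b.block_id not in base_ids]


def rows_since(base_snapshot, current_snapshot):
    """All rows of the incremental blocks; rows are NOT deduplicated."""
    return [row
            for block in blocks_since(base_snapshot, current_snapshot)
            for row in block.rows]


# Base snapshot holds one block with two rows.
base = [Block("b1", ((1,), (2,)))]

# Assumed scenario: a later write rewrites b1 into a new block b2 that
# carries the old rows plus three new ones.
current = [Block("b2", ((1,), (2,), (3,), (4,), (5,)))]

incremental = rows_since(base, current)
print(incremental)  # old rows (1,) and (2,) show up again
```

Because the diff works on whole blocks rather than on rows, any row that lives in a new block is returned, whether or not it already existed at the base snapshot.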
Todo in future:
- To account for mutations, we will introduce a health ratio in the snapshot; if it is too low, it is worth running a full table analyze to override the stats.
- Fixes #[Link the issue here]
Tests
- [ ] Unit Test
- [x] Logic Test
- [ ] Benchmark Test
- [ ] No Test - Explain why
Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Breaking Change (fix or feature that could cause existing functionality not to work as expected)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
What are the "semantics" of this query:
SELECT ... FROM <fuse_table> [ AT ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ] [ SINCE ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ];
e.g., will the query
select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504');
return data updated since 045e8ed9233245e692b8782039a2a504?
Yes, it's similar to a Stream table with a change type of ChangeType::Insert.
The result is not accurate (it does not consider the intersection between blocks).
But for merging HLL, duplicate data has few side effects: re-adding a value that the sketch has already seen leaves its registers unchanged.
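The point about duplicates can be checked with a small, self-contained HyperLogLog sketch in Python. This is a textbook-style toy, not Databend's HLL code: the SHA-1 based hash, the function names, and the sample data are all assumptions made for illustration; only `P = 12` comes from this thread.

```python
import hashlib
import math

P = 12          # precision mentioned in this thread
M = 1 << P      # number of registers (4096)


def _hash64(value: str) -> int:
    # Stable 64-bit hash; SHA-1 is an illustrative choice only.
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")


def hll_add(registers, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    registers[idx] = max(registers[idx], rank)


def hll_build(values):
    registers = [0] * M
    for v in values:
        hll_add(registers, v)
    return registers


def hll_merge(a, b):
    # Merging is an element-wise max, so it is idempotent.
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)        # small-range correction
    return raw


# Existing state covers v0..v999; the incremental block overlaps it.
state = hll_build(f"v{i}" for i in range(1000))
increment = hll_build(f"v{i}" for i in range(500, 1500))

merged_once = hll_merge(state, increment)
merged_twice = hll_merge(merged_once, increment)  # same rows merged again

print(merged_once == merged_twice)        # True: duplicates change nothing
print(round(hll_estimate(merged_once)))   # close to the 1500 distinct values
```

Since the merge is a per-register maximum, feeding the same rows in twice cannot move any register, which is why an over-inclusive incremental scan does not distort the distinct-count estimate.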
Can you add some data about DISTINCT_ERROR_RATE and the increase in disk size from adding the HLL fields?
Currently we are using rate = '0.01625', which corresponds to P = 12, so the number of registers is 2**12 = 4k; the average compressed size could be about 1k per column.
So it would take about 100KB for 100 columns in the statistics file. This file is only generated by the analyze statement, so it does not affect insert queries.
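The numbers above can be reproduced with a short Python calculation. It assumes the standard HyperLogLog error estimate of 1.04 / sqrt(m) for m registers; the 1 KiB per column figure is the rough compressed average quoted in this thread, not a measurement, and the function names are hypothetical.

```python
import math


def hll_relative_error(p: int) -> float:
    # Standard HyperLogLog estimate: 1.04 / sqrt(number of registers).
    return 1.04 / math.sqrt(2 ** p)


def stats_file_bytes(columns: int, bytes_per_column: int = 1024) -> int:
    # 1 KiB per column is the rough compressed average from the thread.
    return columns * bytes_per_column


print(2 ** 12)                        # 4096 registers for P = 12
print(hll_relative_error(12))         # about 0.01625
print(stats_file_bytes(100) / 1024)   # 100.0 (KiB for 100 columns)
```

Raising P by one doubles the register count and shrinks the error only by a factor of sqrt(2), so the per-column size grows faster than the accuracy improves.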