chore: tweak transient table data retention settings

Open dantengsky opened this issue 10 months ago • 4 comments

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Tweak transient table data retention settings

This PR introduces a new setting, transient_data_retention_time_in_minutes, to customize the retention period for transient tables. The setting defines how long historical data is retained, with a default value of 60 minutes (i.e. 1 hour).

Additionally, when purging data from transient tables, the retention period specified by transient_data_retention_time_in_minutes will now be utilized.

Setting transient_data_retention_time_in_minutes to 0 "restores" the behavior that transient tables had before this PR.
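
As a rough illustration of the retention rule described above, here is a minimal Python sketch. It is not Databend's implementation: `purge_cutoff` and `purgeable` are hypothetical names, and the list is assumed to hold only historical snapshot timestamps (the current snapshot is never a purge candidate).

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Default taken from the PR description: 60 minutes.
DEFAULT_RETENTION_MINUTES = 60


def purge_cutoff(now: datetime, retention_minutes: int = DEFAULT_RETENTION_MINUTES) -> datetime:
    """Return the timestamp before which historical data may be purged."""
    return now - timedelta(minutes=retention_minutes)


def purgeable(
    history: list[datetime],
    now: datetime,
    retention_minutes: int = DEFAULT_RETENTION_MINUTES,
) -> list[datetime]:
    """Hypothetical helper: historical snapshots that are older than the cutoff."""
    cutoff = purge_cutoff(now, retention_minutes)
    return [t for t in history if t < cutoff]


now = datetime(2024, 4, 26, 12, 0)
history = [now - timedelta(minutes=m) for m in (90, 30, 1)]

with_default = purgeable(history, now)    # only the 90-minute-old snapshot
with_zero = purgeable(history, now, 0)    # every historical snapshot
```

With a retention of 0 the cutoff equals the current time, so all history is immediately eligible for purging, which matches the pre-PR behavior described above.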

  • Fixes #[Link the issue here]

Tests

  • [ ] Unit Test
  • [ ] Logic Test
  • [ ] Benchmark Test
  • [x] No Test

Type of change

  • [ ] Bug Fix (non-breaking change which fixes an issue)
  • [x] New Feature (non-breaking change which adds functionality)
  • [ ] Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • [ ] Documentation Update
  • [ ] Refactoring
  • [ ] Performance Improvement
  • [ ] Other (please describe):

This change is Reviewable

dantengsky avatar Apr 26 '24 07:04 dantengsky

@dantengsky Hi~ I think this feature should be useful when doing real-time scenarios to avoid the growth of snapshot files. What is the progress so far, please?

cdmikechen avatar Jun 10 '24 13:06 cdmikechen

@dantengsky Hi~ I think this feature should be useful when doing real-time scenarios to avoid the growth of snapshot files. What is the progress so far, please?

Thanks for asking!

This PR aims to use a more conservative (longer) retention period when purging history for transient tables, instead of the current value of "0". Once merged, this should mean that transient tables will keep more historical data by default than they do now.

Currently, the smallest unit for the retention period is a day, which is a bit too large for transient tables.


Right now, transient table purging risks corrupting the target table when there are concurrent modifications (including append-only writes). Essentially, it might purge data belonging to pending transactions that could still commit successfully later.

Although this PR can mitigate the issue for now, it doesn't completely solve it. We need to further refine it (by checking the table's least visible timestamp at commit time) to fully fix the problem.
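
The race can be sketched as follows. This is illustrative Python, not Databend code: the `Block` type, `orphans_to_purge`, and the timestamps are assumptions made for this example. It assumes that a writer uploads data blocks before committing a snapshot that references them, so a purge with zero retention treats those blocks as unreferenced.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Block:
    name: str
    written_at: datetime


def orphans_to_purge(blocks, referenced, now, retention_minutes):
    """Blocks not referenced by the current snapshot and at least as old as the cutoff."""
    cutoff = now - timedelta(minutes=retention_minutes)
    return [b for b in blocks if b.name not in referenced and b.written_at <= cutoff]


now = datetime(2024, 6, 12, 7, 0)
committed = Block("b1", now - timedelta(hours=3))
# Written by a transaction that has not committed yet.
pending = Block("b2", now - timedelta(minutes=5))

blocks = [committed, pending]
referenced = {"b1"}  # the current snapshot only knows about b1

purged_with_zero = orphans_to_purge(blocks, referenced, now, 0)
purged_with_hour = orphans_to_purge(blocks, referenced, now, 60)
```

In this sketch, zero retention purges the pending block `b2`, while a 60-minute window keeps it. A transaction that stays pending for longer than the window would still be exposed, which is why the commit-time check is still needed.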

dantengsky avatar Jun 12 '24 07:06 dantengsky

Why not save the setting as a table option rather than a dynamic global setting?

create table t (c int) 
row_per_block = 100000
block_per_segment = 1000
data_retention_ttl_minutes = 600    --- this could be respected by vacuum command
recluster_schedule_interval = ..

...

sundy-li avatar Jun 12 '24 08:06 sundy-li

Why not save the settings into table option rather than a dynamic global setting. ....

Good idea. At a minimum, data_retention_time_in_days should be adjustable at the table level (or inherited from the database or account).
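
One way to read this suggestion is as a lookup chain in which the most specific level that defines a retention value wins. The Python below is a hypothetical sketch, not an existing Databend API: the function name, the dictionary-based options, and the fallback order (table, database, account, global) are assumptions drawn from this thread, and the key reuses the option name from the example above.

```python
def resolve_retention_minutes(
    table_opts,
    db_opts,
    account_opts,
    global_default,
    key="data_retention_ttl_minutes",
):
    """Return the retention from the most specific level that defines it."""
    for opts in (table_opts, db_opts, account_opts):
        if key in opts:
            return opts[key]
    # Nothing set at any level: fall back to the global setting.
    return global_default


table_level = resolve_retention_minutes({"data_retention_ttl_minutes": 600}, {}, {}, 60)
db_level = resolve_retention_minutes({}, {"data_retention_ttl_minutes": 120}, {}, 60)
fallback = resolve_retention_minutes({}, {}, {}, 60)
```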

dantengsky avatar Jun 13 '24 05:06 dantengsky