Add back config to toggle the preservation of timestamps in consolidated fragments

Open ypatia opened this issue 7 months ago • 4 comments

https://github.com/TileDB-Inc/TileDB/pull/3267 erroneously removed the option to perform consolidation without timestamps [sc-18605]. This PR restores that config, because the option not to retain timestamps when consolidating is still valuable in many cases:

  • heavily modified datasets (many updates) which want to reclaim space and do not need time-travel
  • code that wants to benefit from cell count metadata on the fragments (and needs no-dups semantics)
  • the potential future ability to improve read performance of no-dup arrays by optimizing the reader (e.g., using the sparse allows-dups reader)
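To make the trade-off in the list above concrete, here is a toy model in plain Python. It does not use the TileDB API; `Cell`, `consolidate`, and the `with_timestamps` flag are hypothetical names chosen for illustration. The sketch assumes that consolidating with timestamps keeps every written version of a cell, while consolidating without timestamps keeps only the newest version per coordinate.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    coord: int       # cell coordinate (1-D for simplicity)
    timestamp: int   # write time of the fragment the cell came from
    value: str


def consolidate(fragments, with_timestamps):
    """Toy consolidation: merge several fragments into one list of cells."""
    cells = [cell for fragment in fragments for cell in fragment]
    if with_timestamps:
        # Every version survives, so time-travel stays possible,
        # but no space is reclaimed and coordinates may repeat.
        return sorted(cells, key=lambda c: (c.coord, c.timestamp))
    # Only the newest version per coordinate survives.
    latest = {}
    for cell in cells:
        kept = latest.get(cell.coord)
        if kept is None or cell.timestamp > kept.timestamp:
            latest[cell.coord] = cell
    return [latest[coord] for coord in sorted(latest)]


fragments = [
    [Cell(1, 10, "a"), Cell(2, 10, "b")],
    [Cell(1, 20, "a2")],
    [Cell(2, 30, "b2"), Cell(3, 30, "c")],
]

with_ts = consolidate(fragments, with_timestamps=True)
without_ts = consolidate(fragments, with_timestamps=False)
print(len(with_ts), len(without_ts))
```

In this model the fragment consolidated without timestamps has exactly one cell per coordinate, so its cell count can be used directly as the number of logical cells, which is the property the second bullet relies on.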

Note: I have left the default configuration as "with timestamps", but I have also implemented and tested changing the default to "without timestamps" and everything works fine. I can toggle the default to whatever the requirement is very easily. I just don't know what the requirement is :)

Fixes [CORE-134]


TYPE: IMPROVEMENT
DESC: Add back config to toggle the preservation of timestamps in consolidated fragments

ypatia avatar May 21 '25 14:05 ypatia

Thanks!

I believe we should make a plan to disable this feature by default and eventually remove it. As a concept, consolidation is fundamentally incompatible with time travelling and adding a timestamps pseudo-attribute was a failed attempt to reconcile them.

I don't know the background story on why this feature was introduced in the first place; I can imagine, though, that it was requested to meet some need. Whether the way it has been implemented is optimal is another story. However, I agree in principle that consolidating mostly means "I don't care about time-traveling", which is why I asked to clarify what the default value of this config toggle should be. If in 95% of cases we don't want timestamps in consolidated fragments, then it would make sense to change the default value to false. cc @ihnorton

ypatia avatar May 27 '25 08:05 ypatia

Consolidation is not only a question of data retention, but also data management. There might be several reasons that DBAs would prefer to have one larger consolidated fragment versus a larger number of non-consolidated smaller fragments. I am not a DBA so I can't really enumerate them.

From the perspective of a query engine there is a difference:

If you have a query which wants half of your fragments, and you are not consolidated, then you have to merge coordinates from all of the fragments.

If you have a query which wants half of your fragments, and you are consolidated, then instead you have to de-duplicate a single stream on the max timestamp per coordinate.

I would expect merge to be a lot more resource-intensive than a single-stream de-duplicate. Imagine the effort required to parallelize them, for example.
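The two read paths compared above can be sketched in plain Python. This is a conceptual model only, not how the TileDB readers are implemented; the function names and the `(coord, timestamp, value)` tuple layout are assumptions made for the example. The unconsolidated path performs a k-way merge over per-fragment sorted streams, while the consolidated path makes one pass over a single stream and keeps the newest visible version per coordinate.

```python
import heapq
from itertools import groupby

# Each cell is (coord, timestamp, value). Every fragment is sorted by coord,
# and all cells of a fragment share that fragment's write timestamp.
fragments = [
    [(1, 10, "a"), (2, 10, "b")],
    [(1, 20, "a2")],
    [(2, 30, "b2"), (3, 30, "c")],
]


def read_unconsolidated(fragments, t):
    """K-way merge of all fragments written at or before time t."""
    visible = [[c for c in frag if c[1] <= t] for frag in fragments]
    merged = heapq.merge(*visible, key=lambda c: (c[0], c[1]))
    # Within each coordinate group the last cell has the largest timestamp.
    return {
        coord: list(group)[-1][2]
        for coord, group in groupby(merged, key=lambda c: c[0])
    }


def read_consolidated(stream, t):
    """Single pass: keep the newest cell per coordinate with timestamp <= t."""
    best = {}
    for coord, ts, value in stream:
        if ts <= t and (coord not in best or ts > best[coord][0]):
            best[coord] = (ts, value)
    return {coord: value for coord, (ts, value) in best.items()}


# A fragment consolidated *with* timestamps: one stream holding all versions.
consolidated = sorted(c for frag in fragments for c in frag)

print(read_unconsolidated(fragments, 20))
print(read_consolidated(consolidated, 20))
```

Both functions return the same answer for any read time; the difference is that the first one has to coordinate as many input streams as there are fragments, while the second only filters one.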

"If 95% of the cases we don't want to have timestamps in consolidated fragments then it'd make sense to change the default value to false"

I'm a bit leery of this. The upshot of false is performance; the downside is what a customer doing the wrong thing would perceive as data loss.
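A toy illustration of the "perceived data loss" scenario, again in plain Python rather than the TileDB API. All names are hypothetical, and the visibility rule is a modelling assumption: a fragment consolidated without per-cell timestamps is treated as all-or-nothing, visible only once the read time reaches the end of its time range. The sketch also assumes the original fragments have been removed after consolidation.

```python
# Each original fragment is (write_timestamp, {coord: value}).
original = [
    (10, {1: "a", 2: "b"}),
    (20, {1: "a2"}),
    (30, {2: "b2", 3: "c"}),
]


def read_original(fragments, t):
    """Apply every fragment written at or before t, oldest first."""
    result = {}
    for ts, cells in sorted(fragments, key=lambda f: f[0]):
        if ts <= t:
            result.update(cells)
    return result


def consolidate_without_timestamps(fragments):
    """Keep the newest value per coord; remember only the overall time range."""
    ordered = sorted(fragments, key=lambda f: f[0])
    latest = {}
    for _, cells in ordered:
        latest.update(cells)
    return (ordered[0][0], ordered[-1][0]), latest


def read_consolidated(fragment, t):
    """Model assumption: the fragment is visible only if its whole range is <= t."""
    (_, end), cells = fragment
    return dict(cells) if end <= t else {}


consolidated = consolidate_without_timestamps(original)

print(read_original(original, 20))          # state as of t=20 before consolidation
print(read_consolidated(consolidated, 20))  # same read once originals are gone
print(read_consolidated(consolidated, 30))  # the latest state is still intact
```

Under these assumptions a user who consolidated without timestamps and then opened the array at an earlier time would find the historical state gone, even though the latest state is fully preserved.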

rroelke avatar May 27 '25 18:05 rroelke

If you consolidate with timestamps, and then decide you want to purge old data, does it work to re-run consolidation in the "purge" mode?

rroelke avatar May 27 '25 18:05 rroelke

The underlying goal here is to optimize an array for efficient reads with time-traveling across the fragment history, and that remains a requirement which we will continue to support. The implementation may evolve or change to better realize the usage requirements.

ihnorton avatar May 28 '25 03:05 ihnorton