materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Develop coherent strategy for exposing historical data in internal tables

Open umanwizard opened this issue 1 year ago • 10 comments

It has become clear that the pattern of using internal tables as a historical record of events is common in practice. For example, the historical record of resource usage metrics, a log of sink and source errors, and a log of cluster status changes (OOMs, crashes, etc.) either are already available in the web console or will be soon.

This historical data is exposed in at least two ways today:

  1. Only the most recent value is kept in the current version of the table, and the table's retention window is increased to the desired value (currently 30 days). Users must use time-travel queries (SELECT ... AS OF, SUBSCRIBE ... AS OF, etc.). The user should consider the differential-dataflow timestamp field to be the timestamp of the event or measurement.
  2. The historical log of values is available in the current version of the table, and the timestamp is stored as a field of the data. The user may use normal default-timestamp queries to inspect the historical record.

The main advantages of the second approach are that it does not require knowledge of exotic features in order to use, and it also allows further SQL transformation of the result set including timestamps (e.g., bucketing), whereas the first approach requires such logic to be done on the client. The main advantage of the first approach is that it doesn't require ad-hoc compaction/cleanup logic (since Differential already does this for us). Furthermore, any machinery developed to support this approach can be straightforwardly extended to help real customers/users that have workflows that depend on historical queries. One possible mitigation of the difficulties with approach 1 is a CHANGES operator that would surface the changes to a relation as a relation that could then be further transformed. I am not sure whether we will ever have this feature or how difficult it is to implement.

This issue is intended to centralize discussion about these two methods; the ultimate goal is to settle on one to be used throughout the codebase.

umanwizard avatar Apr 18 '23 17:04 umanwizard