materialize Suggest an index to the user when two dataflows are arranging the same thing

Feature request

If we have two active (edit: static) dataflows both arranging the same input, then we should suggest it to the user to index it. This could happen either by some continuous monitoring detecting this, or simply when the second dataflow gets created.

This was happening recently to a customer: https://materializeinc.slack.com/archives/C02PPB50ZHS/p1690371972579099?thread_ts=1690294151.692909&cid=C02PPB50ZHS

cc @ggnall

Edit: This could be considered a duplicate of https://github.com/MaterializeInc/materialize/issues/20878. I think originally I was thinking here to look only at inputs, while in the other issue to look at also larger plan fragments, but we might want to tackle these at the same time.

Blocked either by #20660 or https://github.com/MaterializeInc/materialize/issues/20652, see https://github.com/MaterializeInc/materialize/issues/20802#issuecomment-1678620791. Edit: https://github.com/MaterializeInc/materialize/issues/20652 is resolved now.

Note that we might want to do something similar not just for static dataflows but also for one-off selects, but I would consider that out of scope for this issue.

Edit 2: We shouldn't forget about the case when both arrangements are preceded by some non-trivial projections. Maybe in this case we should suggest also a projection, which would be the union of the existing ones.

Jul 26 '23 12:07 ggevay

(Currently blocked by the need to record query plans in the catalog.)

Aug 08 '23 12:08 ggevay

Blocked by: #20660

Aug 14 '23 21:08 ggnall

Yes, https://github.com/MaterializeInc/materialize/issues/20660 would be one way to unblock this. However, I'm thinking that instead we might want to build on @aalexandrov's work that will add query plans to the catalog (https://github.com/MaterializeInc/materialize/issues/20652). We'll have to think this through carefully whether that is indeed the better approach. Some possible issues/questions with building this on the query log:

We'd have to skip through a lot of one-off queries, which are irrelevant for this issue. Or will there be a fast way to ask the query log about only materialize views and indexes?
The query log might be sampled to some degree, so maybe we'd miss some relevant dataflows.
Will the query log contain very old queries, or will it have some retention period? Some relevant dataflow might have been installed a long time ago.

Aug 15 '23 08:08 ggevay

I see!

Why is this? Not following.

We'd have to skip through a lot of one-off queries, which are irrelevant for this issue

Not that I'm aware of. There was a recent request to add collection_ids, which may provide an index search? Thread @umanwizard

Or will there be a fast way to ask the query log about only materialize views and indexes?

Confirming this will be sampled.

The query log might be sampled to some degree, so maybe we'd miss some relevant dataflows.

Similarly, these will have a configurable retention period, e.g. 30 days.

Will the query log contain very old queries, or will it have some retention period?

Aug 15 '23 21:08 ggnall

We'd have to skip through a lot of one-off queries, which are irrelevant for this issue

Why is this? Not following.

Sorry, maybe I didn't make it clear in the issue description (I've edited now) that I intended this issue for dealing with static dataflows, e.g., dataflows for indexes or mat views. If we look only at static dataflows, then this issue seems straightforward (after we'll have access to the plans from the catalog), because it's true in almost all cases that I can think of that if two static dataflows are arranging the same thing, then an index would improve things.

Doing similar suggestions for one-off selects has more open questions. First of all, it's less clear that two one-off selects arranging the same thing mean that the user should create an index. This is because the index would use memory continuously, while the one-off selects might happen only sporadically. So it's not clear how many one-off selects (and over what time period) should arrange something before we give a hint that it should be indexed.

Aug 16 '23 13:08 ggevay

Agree on focusing scope! One-off selects are still a problem for development workflows and ad hoc analysis, but I think this is the right prioritization given that static dataflows would be the ones powering operational workloads.

I intended this issue for dealing with static dataflows, e.g., dataflows for indexes or mat views.

Aug 17 '23 14:08 ggnall

https://github.com/MaterializeInc/materialize/issues/20652 is resolved now, so this is good to go. I'll start working on it soon.

Aug 31 '23 19:08 ggevay

Related discussion in Slack.

TL;DR: having indexes on inputs that are consumed N times would also help with re-hydration spikes — instead of paying the persist overhead of fetching the input N times we would only pay it only once, reducing the memory required during re-hydration.

Oct 20 '23 11:10 aalexandrov

Putting some assorted thoughts here based on observations in customer environments.

TL;DR: a high-level policy might be:

Identify views that are used more than once.
Score higher views that are an MFP ending with a Reduce or TopK.
Recommend index creation for such views that are referenced more than some number k (for example k > 4).

Motivation:

A lot of customers run DBT models that are very heavy on views.
A lot of these views are doing basing JSON to relation unnesting and filtering + either aggregation or DISTINCT ON, so the optimized view plan will be an MFP with a Reduce or TopK on top.
If these views start being referenced multiple times it might make sense to recommend creating an index on them.

Feb 15 '24 16:02 aalexandrov

materialize materialize copied to clipboard

Suggest an index to the user when two dataflows are arranging the same thing

Feature request

materialize
materialize copied to clipboard