materialize
materialize copied to clipboard
Suggest an index to the user when two dataflows are arranging the same thing
Feature request
If we have two active (edit: static) dataflows both arranging the same input, then we should suggest it to the user to index it. This could happen either by some continuous monitoring detecting this, or simply when the second dataflow gets created.
This was happening recently to a customer: https://materializeinc.slack.com/archives/C02PPB50ZHS/p1690371972579099?thread_ts=1690294151.692909&cid=C02PPB50ZHS
cc @ggnall
Edit: This could be considered a duplicate of https://github.com/MaterializeInc/materialize/issues/20878. I think originally I was thinking here to look only at inputs, while in the other issue to look at also larger plan fragments, but we might want to tackle these at the same time.
Blocked either by #20660 or https://github.com/MaterializeInc/materialize/issues/20652, see https://github.com/MaterializeInc/materialize/issues/20802#issuecomment-1678620791. Edit: https://github.com/MaterializeInc/materialize/issues/20652 is resolved now.
Note that we might want to do something similar not just for static dataflows but also for one-off selects, but I would consider that out of scope for this issue.
Edit 2: We shouldn't forget about the case when both arrangements are preceded by some non-trivial projections. Maybe in this case we should suggest also a projection, which would be the union of the existing ones.
(Currently blocked by the need to record query plans in the catalog.)
Blocked by: #20660
Yes, https://github.com/MaterializeInc/materialize/issues/20660 would be one way to unblock this. However, I'm thinking that instead we might want to build on @aalexandrov's work that will add query plans to the catalog (https://github.com/MaterializeInc/materialize/issues/20652). We'll have to think this through carefully whether that is indeed the better approach. Some possible issues/questions with building this on the query log:
- We'd have to skip through a lot of one-off queries, which are irrelevant for this issue. Or will there be a fast way to ask the query log about only materialize views and indexes?
- The query log might be sampled to some degree, so maybe we'd miss some relevant dataflows.
- Will the query log contain very old queries, or will it have some retention period? Some relevant dataflow might have been installed a long time ago.
I see!
Why is this? Not following.
We'd have to skip through a lot of one-off queries, which are irrelevant for this issue
Not that I'm aware of. There was a recent request to add collection_id
s, which may provide an index search? Thread @umanwizard
Or will there be a fast way to ask the query log about only materialize views and indexes?
Confirming this will be sampled.
The query log might be sampled to some degree, so maybe we'd miss some relevant dataflows.
Similarly, these will have a configurable retention period, e.g. 30 days.
Will the query log contain very old queries, or will it have some retention period?
We'd have to skip through a lot of one-off queries, which are irrelevant for this issue
Why is this? Not following.
Sorry, maybe I didn't make it clear in the issue description (I've edited now) that I intended this issue for dealing with static dataflows, e.g., dataflows for indexes or mat views. If we look only at static dataflows, then this issue seems straightforward (after we'll have access to the plans from the catalog), because it's true in almost all cases that I can think of that if two static dataflows are arranging the same thing, then an index would improve things.
Doing similar suggestions for one-off selects has more open questions. First of all, it's less clear that two one-off selects arranging the same thing mean that the user should create an index. This is because the index would use memory continuously, while the one-off selects might happen only sporadically. So it's not clear how many one-off selects (and over what time period) should arrange something before we give a hint that it should be indexed.
Agree on focusing scope! One-off selects are still a problem for development workflows and ad hoc analysis, but I think this is the right prioritization given that static dataflows would be the ones powering operational workloads.
I intended this issue for dealing with static dataflows, e.g., dataflows for indexes or mat views.
https://github.com/MaterializeInc/materialize/issues/20652 is resolved now, so this is good to go. I'll start working on it soon.
TL;DR: having indexes on inputs that are consumed N
times would also help with re-hydration spikes — instead of paying the persist
overhead of fetching the input N
times we would only pay it only once, reducing the memory required during re-hydration.
Putting some assorted thoughts here based on observations in customer environments.
TL;DR: a high-level policy might be:
- Identify views that are used more than once.
- Score higher views that are an MFP ending with a
Reduce
orTopK
. - Recommend index creation for such views that are referenced more than some number
k
(for examplek > 4
).
Motivation:
- A lot of customers run DBT models that are very heavy on views.
- A lot of these views are doing basing JSON to relation unnesting and filtering + either aggregation or
DISTINCT ON
, so the optimized view plan will be an MFP with aReduce
orTopK
on top. - If these views start being referenced multiple times it might make sense to recommend creating an index on them.