
SQL: Use timestamp_floor when granularity is not safe.


Fixes #13182.

PR #12944 added a check at the execution layer to avoid materializing excessive numbers of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors, by switching certain plans to use the timestamp_floor function instead of granularities. This applies both to the Timeseries query type and to the GroupBy timestampResultFieldGranularity feature.

The patch also goes one step further: we switch to timestamp_floor not only in the ETERNITY + non-ALL case, but also whenever the estimated number of time-granular buckets exceeds 100,000. (Even a one-month interval grouped by second, for example, comes to about 2.6 million buckets.)

Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature.

gianm avatar Oct 11 '22 03:10 gianm

There are two better solutions that I didn't pursue in this patch, since they are more complex:

  1. Improving the query execution logic so the granularity feature doesn't incur overhead for buckets that aren't needed. The idea here would be to let the cursor drive the bucketing, rather than letting the bucketing drive the cursor.
  2. Improving the performance of timestamp_floor so that granularity is not needed at all. The SQL planner would then always generate GroupBy with timestamp_floor: no Timeseries queries, and no GroupBy-with-granularity.

Personally I think path (2) is the best one for the future. That being said, there is a need to have these queries execute properly today, hence the present patch.

gianm avatar Oct 11 '22 03:10 gianm

Thanks for the review @rohangarg. I pushed up some changes based on your comments.

gianm avatar Oct 11 '22 21:10 gianm

The idea here would be to let the cursor drive the bucketing, rather than letting the bucketing drive the cursor.

This seems like a good longer-term solution. Can we go further? Any reason to create buckets that will be empty? If we're doing time-based grouping, and data is time ordered, we can create buckets on the fly as we read the data. This is the classic streaming grouping solution: one doesn't normally enumerate all possible customers, say, to group sales by customer.

There would have to be code to handle empty groups, if we want to fill empty time slots. But this can be done by enumerating the time buckets as we return results: if the next result falls beyond the current time bucket, return zero rows until we catch up. For this, the set of buckets need not be materialized, just enumerated.
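To make the streaming approach concrete, here is a minimal Java sketch of grouping time-ordered rows into fixed-width buckets on the fly, enumerating (never materializing) empty buckets only when zero-fill is requested. All names are hypothetical; this is not Druid's engine code, just an illustration of the idea.

```java
import java.util.Iterator;

/** Sketch: streaming time-bucket grouping over rows sorted by timestamp. */
class StreamingTimeGrouper
{
  private final long bucketMillis; // fixed bucket width, e.g. 60_000 for minutes
  private final boolean zeroFill;  // whether to emit empty buckets

  StreamingTimeGrouper(long bucketMillis, boolean zeroFill)
  {
    this.bucketMillis = bucketMillis;
    this.zeroFill = zeroFill;
  }

  /** Consumes ascending timestamps and emits one (bucketStart, count) row per bucket. */
  void group(Iterator<Long> ascendingTimestamps)
  {
    long currentBucket = Long.MIN_VALUE; // sentinel: no bucket open yet
    long count = 0;
    while (ascendingTimestamps.hasNext()) {
      long t = ascendingTimestamps.next();
      long bucket = Math.floorDiv(t, bucketMillis) * bucketMillis;
      if (bucket != currentBucket) {
        if (currentBucket != Long.MIN_VALUE) {
          emit(currentBucket, count);
          if (zeroFill) {
            // Enumerate, but never materialize, the empty buckets in the gap.
            for (long b = currentBucket + bucketMillis; b < bucket; b += bucketMillis) {
              emit(b, 0);
            }
          }
        }
        currentBucket = bucket;
        count = 0;
      }
      count++;
    }
    if (currentBucket != Long.MIN_VALUE) {
      emit(currentBucket, count);
    }
  }

  private void emit(long bucketStart, long count)
  {
    System.out.printf("bucket=%d count=%d%n", bucketStart, count);
  }
}
```

Feeding it timestamps 0, 1, 130000 with bucketMillis = 60000 and zeroFill enabled would emit counts 2, 0, 1 for buckets 0, 60000, and 120000: only buckets touched by data, plus the gaps between results, are ever visited.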

Improving the performance of timestamp_floor...

For fixed-length intervals (week or less), the time floor should be a simple mod-then-subtract. For variable-length intervals, the logic is more complex. Can we split the implementations for those two cases? Week-or-less is super fast; month-or-more pays the extra compute cost?
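A minimal sketch of that split, assuming fixed-width periods are expressed in milliseconds; the class and method names are illustrative, not Druid's actual API:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

/** Sketch: fast arithmetic floor for fixed-width periods, calendar math otherwise. */
final class TimeFloorSketch
{
  /** Fast path (week or less): the period has a fixed millisecond width. */
  static long floorFixed(long timestampMillis, long periodMillis, long originMillis)
  {
    // The "mod-then-subtract": subtract the offset into the current period.
    long offset = Math.floorMod(timestampMillis - originMillis, periodMillis);
    return timestampMillis - offset;
  }

  /** Slow path (month or more): period widths vary, so use calendar arithmetic. */
  static long floorToMonthUtc(long timestampMillis)
  {
    ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
    return t.withDayOfMonth(1).truncatedTo(ChronoUnit.DAYS).toInstant().toEpochMilli();
  }
}
```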

paul-rogers avatar Oct 12 '22 23:10 paul-rogers

Can we go further? Any reason to create buckets that will be empty? If we're doing time-based grouping, and data is time ordered, we can create buckets on the fly as we read the data.

That's exactly what I meant by "let the cursor drive the bucketing". I meant only generate buckets based on timestamps we actually see. (Unless the user specifically requests that empty buckets be zero-filled.)

For this, the set of buckets need not be materialized, just enumerated.

In general, today, we don't materialize the buckets; we just enumerate them. The issue is that even with this approach, the time it takes to enumerate the buckets can be prohibitively long. (People file bugs saying that queries "hang". They don't actually hang, but it seems that way to the user, due to the sheer number of buckets being enumerated.) By letting the cursor drive the bucketing, we can avoid this completely for the case where we aren't zero-filling. The zero-fill case still poses an issue, but we could address that some other way; perhaps a limit on the number of zero-filled buckets?

For fixed-length intervals (week or less), the time floor should be a simple mod-then-subtract. For variable-length intervals, the logic is more complex. Can we split the implementations for those two cases? Week-or-less is super fast; month-or-more pays the extra compute cost?

Ah. In practice, the main perf hit isn't the evaluation speed of the timestamp_floor function itself; it already has logic to fast-path common granularities in the way you mention. The bigger issue is that we don't take advantage of two facts: segments are sorted by __time, and the timestamp_floor function is monotonic. Two things we should do in order to get perf matching the way we handle granularity:

  • the group-by engine should do a streaming aggregation when the first group-by dimension is timestamp_floor(__time, ...)
  • the computation of timestamp_floor should take advantage of monotonicity: when we encounter a timestamp, we can compute the start and end of its bucket. The start can then be reused as the return value of timestamp_floor for every value of __time up until the end; there is no need to actually compute the function for each row. (A sketch follows below.)
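A minimal sketch of that second point, assuming rows arrive in ascending __time order. A fixed-width period is used for brevity (the caching matters even more for variable-width periods, where computing bucket bounds is costlier), and the names are hypothetical:

```java
/** Sketch: exploit monotonicity by caching the current bucket's [start, end). */
final class CachedBucketFloor
{
  private final long periodMillis;
  private long bucketStart = Long.MIN_VALUE;
  private long bucketEnd = Long.MIN_VALUE; // forces a computation on the first row

  CachedBucketFloor(long periodMillis)
  {
    this.periodMillis = periodMillis;
  }

  /** Callers must supply ascending timestamps. */
  long floor(long timestampMillis)
  {
    if (timestampMillis >= bucketEnd) {
      // Compute the floor once per bucket rather than once per row.
      bucketStart = Math.floorDiv(timestampMillis, periodMillis) * periodMillis;
      bucketEnd = bucketStart + periodMillis;
    }
    return bucketStart; // reused for every row up until bucketEnd
  }
}
```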

gianm avatar Oct 16 '22 19:10 gianm