trino
Allow scans with the number of partitions exceeding the limit
Description
Treat such scans as scans over non-partitioned tables, without applying partition pruning. This is needed to avoid storing too much information in `HiveTableHandle`.
Non-technical explanation
Allow scanning more partitions than the limit per scan
Release notes
( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:
# Section
* Fix some things. ({issue}`issuenumber`)
An attempt to remove the limitation on the number of partitions per scan. This change doesn't remove the setting defining the limit; instead, when the number of partitions is higher than the limit, the system will try to avoid loading partitions eagerly. Certain optimizations (such as partition pruning) are not possible when partitions cannot be loaded in memory.
It also changes how `HiveSplitSource` works to minimize the memory footprint when the number of partitions is high.
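The lazy-enumeration idea can be sketched as follows. This is an illustrative mock, not the actual `HiveSplitSource` internals; the class and method names are made up:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Illustrative sketch: instead of materializing every partition up front
// (O(partitions) memory on the coordinator), splits are produced from an
// iterator that loads one partition's splits at a time, on demand.
class LazySplitSource {
    static Iterator<String> splits(Iterator<String> partitionNames,
                                   Function<String, List<String>> loadSplits) {
        return new Iterator<String>() {
            private Iterator<String> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Advance to the next partition only when the current one is drained
                while (!current.hasNext() && partitionNames.hasNext()) {
                    current = loadSplits.apply(partitionNames.next()).iterator();
                }
                return current.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}
```

Only one partition's worth of split metadata is resident at a time, which is why the coordinator footprint stays bounded even when the partition count is very high.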
@electrum @findepi @Praveen2112 Could you please take a look and let me know what you think?
`hive.max-partitions-per-scan` could be used by admins to block queries which scan a huge number of partitions. If we're going to allow running such queries now without partition pruning, that completely changes the meaning of that config, and it might not be the behaviour that an admin wants.
Could we define a separate config for the threshold on the number of partitions beyond which we want to avoid loading partitions eagerly, and use that to cap memory usage? If someone wants to allow queries on a large number of partitions, they can raise the limit on `hive.max-partitions-per-scan` in their config.
The `hive.max-partitions-per-scan` setting was added in 2016 by @electrum to avoid memory pressure on the coordinator. This change resolves the memory-pressure problem by avoiding enumerating partitions eagerly in one shot. Since the underlying problem is resolved, I wonder whether it is necessary to have a configured limit at all? Though I agree that the property name may need to be changed, and I'm open to suggestions.
One practical way this config is used is to prevent people from running queries which scan too many partitions (to control costs and to force users to add predicates on partition columns to get efficient queries).
For cost there's a better option now with `query.max-scan-physical-bytes` (and the `query_max_scan_physical_bytes` session property) - it was added in 339/341.
I don't think someone would still actually want to enforce limits based on the number of partitions, since that's very arbitrary, but I agree with @raunaqmorarka that we should not re-purpose existing configs.
Ideally we can mark the existing config as `@DefunctConfig` so that a proper error is thrown for people who have it set, and we can add a new config, if we want, for the new behaviour to act as a kill switch for some time while the new implementation is proven out.
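In airlift-based code the real `@DefunctConfig` is a class-level annotation on the config class that makes startup fail when a removed property is set. A minimal self-contained mimic of that behaviour (illustrative only; the real annotation lives in `io.airlift.configuration` and is processed by the config framework):

```java
import java.util.Map;
import java.util.Set;

// Self-contained mimic of @DefunctConfig behaviour: configuration keys
// listed as defunct are rejected with a clear error instead of being
// silently ignored.
class DefunctConfigCheck {
    private static final Set<String> DEFUNCT = Set.of("hive.max-partitions-per-scan");

    // Throws if any supplied property is defunct; returns the map otherwise.
    static Map<String, String> validate(Map<String, String> properties) {
        for (String key : properties.keySet()) {
            if (DEFUNCT.contains(key)) {
                throw new IllegalArgumentException(
                        "Configuration property '" + key + "' is no longer supported");
            }
        }
        return properties;
    }
}
```

The point of failing loudly is exactly what the comment asks for: operators who still set the old property get an error at startup rather than a limit that silently stopped applying.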
> I don't think someone would still actually want to enforce limits based on number of partitions since that's very arbitrary
Looking at configs like `hive.query-partition-filter-required`, the `hive.max-partitions-per-scan` is even more reasonable as a safety net preventing queries from accessing "too much".
cc @JamesRTaylor
> we should not re-purpose existing configs.
Not like this, yes
> Ideally we can mark existing config as @DefunctConfig so that proper error is thrown for people who have that set
Do we have any migration path? Do we need any?
> For cost there's the better option now with query.max-scan-physical-bytes

`query.max-scan-physical-bytes` is useful, but the problem with it is that the limit is enforced only after the specified amount of data has already been read. This could potentially waste a lot of resources before killing the query. `hive.max-partitions-per-scan` has the advantage of stopping the query before we spend worker resources scanning the data.
> The hive.max-partitions-per-scan setting was added in 2016 to avoid memory pressure on coordinator
I don't think users take into account the intent of the original author when using a config. In this case there is no documentation or description available to see that this had anything to do with memory. So anyone already using it can't possibly know that this was not meant to be used as a way to block queries touching a large number of partitions.
I think we should look into deprecating and removing `hive.max-partitions-per-scan` as a separate issue, if we're convinced that it's not useful anymore. Maybe start a discussion in Slack to see if there are people relying on its current behaviour.
We can have a new config for switching to lazy loading of partitions beyond some threshold, and assume that those who want to query a large number of partitions will raise the limit on `hive.max-partitions-per-scan` to a larger number to get past the errors thrown by it.
I'm also wondering if this trade-off of reducing coordinator memory usage but giving up on partition pruning makes sense. If someone wanted to run queries on a large number of partitions and coordinator memory was the limiting factor, why wouldn't they get a bigger coordinator instead of incurring the higher cost of running queries without partition pruning? It seems cheaper to upgrade one node than to consume a lot more resources on the workers.
Is it possible that we can still prune splits on the workers using the predicate on partition columns? E.g. for dynamic partition pruning we have `HivePageSourceProvider#shouldSkipSplit`; could we fall back to similar split pruning for static predicates as well, when partition pruning on the coordinator was skipped, to keep memory usage low?
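The fallback being suggested could, in principle, mirror the dynamic-filter check. A rough sketch, where the class name and shape of the predicate are illustrative rather than the actual `HivePageSourceProvider` code:

```java
import java.util.Map;
import java.util.Set;

// Illustrative worker-side pruning: if the coordinator skipped partition
// pruning, each worker can still drop a split whose partition key values
// fall outside the statically-allowed set for a partition column.
class StaticSplitPruner {
    // allowedValues: partition column -> values permitted by the query predicate
    static boolean shouldSkipSplit(Map<String, String> partitionKeys,
                                   Map<String, Set<String>> allowedValues) {
        for (Map.Entry<String, Set<String>> entry : allowedValues.entrySet()) {
            String value = partitionKeys.get(entry.getKey());
            if (value != null && !entry.getValue().contains(value)) {
                return true; // partition value excluded by predicate -> skip split
            }
        }
        return false;
    }
}
```

This would not save the metastore listing cost, but it would avoid spending worker I/O on partitions the predicate excludes.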
Thanks everybody for the feedback.
I updated the PR, preserving `hive.max-partitions-per-scan`. The `hive.max-partitions-per-scan` property can still be used to enforce a limit on the maximum number of partitions.
In addition, I introduced a new property, `hive.max-partitions-for-eager-load`. This property controls how many partitions can be loaded eagerly on the coordinator; it is set to 100000 by default.
This PR also makes it possible to scan tables that exceed the value set by `hive.max-partitions-for-eager-load`. This is done by refactoring `HiveSplitManager` to avoid loading partitions eagerly. However, when the number of partitions is higher, some optimizations (such as filter pushdown) will not be performed. While this is not ideal (and we should probably improve it), this PR is an incremental improvement that allows the engine to be used on tables with a high number of partitions, although with potentially reduced efficiency.
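Put together, a catalog configuration using both properties might look like this (the values shown are examples, apart from the stated 100000 default for the new property):

```properties
# Hard limit on partitions per table scan (behaviour unchanged by this PR):
# exceeding it still fails the query. Example value.
hive.max-partitions-per-scan=100000

# New in this PR: above this many partitions, the coordinator switches to
# lazy partition enumeration instead of eager loading. Default is 100000.
hive.max-partitions-for-eager-load=100000
```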
Please take another look.
Updated
Thanks for a very interesting feature. We were expecting this kind of feature to limit the maximum number of partitions.
- https://trinodb.slack.com/archives/CP1MUNEUX/p1646707920591449?thread_ts=1645502073.803079&cid=CP1MUNEUX
Before this, we were using Byte Buddy to hook the `HivePartitionManager` class to prevent users from running abusive queries and slowing down the cluster, by enforcing limits on the maximum number of partitions they can read in a single query.
We simply check and return an error if the number of partitions read is greater than the configured limit, for specific tables only (tables which we are sure have a HUGE amount of data per partition).
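The per-table guard described boils down to a check like the following. This is a reconstruction of the idea, not the commenter's actual code; the table name and limit are made up:

```java
import java.util.Map;

// Illustrative per-table guard: reject a scan whose partition count exceeds
// a limit configured for that specific table (e.g. tables known to hold
// huge amounts of data per partition). Tables without an entry are unlimited.
class PartitionLimitGuard {
    private final Map<String, Integer> perTableLimits;

    PartitionLimitGuard(Map<String, Integer> perTableLimits) {
        this.perTableLimits = perTableLimits;
    }

    // Called after partition enumeration; throws when the limit is exceeded.
    void check(String tableName, int partitionCount) {
        Integer limit = perTableLimits.get(tableName);
        if (limit != null && partitionCount > limit) {
            throw new IllegalStateException("Query over " + tableName + " reads "
                    + partitionCount + " partitions, limit is " + limit);
        }
    }
}
```

Interposing this via bytecode instrumentation works, but it is fragile across Trino upgrades, which is presumably why a built-in limit like `hive.max-partitions-per-scan` is attractive to such deployments.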