datafusion
Better not to display partitions info for ParquetExec
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suppose tens of thousands of files need to be scanned for a single SQL query. The partition info shown in the "Physical plan with metrics" output becomes too messy to read. It is also meaningless to show the partition info, because we already have the filename for each specific ParquetExec task.
Describe the solution you'd like
Therefore, it's better not to show the partition info.
Describe alternatives you've considered
Additional context
@alamb Do you think this is a reasonable change?
Perhaps we can make this configurable using the new config mechanism
FWIW, attempting to use DataFusion to query AWS VPC Flow Logs, which produce thousands of Parquet files per day, makes explain plans impossible to read. Removing the partition list details, or condensing the output to just the number of files being read, would be great.
@alamb Do you think this is a reasonable change?
@Ted-Jiang I do -- perhaps by default the explain plans can summarize the information (e.g. print out the first few parquet files or something)
I like @andygrove's suggestion to make "show me the full details" a config option (one that defaults to showing only the file summary).
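To make the suggestion above concrete, here is a minimal sketch of what a summarized file listing could look like. It is not DataFusion code: `summarize_files`, `max_shown`, and `verbose` are hypothetical names invented for illustration, and the `files=[...]` output format is an assumption, not the actual ParquetExec display format. The `verbose` flag stands in for the proposed config option.

```rust
use std::fmt::Write;

/// Hypothetical helper (not part of DataFusion's API): renders a file list
/// for an explain plan, showing at most `max_shown` names unless `verbose`
/// is set, and appending a count of the files that were left out.
fn summarize_files(files: &[String], max_shown: usize, verbose: bool) -> String {
    // In verbose mode every file is listed; otherwise cap the list.
    let shown = if verbose {
        files.len()
    } else {
        files.len().min(max_shown)
    };

    let mut out = String::new();
    // Writing to a String cannot fail, so unwrap is safe here.
    write!(out, "files=[{}", files[..shown].join(", ")).unwrap();
    if shown < files.len() {
        write!(out, ", ... ({} more)", files.len() - shown).unwrap();
    }
    out.push(']');
    out
}
```

With five files `a.parquet` through `e.parquet`, `summarize_files(&files, 2, false)` yields `files=[a.parquet, b.parquet, ... (3 more)]`, while passing `verbose = true` lists all five. Defaulting to the summary keeps plans readable for scans over thousands of files, and the full list remains available on request.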