Better not to display partitions info for ParquetExec

Open yahoNanJing opened this issue 3 years ago • 4 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Suppose tens of thousands of files need to be scanned for a single SQL query. The partition info shown in the "Physical plan with metrics" then becomes too messy to read. It is also meaningless to show the partition info, because we already have the filename for the specific ParquetExec task.

Describe the solution you'd like

Therefore, it's better not to show the partition info for ParquetExec.

Describe alternatives you've considered

Additional context

yahoNanJing avatar Jul 26 '22 03:07 yahoNanJing

@alamb Do you think this is a reasonable change?

Ted-Jiang avatar Jul 26 '22 07:07 Ted-Jiang

Perhaps we can make this configurable using the new config mechanism

andygrove avatar Jul 26 '22 08:07 andygrove

FWIW, I'm attempting to use DataFusion to query AWS VPC Flow Logs, which produce thousands of parquet files per day, and that makes the explain plans impossible to read. Removing the partition list details or condensing the output to just the number of files being read would be great.

kmitchener avatar Jul 26 '22 15:07 kmitchener

@alamb Do you think this is a reasonable change ?

@Ted-Jiang I do -- perhaps by default the explain plans can summarize the information (e.g. print out the first few parquet files or something)

I like @andygrove's suggestion to make "show me the full details" a config option (with the default being to show only the file summary).

alamb avatar Jul 26 '22 20:07 alamb
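
To make the summarizing idea from this thread concrete, here is a minimal sketch of how a file list could be condensed for plan display. It is not DataFusion's code: the function name `summarize_files`, the `max_shown` limit, and the `show_all` flag (standing in for the proposed config option) are all hypothetical.

```rust
/// Hypothetical helper (not part of DataFusion): renders a file list for
/// plan display, either in full or as a short summary.
///
/// * `files`     - file names scanned by one plan node
/// * `max_shown` - assumed limit on how many names the summary prints
/// * `show_all`  - stands in for the proposed "full details" config option
fn summarize_files(files: &[&str], max_shown: usize, show_all: bool) -> String {
    // Print everything when asked to, or when the list is already short.
    if show_all || files.len() <= max_shown {
        return format!("[{}]", files.join(", "));
    }
    // With a limit of zero, report only the count.
    if max_shown == 0 {
        return format!("[{} files]", files.len());
    }
    // Otherwise print the first few names followed by the total count.
    let shown = files[..max_shown].join(", ");
    format!("[{}, ... ({} files total)]", shown, files.len())
}
```

With four files and `max_shown = 2`, the default output would be `[a.parquet, b.parquet, ... (4 files total)]`, while `show_all = true` prints every name. Defaulting to the summary keeps explain output readable for scans over thousands of files, and the flag preserves the full listing for debugging.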