carbondata [CARBONDATA-3770] Improve partition count star query performance

Why is this PR needed?

To improve the performance of pure partition count star queries.

Currently, a count(*) query whose filter columns are all partition columns loads the datamaps of the matching partitions, including block info and min/max info. Loading them is unnecessary: the row count is stored inside the index files, so we can read it directly from the valid index files and cache it.
For a no-sort partition table, min/max info is of almost no use, yet loading it costs time.
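
To make the term concrete, here is a minimal sketch of what "count(*) with a filter whose columns are all partition columns" could mean as a check. It is illustrative only: the function name and its arguments are hypothetical and are not taken from the CarbonData code base, and treating a query with no filter as not qualifying is an assumption.

```python
def is_pure_partition_count_star(select_exprs, filter_columns, partition_columns):
    """Return True when the query is count(*) and every filter column is a
    partition column. All names here are hypothetical, for illustration."""
    # The projection must be exactly one count(*) expression.
    is_count_star = [e.strip().lower() for e in select_exprs] == ["count(*)"]
    filters = {c.lower() for c in filter_columns}
    partitions = {c.lower() for c in partition_columns}
    # Assumption: a query without any filter is not treated as this case.
    return is_count_star and bool(filters) and filters <= partitions
```

For example, with partition columns `dt` and `region`, a filter on `dt` alone qualifies, while a filter on `dt` and a non-partition column `name` does not.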

What changes were proposed in this PR?

The query flow for a pure partition count star is as follows:
Step 1. Check from the filter whether the query is a pure partition count star.
Step 2. Read tablestatus to get all valid segments, and remove the segment file cache entries of invalid and expired segments.
Step 3. Use multiple threads to read the segment files that are not in the cache, and cache each segment's index file list in memory. If a segment's index files already exist in the cache, they are not read again.
Step 4. Use multiple threads to prune by segment and partition to get the pruned index file list; this prunes most index files and reduces the number of files to read.
Step 5. Read the row count directly from each pruned index file and cache it in the index_file <-> rowCount map; if an entry already exists in the map, get it from the cache.
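
Steps 2 to 5 can be sketched roughly as below. This is a simplified illustration and not the PR's implementation: the class `PartitionCountCache`, the two reader callbacks, and the `(partition, index_file)` representation are all hypothetical stand-ins for reading segment files and index files.

```python
from concurrent.futures import ThreadPoolExecutor


class PartitionCountCache:
    """Hypothetical sketch of the cached count(*) flow (steps 2 to 5)."""

    def __init__(self, read_segment_file, read_row_count, max_workers=4):
        # read_segment_file(segment) -> iterable of (partition, index_file)
        # read_row_count(index_file) -> int
        # Both callbacks are assumptions standing in for real file reads.
        self._read_segment_file = read_segment_file
        self._read_row_count = read_row_count
        self._max_workers = max_workers
        self.segment_index_files = {}  # segment -> [(partition, index_file)]
        self.row_counts = {}           # index_file -> rowCount

    def count(self, valid_segments, wanted_partitions):
        valid = set(valid_segments)
        wanted = set(wanted_partitions)

        # Step 2: drop cache entries of segments that are no longer valid.
        for segment in list(self.segment_index_files):
            if segment not in valid:
                del self.segment_index_files[segment]

        missing = [s for s in valid if s not in self.segment_index_files]
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            # Step 3: read only the uncached segment files, in parallel.
            results = pool.map(self._read_segment_file, missing)
            for segment, entries in zip(missing, results):
                self.segment_index_files[segment] = list(entries)

            # Step 4: prune each segment's index files by partition, in parallel.
            def prune(segment):
                entries = self.segment_index_files[segment]
                return [f for partition, f in entries if partition in wanted]

            pruned = [f for files in pool.map(prune, list(valid)) for f in files]

        # Step 5: read the row count per pruned index file, reusing the cache.
        total = 0
        for index_file in pruned:
            if index_file not in self.row_counts:
                self.row_counts[index_file] = self._read_row_count(index_file)
            total += self.row_counts[index_file]
        return total
```

In this sketch a repeated query over the same partitions touches neither the segment files nor the index files again; only segments or index files missing from the two maps trigger a read.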

Does this PR introduce any user interface change?

No

Is any new testcase added?

No

Apr 09 '20 08:04 Zhangshunyu

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/977/

Apr 09 '20 10:04 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2690/

Apr 09 '20 11:04 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/980/

Apr 09 '20 13:04 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2693/

Apr 09 '20 14:04 CarbonDataQA1

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/994/

Apr 10 '20 12:04 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2707/

Apr 10 '20 12:04 CarbonDataQA1

@Zhangshunyu : What do you mean by "pure partition"? Is it just a normal "partition"? Also, please mention in the description what the bottleneck was before.

Apr 11 '20 03:04 ajantha-bhat

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1000/

Apr 11 '20 05:04 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2712/

Apr 11 '20 06:04 CarbonDataQA1

@ajantha-bhat We found that select count(*) on some partitions is time-consuming and slower than Parquet. Currently, a count(*) query whose filter columns are all partition columns loads all datamaps of the matching partitions, including block info and min/max info. Loading them is unnecessary: the row count is stored inside the index files, so we can use partition pruning to read it directly from the valid index files, and we can cache it. For a no-sort partition table, min/max info is of almost no use, yet loading it costs time.

Apr 13 '20 02:04 Zhangshunyu

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2714/

Apr 13 '20 05:04 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1002/

Apr 13 '20 05:04 CarbonDataQA1

@Indhumathi27 : We already match partitions first, before loading min/max (your old PR). Was that done only for the select * flow, and not for the count(*) flow?

Apr 13 '20 07:04 ajantha-bhat

@ajantha-bhat It was done for select with a filter on partition columns, where we load the index for the matched partitions.

Apr 13 '20 07:04 Indhumathi27

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2722/

Apr 13 '20 11:04 CarbonDataQA1

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1010/

Apr 13 '20 11:04 CarbonDataQA1

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1033/

Apr 15 '20 08:04 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2746/

Apr 15 '20 09:04 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2781/

Apr 18 '20 09:04 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1068/

Apr 18 '20 09:04 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1733/

Jul 23 '20 07:07 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3475/

Jul 23 '20 08:07 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3476/

Jul 23 '20 10:07 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1734/

Jul 23 '20 10:07 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2354/

Sep 16 '20 08:09 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4095/

Sep 16 '20 08:09 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4096/

Sep 16 '20 11:09 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2355/

Sep 16 '20 11:09 CarbonDataQA1
