[CARBONDATA-4263] Support query with latest segment
Support querying only the latest segment when the table's TBLPROPERTIES include "query_latest_segment".
Why is this PR needed?
Some scenarios look like this: the number of rows does not change, but the data in each column keeps increasing and changing. In such a scenario it is faster to load the full data each time using the load command, and a query then only needs to read the latest segment. A way is needed to make a table behave like this.
What changes were proposed in this PR?
Add a new table property, query_latest_segment. When it is set to 'true', queries read only the latest segment; when it is set to 'false' or not set, there is no impact.
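A minimal sketch of the intended usage, assuming Spark SQL with CarbonData. The table, columns, and file path are hypothetical; only the `query_latest_segment` property comes from this PR:

```scala
// Hypothetical table using the new property; names are illustrative.
spark.sql(
  """CREATE TABLE sensor_snapshot (id INT, value DOUBLE)
    |STORED AS carbondata
    |TBLPROPERTIES ('query_latest_segment' = 'true')""".stripMargin)

// Each full load creates a new segment. With the property set to 'true',
// the query below should scan only the most recently loaded segment.
spark.sql("LOAD DATA INPATH '/tmp/snapshot.csv' INTO TABLE sensor_snapshot")
spark.sql("SELECT * FROM sensor_snapshot").show()
```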
Does this PR introduce any user interface change?
- No
Is any new testcase added?
- Yes, a new LatestSegmentTestCases suite is added
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/211/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5806/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4063/
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4079/
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/227/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5824/
retest this please
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5825/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4080/
retest this please
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5826/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4081/
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/229/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4086/
Please, litao, check whether this solution can be made generic using the UDF 'insegment' that is already available in the code, exposing it to the user in the query statement rather than as a config property. @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai
retest this please
Hi brijoo, I checked the SEGMENT MANAGEMENT doc. That ability cannot meet the demand, and it gives me no way to add table-level configuration. Segment management is configured with SET, but not all tables need to query only the latest segment, and the business side does not know which queries should use the latest segment versus all segments. So I can't think of any method other than specifying the configuration when creating the table.
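For contrast, the existing session-level control from the segment management docs looks roughly like the sketch below (table name and segment id are illustrative). It applies only to the current session and requires knowing a concrete segment id, which is why it does not fit this use case:

```scala
// Existing session-scoped segment selection: it affects only the current
// session and must name concrete segment ids.
spark.sql("SET carbon.input.segments.default.sensor_snapshot = 5")
spark.sql("SELECT * FROM sensor_snapshot").show() // reads only segment 5

// Reset so later queries in this session see all segments again.
spark.sql("SET carbon.input.segments.default.sensor_snapshot = *")
```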
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5833/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4089/
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/236/
@MarvinLitt
- The new csv is not needed, as we already have several csv files in our resources; use any one of the existing ones (this is not needed if point 2 is done).
- Better to use insert instead of load, and there is no need for a filter condition; just use select *, because this is not related to any filter.
- I can see only 2-3 conditions in your test cases; better to add them to an existing test suite. Please avoid creating new test files unless absolutely necessary.
- I use the load command in the test case because there are many scenarios that use load; I hope to cover this command in the test case.
- The new csv file exists because latest-table-data.csv is not the same as the old one: I removed some values from some columns to check whether the latest segment is read correctly.
- Yes, of course the test cases can move to an existing test file; if needed, I will do that (see the sketch after this list).
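As an illustration of the kind of check under discussion, here is a sketch in the style of CarbonData's Scala test suites. It assumes the usual QueryTest helpers (`sql`, `checkAnswer`, `resourcesPath`), and the file names and expected count are hypothetical:

```scala
import org.apache.spark.sql.Row

test("query reads only the latest segment when query_latest_segment is true") {
  sql("DROP TABLE IF EXISTS latest_seg")
  sql(
    """CREATE TABLE latest_seg (id INT, value STRING)
      |STORED AS carbondata
      |TBLPROPERTIES ('query_latest_segment' = 'true')""".stripMargin)
  // Two full loads create two segments; the files differ, so the result
  // reveals which segment was actually scanned.
  sql(s"LOAD DATA INPATH '$resourcesPath/data.csv' INTO TABLE latest_seg")
  sql(s"LOAD DATA INPATH '$resourcesPath/latest-table-data.csv' INTO TABLE latest_seg")
  // Expect only the rows from the second (latest) load.
  checkAnswer(
    sql("SELECT COUNT(*) FROM latest_seg"),
    Seq(Row(10L))) // 10 = hypothetical row count of latest-table-data.csv
}
```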
@kunal642
Can we check why "overwrite data" is much slower than "load data"?
It is obvious: overwrite needs to check the data already in the segments while load does not, so in principle there is a huge performance gap between overwrite and load. If we use insert overwrite from a select * over another table, the csv data has to be loaded as a temp table and then fully selected, all of which may take more time. This scenario is very special, and performance is the key point. In addition, if insert overwrite is used, it will consume more time when querying. @QiangCai
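For reference, the two paths being compared might look like the following sketch (paths and names are illustrative, not from this PR):

```scala
// Path used today: load the CSV directly, appending a new segment.
spark.sql("LOAD DATA INPATH '/tmp/snapshot.csv' INTO TABLE sensor_snapshot")

// Suggested alternative: stage the CSV as a temp view, then overwrite.
// This reads the CSV, materializes it, and rewrites the whole table,
// which is where the extra time goes in this scenario.
spark.read.option("header", "true").csv("/tmp/snapshot.csv")
  .createOrReplaceTempView("staging")
spark.sql("INSERT OVERWRITE TABLE sensor_snapshot SELECT * FROM staging")
```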
If we can fix the performance issue of load overwrite, does it satisfy your requirement?
Yes, if we can make the "insert overwrite" command as quick as the "load" command, it will solve the problem. But achieving consistent performance may take some time and is difficult in the short term. In order not to lose customers, could we merge this PR first? When the insert overwrite performance work is done, it can be switched seamlessly. The customer doesn't focus on which commands are used, only on performance. What do you think, @QiangCai?
I suggest we locate the performance issue in INSERT OVERWRITE and fix it in the first place, instead of creating a patch solution which we may remove later, creating a compatibility problem.
I agree with jacky
I had a discussion with @MarvinLitt, and it seems that the performance issue in OVERWRITE was related to the environment; after the environment was fixed, the performance issue/degradation is no longer observed. @MarvinLitt will discuss in the community whether the requirement is still needed.