[CARBONDATA-4263] Support query with latest segment
Support querying only the latest segment when the table's TBLPROPERTIES include "query_latest_segment".
Why is this PR needed?
Some scenarios look like this: the number of rows does not change, but the data in each column keeps increasing and changing. In such a scenario it is faster to load the full data each time using the load command, and a query then only needs to read the latest segment. A way is needed to make a table behave like this.
What changes were proposed in this PR?
Add a new table property, query_latest_segment. When it is set to 'true', queries read only the latest segment; when it is set to 'false' or not set, there is no impact.
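A minimal sketch of the intended usage, assuming Spark SQL with CarbonData. The table, columns, and file path are hypothetical; only the `query_latest_segment` property comes from this PR:

```scala
// Hypothetical table using the new property; names are illustrative.
spark.sql(
  """CREATE TABLE sensor_snapshot (id INT, value DOUBLE)
    |STORED AS carbondata
    |TBLPROPERTIES ('query_latest_segment' = 'true')""".stripMargin)

// Each full load creates a new segment. With the property set to 'true',
// the query below should scan only the most recently loaded segment.
spark.sql("LOAD DATA INPATH '/tmp/snapshot.csv' INTO TABLE sensor_snapshot")
spark.sql("SELECT * FROM sensor_snapshot").show()
```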
Does this PR introduce any user interface change?
- No
Is any new testcase added?
- Yes, a new LatestSegmentTestCases suite is added
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/211/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5806/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4063/
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4079/
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/227/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5824/
retest this please
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5825/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4080/
retest this please
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5826/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4081/
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/229/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4086/
Please, litao, check whether this solution can be made generic using the UDF 'insegment' that is already available in the code, exposing it to the user in the query statement rather than as a config property. @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai
retest this please
Hi brijoo, I checked the SEGMENT MANAGEMENT doc. That ability cannot meet the demand, and it gives me no way to add table-level configuration. Segment management is configured with SET, but not all tables need to query only the latest segment, and the business side does not know which queries should use the latest segment versus all segments. So I can't think of any method other than specifying the configuration when creating the table.
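For contrast, the existing session-level control from the segment management docs looks roughly like the sketch below (table name and segment id are illustrative). It applies only to the current session and requires knowing a concrete segment id, which is why it does not fit this use case:

```scala
// Existing session-scoped segment selection: it affects only the current
// session and must name concrete segment ids.
spark.sql("SET carbon.input.segments.default.sensor_snapshot = 5")
spark.sql("SELECT * FROM sensor_snapshot").show() // reads only segment 5

// Reset so later queries in this session see all segments again.
spark.sql("SET carbon.input.segments.default.sensor_snapshot = *")
```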
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5833/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4089/
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/236/
@MarvinLitt
- The new csv is not needed, as we already have several csv files in our resources; use any one of the existing ones (this is not needed if point 2 is done).
- Better to use insert instead of load, and there is no need for a filter condition; just use select *, because this is not related to any filter.
- I can see only 2-3 conditions in your test cases; better to add them to an existing test suite. Please avoid creating new test files unless absolutely necessary.
- I use the load command in the test case because there are many scenarios that use load; I hope to cover this command in the test case.
- The new csv file exists because latest-table-data.csv is not the same as the old one: I removed some values from some columns to check whether the latest segment is read correctly.
- Yes, of course the test cases can move to an existing test file; if needed, I will do that (see the sketch after this list).
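As an illustration of the kind of check under discussion, here is a sketch in the style of CarbonData's Scala test suites. It assumes the usual QueryTest helpers (`sql`, `checkAnswer`, `resourcesPath`), and the file names and expected count are hypothetical:

```scala
import org.apache.spark.sql.Row

test("query reads only the latest segment when query_latest_segment is true") {
  sql("DROP TABLE IF EXISTS latest_seg")
  sql(
    """CREATE TABLE latest_seg (id INT, value STRING)
      |STORED AS carbondata
      |TBLPROPERTIES ('query_latest_segment' = 'true')""".stripMargin)
  // Two full loads create two segments; the files differ, so the result
  // reveals which segment was actually scanned.
  sql(s"LOAD DATA INPATH '$resourcesPath/data.csv' INTO TABLE latest_seg")
  sql(s"LOAD DATA INPATH '$resourcesPath/latest-table-data.csv' INTO TABLE latest_seg")
  // Expect only the rows from the second (latest) load.
  checkAnswer(
    sql("SELECT COUNT(*) FROM latest_seg"),
    Seq(Row(10L))) // 10 = hypothetical row count of latest-table-data.csv
}
```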
@kunal642
Can we check why "overwrite data" is much slower than "load data"?
It is obvious: overwrite needs to check the data already in the segments while load does not, so in principle there is a huge performance gap between overwrite and load. If we use insert overwrite from a select * over another table, the csv data has to be loaded as a temp table and then fully selected, all of which may take more time. This scenario is very special, and performance is the key point. In addition, if insert overwrite is used, it will consume more time when querying. @QiangCai
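For reference, the two paths being compared might look like the following sketch (paths and names are illustrative, not from this PR):

```scala
// Path used today: load the CSV directly, appending a new segment.
spark.sql("LOAD DATA INPATH '/tmp/snapshot.csv' INTO TABLE sensor_snapshot")

// Suggested alternative: stage the CSV as a temp view, then overwrite.
// This reads the CSV, materializes it, and rewrites the whole table,
// which is where the extra time goes in this scenario.
spark.read.option("header", "true").csv("/tmp/snapshot.csv")
  .createOrReplaceTempView("staging")
spark.sql("INSERT OVERWRITE TABLE sensor_snapshot SELECT * FROM staging")
```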
If we can fix the performance issue of load overwrite, does it satisfy your requirement?
Yes, if we can make the "insert overwrite" command as quick as the "load" command, it will solve the problem. But achieving consistent performance may take some time and is difficult in the short term. In order not to lose customers, could we merge this PR first? When the insert overwrite performance work is done, it can be switched seamlessly. The customer doesn't focus on which commands are used, only on performance. What do you think, @QiangCai?
I suggest we locate the performance issue in INSERT OVERWRITE and fix it in the first place, instead of creating a patch solution which we may remove later, creating a compatibility problem.
I agree with jacky
I had a discussion with @MarvinLitt, and it seems that the performance issue in OVERWRITE was related to the environment; after the environment was fixed, the performance issue/degradation is no longer observed. @MarvinLitt will discuss in the community whether the requirement is still needed.