[CARBONDATA-3864] Store Size Optimization

Open Indhumathi27 opened this issue 4 years ago • 82 comments

Why is this PR needed?

Currently, CarbonData supports adaptive encoding for measure types to keep the store size to a minimum. Dimension types such as STRING, VARCHAR and BINARY are stored as LV (length-value) byte arrays, and local/direct dictionary values are stored in binary format. Applying adaptive encoding to the length part of the LV byte arrays of dimension types, and to local dictionary values, can reduce the store size.

What changes were proposed in this PR?

The following changes are made in this PR:

  1. String/Varchar/Binary datatype store size optimization:
     -> On data write: Currently, the length is stored as a Short for String and as an Int for Varchar/Binary, which increases the store size. To reduce it, adaptive encoding is applied to the length part irrespective of the type, so String and Varchar no longer need separate handling during processing.
     -> On query: Support decoding adaptive-encoded dimension types (String/Varchar/Binary) in both the vector-fill and non-vector flows.

  2. Local/Direct dictionary:
     -> On data write: Support adaptive encoding for dictionary values.
     -> On query: Decode the adaptive-encoded dictionary values.

  3. Refactor methods from adaptive class to column page.
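
The idea behind change 1 can be illustrated with a minimal sketch. It assumes that "adaptive" means picking the narrowest integer width that fits the largest length in a page, and it is not CarbonData's actual codec: the class name `AdaptiveLengthSketch`, its methods, and the one-byte width header are all hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not CarbonData code: store the length part of LV
// byte arrays with the narrowest width that fits the largest length.
final class AdaptiveLengthSketch {

  // Smallest width in bytes (1 to 4) that can hold every length in the page.
  static int bytesPerLength(int[] lengths) {
    int max = 0;
    for (int len : lengths) {
      if (len < 0) {
        throw new IllegalArgumentException("negative length: " + len);
      }
      max = Math.max(max, len);
    }
    if (max <= 0xFF) return 1;
    if (max <= 0xFFFF) return 2;
    if (max <= 0xFFFFFF) return 3;
    return 4;
  }

  // Layout (an assumption of this sketch): one header byte holding the
  // width, then each length written big-endian using that many bytes.
  static byte[] encodeLengths(int[] lengths) {
    int width = bytesPerLength(lengths);
    ByteBuffer out = ByteBuffer.allocate(1 + lengths.length * width);
    out.put((byte) width);
    for (int len : lengths) {
      for (int shift = (width - 1) * 8; shift >= 0; shift -= 8) {
        out.put((byte) (len >>> shift));
      }
    }
    return out.array();
  }

  // Reads the width header, then rebuilds `count` lengths from the bytes.
  static int[] decodeLengths(byte[] encoded, int count) {
    ByteBuffer in = ByteBuffer.wrap(encoded);
    int width = in.get();
    int[] lengths = new int[count];
    for (int i = 0; i < count; i++) {
      int len = 0;
      for (int b = 0; b < width; b++) {
        len = (len << 8) | (in.get() & 0xFF);
      }
      lengths[i] = len;
    }
    return lengths;
  }
}
```

In this sketch, a page whose lengths are 3, 120 and 7 is encoded in 4 bytes (one header byte plus one byte per length), compared with 6 bytes if each length were stored as a fixed 2-byte Short. Because the width depends only on the values, the same path serves String, Varchar and Binary columns.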

Store size comparison with master:

TPCH data (57 GB raw data), Lineitem table:
Sort type: NO_SORT
Local Dictionary Enabled: false
Raw Data Size: 39.5 GB

Master | With PR
12.44 GB | 11.37 GB

For a table containing more string columns:
Number of string columns: 75
Sort type: NO_SORT
Local Dictionary Enabled: false
Raw Data Size: 115 GB

Master | With PR
32.71 GB | 31.69 GB

NOTE: In the above dataset, some string columns contain null values.

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • No

Indhumathi27 avatar Jun 10 '20 06:06 Indhumathi27

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1419/

CarbonDataQA1 avatar Jun 10 '20 06:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3143/

CarbonDataQA1 avatar Jun 10 '20 06:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1420/

CarbonDataQA1 avatar Jun 10 '20 18:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3144/

CarbonDataQA1 avatar Jun 10 '20 18:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1422/

CarbonDataQA1 avatar Jun 11 '20 13:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3146/

CarbonDataQA1 avatar Jun 11 '20 13:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1424/

CarbonDataQA1 avatar Jun 12 '20 17:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3148/

CarbonDataQA1 avatar Jun 12 '20 17:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1425/

CarbonDataQA1 avatar Jun 15 '20 06:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3149/

CarbonDataQA1 avatar Jun 15 '20 06:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3151/

CarbonDataQA1 avatar Jun 15 '20 12:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1427/

CarbonDataQA1 avatar Jun 15 '20 12:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3153/

CarbonDataQA1 avatar Jun 15 '20 18:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1429/

CarbonDataQA1 avatar Jun 15 '20 18:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1431/

CarbonDataQA1 avatar Jun 16 '20 05:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3155/

CarbonDataQA1 avatar Jun 16 '20 05:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3157/

CarbonDataQA1 avatar Jun 16 '20 09:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1433/

CarbonDataQA1 avatar Jun 16 '20 09:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1434/

CarbonDataQA1 avatar Jun 16 '20 20:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3158/

CarbonDataQA1 avatar Jun 16 '20 20:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3160/

CarbonDataQA1 avatar Jun 17 '20 08:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1436/

CarbonDataQA1 avatar Jun 17 '20 08:06 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3164/

CarbonDataQA1 avatar Jun 17 '20 17:06 CarbonDataQA1

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1440/

CarbonDataQA1 avatar Jun 17 '20 17:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3170/

CarbonDataQA1 avatar Jun 18 '20 13:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1446/

CarbonDataQA1 avatar Jun 18 '20 13:06 CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3172/

CarbonDataQA1 avatar Jun 18 '20 15:06 CarbonDataQA1

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1447/

CarbonDataQA1 avatar Jun 18 '20 15:06 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1473/

CarbonDataQA1 avatar Jun 24 '20 06:06 CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3199/

CarbonDataQA1 avatar Jun 24 '20 06:06 CarbonDataQA1