carbondata
[WIP] Carbon Store Size Optimization and Query Performance Improvement
What changes are proposed in this PR
String/Varchar Datatype Store Size Optimization: Currently, the length of each String/Varchar value is stored as a Short/Int, which increases the store size. To reduce the store size, adaptive encoding is applied to the length part irrespective of the String/Varchar type, so processing no longer needs separate handling for the String and Varchar datatypes.
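As an illustration of the idea only, here is a minimal sketch of adaptive encoding for the length part. The class `AdaptiveLengthSketch`, its methods, and the one-byte width header are hypothetical and are not CarbonData's actual encoder; the sketch assumes all lengths are non-negative.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not CarbonData's real encoder classes.
final class AdaptiveLengthSketch {

  // Pick the smallest width (in bytes) that can hold the largest length in the page.
  static int chooseWidth(int[] lengths) {
    int max = 0;
    for (int len : lengths) {
      max = Math.max(max, len);
    }
    if (max <= Byte.MAX_VALUE) {
      return 1;
    }
    if (max <= Short.MAX_VALUE) {
      return 2;
    }
    return 4;
  }

  // Layout (an assumption for this sketch): one header byte with the width,
  // followed by every length written at that width.
  static byte[] encode(int[] lengths) {
    int width = chooseWidth(lengths);
    ByteBuffer buf = ByteBuffer.allocate(1 + width * lengths.length);
    buf.put((byte) width);
    for (int len : lengths) {
      switch (width) {
        case 1: buf.put((byte) len); break;
        case 2: buf.putShort((short) len); break;
        default: buf.putInt(len);
      }
    }
    return buf.array();
  }

  // Reads the width header, then decodes 'count' lengths at that width.
  static int[] decode(byte[] encoded, int count) {
    ByteBuffer buf = ByteBuffer.wrap(encoded);
    int width = buf.get();
    int[] lengths = new int[count];
    for (int i = 0; i < count; i++) {
      switch (width) {
        case 1: lengths[i] = buf.get(); break;
        case 2: lengths[i] = buf.getShort(); break;
        default: lengths[i] = buf.getInt();
      }
    }
    return lengths;
  }
}
```

Because the width is chosen from the data rather than from the column type, the same code path serves both String and Varchar columns in this sketch.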
String/Varchar datatype query processing optimization: Currently, to process the String/Varchar datatype during a query, the offset (position of the data) is calculated and the data is fetched based on that position. This causes many cache line misses and degrades query performance. To handle this, for a full scan query with no inverted index, data is fetched linearly to avoid cache line misses.
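The two access patterns can be contrasted with a small sketch. The class `LinearScanSketch` and the layout (one concatenated byte array plus a separate lengths array) are assumptions made for illustration; the code shows only the shape of each read path and is not a benchmark or CarbonData's actual reader.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the two read paths over concatenated string data.
final class LinearScanSketch {

  // Offset-based path: materialize the start position of every row up front.
  static int[] buildOffsets(int[] lengths) {
    int[] offsets = new int[lengths.length];
    int position = 0;
    for (int i = 0; i < lengths.length; i++) {
      offsets[i] = position;
      position += lengths[i];
    }
    return offsets;
  }

  // Offset-based path: fetch one row by looking up its position.
  static String readAt(byte[] data, int[] offsets, int[] lengths, int row) {
    return new String(data, offsets[row], lengths[row], StandardCharsets.UTF_8);
  }

  // Linear path for a full scan: walk the data once with a running cursor,
  // so no offsets array is built or consulted.
  static String[] scanAll(byte[] data, int[] lengths) {
    String[] rows = new String[lengths.length];
    int cursor = 0;
    for (int i = 0; i < lengths.length; i++) {
      rows[i] = new String(data, cursor, lengths[i], StandardCharsets.UTF_8);
      cursor += lengths[i];
    }
    return rows;
  }
}
```

The linear path only applies when rows are read in storage order, which matches the condition stated above: a full scan with no inverted index.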
Adaptive encoding for Global/Direct/Local dictionary columns: Currently, Global/Direct/Local dictionary values are stored in binary format and only Snappy is applied for compression. Since Global/Direct/Local dictionary values are of Integer data type, they can be adaptively stored with a data type smaller than Integer. Adaptive encoding is added for global/direct dictionary columns to reduce the store size.
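A rough sketch of how a smaller store type could be selected for dictionary values from the page's minimum and maximum. `DictionaryKeyWidthSketch`, `StoreType`, and both methods are hypothetical names for illustration; the real codec selection in CarbonData may differ, for example by supporting more widths.

```java
// Hypothetical sketch: choose the narrowest integral type for dictionary values.
final class DictionaryKeyWidthSketch {

  enum StoreType {
    BYTE(1), SHORT(2), INT(4);

    final int bytes;

    StoreType(int bytes) {
      this.bytes = bytes;
    }
  }

  // Select the smallest type whose range covers [min, max].
  static StoreType fitType(int min, int max) {
    if (min >= Byte.MIN_VALUE && max <= Byte.MAX_VALUE) {
      return StoreType.BYTE;
    }
    if (min >= Short.MIN_VALUE && max <= Short.MAX_VALUE) {
      return StoreType.SHORT;
    }
    return StoreType.INT;
  }

  // Size in bytes of the page after adaptive encoding, before any compression.
  static long encodedSize(int[] keys) {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (int key : keys) {
      min = Math.min(min, key);
      max = Math.max(max, key);
    }
    return (long) keys.length * fitType(min, max).bytes;
  }
}
```

In this sketch a page whose dictionary values all fit in a byte takes one quarter of the space of fixed 4-byte integers before compression is applied.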
Method In-lining Optimization: The JIT will inline a method if its bytecode size is less than 325 bytes and it is called more than 10K times (default values). If a method is private or static, it is easier for the JIT to inline because no type check is required; for a protected/public method, the type check adds overhead, so the call does not behave as an inlined call. For this reason, some refactoring is done for primitive no-dictionary data type columns. Earlier, ColumnPageWrapper.java handled query processing for all primitive no-dictionary data type columns. In this PR, separate classes are created to handle each data type: all hot methods are kept private, protected methods are overridden, and the remaining methods are added in the super classes.
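The class layout described above might look roughly like the following sketch. All class and method names here are hypothetical and do not correspond to the classes added in this PR; the example only shows the pattern of a shared super class, one subclass per data type, and a private per-row accessor on the hot path.

```java
// Hypothetical super class: common entry point, per-type work is delegated.
abstract class PrimitivePageReaderSketch {

  // Shared, non-overridable entry point used by callers.
  final long sum(int rowCount) {
    return sumRows(rowCount);
  }

  // Overridden once per data type.
  protected abstract long sumRows(int rowCount);
}

// One concrete class per data type, so each call site sees a single type.
final class IntPageReaderSketch extends PrimitivePageReaderSketch {
  private final int[] values;

  IntPageReaderSketch(int[] values) {
    this.values = values;
  }

  @Override
  protected long sumRows(int rowCount) {
    long total = 0;
    for (int row = 0; row < rowCount; row++) {
      total += valueAt(row);
    }
    return total;
  }

  // Hot per-row accessor: private, so the call target is known statically.
  private int valueAt(int row) {
    return values[row];
  }
}

final class LongPageReaderSketch extends PrimitivePageReaderSketch {
  private final long[] values;

  LongPageReaderSketch(long[] values) {
    this.values = values;
  }

  @Override
  protected long sumRows(int rowCount) {
    long total = 0;
    for (int row = 0; row < rowCount; row++) {
      total += valueAt(row);
    }
    return total;
  }

  // Hot per-row accessor: private, so the call target is known statically.
  private long valueAt(int row) {
    return values[row];
  }
}
```

On HotSpot, inlining decisions for code like this can be inspected with the diagnostic options `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining`; whether a given method is actually inlined depends on the JVM version and its thresholds.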
Note: Class/Method Documentation is pending.
Store Size Comparison Report: TPCH lineitem table, 74.1 GB (100 GB TPCH data)
Current Master: 22.6 GB
With This PR: 19.9 GB
Parquet: 21.4 GB
- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done: All existing testcases will cover all the code changes
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/315/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/491/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8561/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/495/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8565/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/319/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/324/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8570/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/500/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/335/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8582/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/512/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/336/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8583/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/513/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/369/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8617/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/547/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/385/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8633/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/563/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/390/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8638/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/568/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/465/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/645/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8715/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/707/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8774/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/529/