spark-acid Issue 86 : Add support for Datasource V2 : ORC

Open maheshk114 opened this issue 4 years ago • 2 comments

Jul 25 '20 16:07 maheshk114

@maheshk114: I was testing the changes locally. I created a simple unpartitioned table and tried reading a range of rows, but got wrong results. Here is the query I tried:

sql("select * from t1 where id > 15 and id < 25").show

The output was (rows with id 11 through 30, instead of only 16 through 24):

+----------+---+
|      name| id|
+----------+---+
|new_name11| 11|
|new_name12| 12|
|new_name13| 13|
|new_name14| 14|
|new_name15| 15|
|new_name16| 16|
|new_name17| 17|
|new_name18| 18|
|new_name19| 19|
|new_name20| 20|
|new_name21| 21|
|new_name22| 22|
|new_name23| 23|
|new_name24| 24|
|new_name25| 25|
|new_name26| 26|
|new_name27| 27|
|new_name28| 28|
|new_name29| 29|
|new_name30| 30|
+----------+---+

The table schema is:

org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(id,IntegerType,true))

I get the expected result if I disable the DSv2 reader. Let me know if you need more details about the query above.

Have you tested this PR in your environment?

Sep 30 '20 18:09 sourabh912

@sourabh912 Thanks for pointing this out. The issue is that ORC does not support row-level filtering, so the filter has to be applied again in Spark. We have done some testing internally, but with different Hive/Spark/ORC versions. This specific issue was already fixed internally; I forgot to merge the fix into this PR.
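
To illustrate why the filter has to be applied again, here is a minimal sketch in plain Scala, with no Spark or ORC dependency. It is a toy model, not the spark-acid code or the fix mentioned above: `Row`, `RowGroup`, `groupMayMatch` and the group size of 10 are all hypothetical, chosen only so the numbers resemble the output reported in this thread. The idea is that a pushed-down predicate can only skip whole groups of rows based on min/max statistics, so any group that might contain a match is returned in full, and the row-level predicate still has to run afterwards.

```scala
// Toy model only: every name below is hypothetical, not a spark-acid or ORC API.
final case class Row(name: String, id: Int)

// A group of rows with min/max statistics on `id`,
// similar in spirit to the column statistics ORC keeps.
final case class RowGroup(rows: Seq[Row]) {
  val minId: Int = rows.map(_.id).min
  val maxId: Int = rows.map(_.id).max
}

object ResidualFilterSketch {
  // Predicate from the thread: id > 15 and id < 25
  val lower = 15
  val upper = 25

  // Exact, row-level check.
  def rowMatches(r: Row): Boolean = r.id > lower && r.id < upper

  // Coarse check on statistics: could this group contain any matching row?
  def groupMayMatch(g: RowGroup): Boolean = g.maxId > lower && g.minId < upper

  // 30 rows split into groups of 10 (the group size is an assumption for illustration).
  val table: Seq[Row] = (1 to 30).map(i => Row(s"new_name$i", i))
  val groups: Seq[RowGroup] = table.grouped(10).map(g => RowGroup(g)).toSeq

  // Relying on pushdown alone: whole groups are kept or skipped.
  val pushdownOnly: Seq[Row] = groups.filter(groupMayMatch).flatMap(_.rows)

  // Re-applying the predicate row by row on what the reader returned.
  val withResidualFilter: Seq[Row] = pushdownOnly.filter(rowMatches)

  def main(args: Array[String]): Unit = {
    println("pushdown only:        " + pushdownOnly.map(_.id).mkString(","))
    println("with residual filter: " + withResidualFilter.map(_.id).mkString(","))
  }
}
```

In this toy setup the group holding ids 1 to 10 is skipped, while the groups holding 11 to 20 and 21 to 30 are returned whole, so only the second pass narrows the result to ids 16 through 24. In Spark's DataSource V2 API, a reader can request this second pass by returning the filters from `SupportsPushDownFilters.pushFilters` as filters to be evaluated after the scan; whether the internal fix does exactly that is not stated in this thread.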

Oct 16 '20 03:10 maheshk114
