Issue 86 : Add support for Datasource V2 : ORC

Open maheshk114 opened this issue 4 years ago • 2 comments

maheshk114 avatar Jul 25 '20 16:07 maheshk114

@maheshk114: I was testing the changes locally. I created a simple unpartitioned table and tried reading a range of ids, but got wrong results: the output includes rows outside the requested range. Here is the query I tried:

sql("select * from t1 where id > 15 and id < 25").show

The output was:

+----------+---+
|      name| id|
+----------+---+
|new_name11| 11|
|new_name12| 12|
|new_name13| 13|
|new_name14| 14|
|new_name15| 15|
|new_name16| 16|
|new_name17| 17|
|new_name18| 18|
|new_name19| 19|
|new_name20| 20|
|new_name21| 21|
|new_name22| 22|
|new_name23| 23|
|new_name24| 24|
|new_name25| 25|
|new_name26| 26|
|new_name27| 27|
|new_name28| 28|
|new_name29| 29|
|new_name30| 30|
+----------+---+

Table schema is:

org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(id,IntegerType,true))

I get the expected result if I disable the DSv2 reader. Let me know if you need more details about the above query.

Have you guys tested this PR in your environment?

sourabh912 avatar Sep 30 '20 18:09 sourabh912

@sourabh912 Thanks for pointing it out. The issue is that ORC does not support row-level filtering, so the filter has to be applied again in Spark. We have done some testing internally, but with different Hive/Spark/ORC versions. This specific issue was already fixed internally; I forgot to merge the fix into this PR.
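
To illustrate why the filter has to be applied again, here is a minimal sketch in plain Scala collections. It is not the spark-acid fix and uses no Spark or ORC APIs. It assumes a hypothetical layout in which ids 1 to 40 are stored in groups of 10 rows, each group carrying min/max statistics on id; the names OrcPushdownSketch, RowGroup, pushdownOnly and pushdownThenFilter are made up for this example.

```scala
// Illustrative model only: plain Scala collections, no Spark or ORC APIs.
object OrcPushdownSketch {
  // A row shaped like the table in the report: (name, id).
  final case class Row(name: String, id: Int)

  // A block of rows with min/max statistics on id.
  // This stands in for a unit that a pushed-down predicate can skip as a whole.
  final case class RowGroup(rows: Vector[Row]) {
    val minId: Int = rows.map(_.id).min
    val maxId: Int = rows.map(_.id).max
  }

  // Hypothetical layout (an assumption): ids 1..40 in groups of 10 rows each.
  val groups: Vector[RowGroup] =
    (1 to 40).map(i => Row(s"new_name$i", i)).toVector
      .grouped(10).map(RowGroup(_)).toVector

  // Pushdown alone: keep every group whose [minId, maxId] range could contain
  // a row with lo < id < hi, and return all rows of the kept groups.
  def pushdownOnly(lo: Int, hi: Int): Vector[Row] =
    groups.filter(g => g.maxId > lo && g.minId < hi).flatMap(_.rows)

  // Pushdown followed by a row-level filter on the surviving rows.
  def pushdownThenFilter(lo: Int, hi: Int): Vector[Row] =
    pushdownOnly(lo, hi).filter(r => r.id > lo && r.id < hi)

  def main(args: Array[String]): Unit = {
    println(pushdownOnly(15, 25).map(_.id).mkString(","))
    println(pushdownThenFilter(15, 25).map(_.id).mkString(","))
  }
}
```

Under this assumed layout, pushdownOnly(15, 25) returns ids 11 through 30, the same rows as in the report above, while pushdownThenFilter(15, 25) returns only ids 16 through 24. The actual group size in the reported table is not known from this thread.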

maheshk114 avatar Oct 16 '20 03:10 maheshk114