spark-acid
Issue 86 : Add support for Datasource V2 : ORC
@maheshk114 : I was testing the changes locally. I created a simple unpartitioned table and tried reading data in a range. I got the wrong results. Here is the query I tried:
sql("select * from t1 where id > 15 and id < 25").show
The output included ids 11 through 30 instead of only 16 through 24:
+----------+---+
|      name| id|
+----------+---+
|new_name11| 11|
|new_name12| 12|
|new_name13| 13|
|new_name14| 14|
|new_name15| 15|
|new_name16| 16|
|new_name17| 17|
|new_name18| 18|
|new_name19| 19|
|new_name20| 20|
|new_name21| 21|
|new_name22| 22|
|new_name23| 23|
|new_name24| 24|
|new_name25| 25|
|new_name26| 26|
|new_name27| 27|
|new_name28| 28|
|new_name29| 29|
|new_name30| 30|
+----------+---+
Table schema is:
org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(id,IntegerType,true))
I get the expected result if I disable the DSv2 reader. Let me know if you need more details about the query above.
Have you guys tested this PR in your environment?
@sourabh912 thanks for pointing it out. The issue is that ORC does not support row-level filtering, so the filter has to be applied again in Spark. We have done some testing internally, but with different Hive/Spark/ORC versions. This specific issue was already fixed internally; I forgot to merge the fix into this PR.
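To illustrate why the filter has to run again in Spark, here is a minimal, self-contained sketch. It is a hypothetical model, not the spark-acid reader or the ORC API: `Row`, `RowGroup`, `scanWithPushdown`, and `residualFilter` are names invented for this example, and the group size of 10 is an assumption chosen for illustration. It assumes pushdown can only skip whole groups of rows based on min/max statistics, so every row in an overlapping group is returned and the exact predicate still has to be evaluated per row.

```scala
// Hypothetical model, not the spark-acid or ORC API: it only illustrates why a
// pushed-down range predicate must be evaluated again after the scan.
object ResidualFilterSketch {
  final case class Row(name: String, id: Int)

  // A group of rows with the min/max statistics kept for the `id` column.
  final case class RowGroup(rows: Seq[Row]) {
    val minId: Int = rows.map(_.id).min
    val maxId: Int = rows.map(_.id).max
  }

  // Pushdown step: statistics can only prove that a whole group has no match,
  // so any group whose [minId, maxId] range overlaps (lo, hi) is read in full.
  def scanWithPushdown(groups: Seq[RowGroup], lo: Int, hi: Int): Seq[Row] =
    groups.filter(g => g.maxId > lo && g.minId < hi).flatMap(_.rows)

  // Residual step: the engine re-applies the exact predicate row by row.
  def residualFilter(rows: Seq[Row], lo: Int, hi: Int): Seq[Row] =
    rows.filter(r => r.id > lo && r.id < hi)

  def main(args: Array[String]): Unit = {
    // 40 rows split into groups of 10 (assumed layout): ids 1-10, 11-20, 21-30, 31-40.
    val groups = (1 to 40)
      .map(i => Row(s"new_name$i", i))
      .grouped(10)
      .map(RowGroup(_))
      .toList

    val scanned = scanWithPushdown(groups, lo = 15, hi = 25)
    val exact = residualFilter(scanned, lo = 15, hi = 25)

    println(s"after pushdown only:   ${scanned.map(_.id).mkString(",")}")
    println(s"after residual filter: ${exact.map(_.id).mkString(",")}")
  }
}
```

With this assumed layout, the pushdown-only scan returns ids 11 through 30, and the residual filter narrows that to 16 through 24. In a DSv2 reader, the equivalent of `residualFilter` is handled by reporting the pushed filters back to Spark as still needing evaluation, rather than treating them as fully handled by the source.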