
Column names are not read for ORC tables

Open lucharo opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe. Hi! I mostly work with Parquet and CSV files, but there are also some ORC files in the db I use. I've noticed that the column names of my ORC tables are not inferred; instead, they default to _col0, _col1, _col2... Additionally, the names argument is only enabled for reading (/creating) CSV tables (with create_table), so I am not able to set the column names manually. My tables look like this when read with bsql: (image)

Describe the solution you'd like I would like a names argument when file_format is set to orc in create_table: (https://github.com/BlazingDB/blazingsql/blob/92ed45f5af438fedc8cad82e4ef8ed3f3fb7eed6/docsrc/source/reference/python/tables/apache-orc.rst)

----For BlazingSQL Developers---- How and where should this be implemented? In what part of the code should the feature be implemented? What should the APIs and/or classes look like?

Other design considerations What components of the engine could be affected by this? What functions should we make sure we use/reuse?

Testing considerations? What sort of unit tests and/or end-to-end tests should be implemented to test this?

lucharo avatar Jun 01 '21 13:06 lucharo

According to this hive issue, this is a problem with ORC tables created through Hive. Given that the issue was reported in 2016 and is still open, I think it would be great to be able to assign column names manually, on the fly, through a names argument for ORC files.

lucharo avatar Jun 01 '21 13:06 lucharo

This isn't a high-priority issue for us right now, but you could accomplish this yourself by doing something like:

bc.create_table("table_name", bc.sql("select _col0 as name1, _col1 as name2 from hive_table"))
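For tables with many columns, the aliasing query in this workaround could be generated rather than typed by hand. The following is only a sketch: build_rename_query is a hypothetical helper (not part of BlazingSQL), and it assumes the ORC columns come through with the default names _col0, _col1, ... in order, as described in the issue.

```python
def build_rename_query(source_table, new_names, prefix="_col"):
    """Build a SELECT that aliases _col0, _col1, ... to the given names.

    Hypothetical helper for illustration; it only produces a SQL string.
    Assumes the default ORC column names are prefix + zero-based position.
    """
    if not new_names:
        raise ValueError("new_names must contain at least one column name")
    # Pair each positional default name with the desired column name.
    select_list = ", ".join(
        f"{prefix}{i} AS {name}" for i, name in enumerate(new_names)
    )
    return f"SELECT {select_list} FROM {source_table}"


query = build_rename_query("hive_table", ["name1", "name2"])

# With an existing BlazingContext `bc` and a registered table "hive_table",
# the result could then be registered under a new table name:
# bc.create_table("table_name", bc.sql(query))
```

Note that this reads the whole table through bc.sql before re-registering it, so it may not be suitable for very large tables.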

felipeblazing avatar Jun 04 '21 19:06 felipeblazing

You could also use our Hive connection API. I believe that when we implemented this, we took into consideration the Hive issue you mention. https://docs.blazingsql.com/reference/python/tables/apache-hive.html

wmalpica avatar Jun 04 '21 19:06 wmalpica

Thanks for the swift replies as always! Regarding the hive connection, have you seen any performance difference between using the HDFS connector vs the Hive connector? @williamBlazing

lucharo avatar Jun 07 '21 10:06 lucharo

Thanks for the swift replies as always! Regarding the hive connection, have you seen any performance difference between using the HDFS connector vs the Hive connector? @williamBlazing

@wmalpica could you please follow up on this?

lucharo avatar Nov 23 '21 11:11 lucharo