blazingsql
Column names are not read for ORC tables
Is your feature request related to a problem? Please describe.
Hi! I mostly work with parquet and csv files, but there are also some orc files in the db I use. I've noticed that the column names of my ORC tables are not inferred; instead, the column names default to _col0, _col1, _col2, ... Additionally, the names argument is only enabled for reading/creating csv tables (with create_table), hence I am not able to set the column names manually. My tables look like this when read with bsql
Describe the solution you'd like
I would like a names argument when file_format is set to orc in create_table: (https://github.com/BlazingDB/blazingsql/blob/92ed45f5af438fedc8cad82e4ef8ed3f3fb7eed6/docsrc/source/reference/python/tables/apache-orc.rst)
----For BlazingSQL Developers---- How and where should this be implemented? In what part of the code should the feature be implemented? What should the APIs and/or classes look like?
Other design considerations What components of the engine could be affected by this? What functions should we make sure we use/reuse?
Testing considerations? What sort of unit tests and/or end-to-end tests should be implemented to test this?
According to this hive issue, this is a problem with ORC tables created through hive. Given that the issue was reported in 2016 and is still open, I think it would be great to be able to assign column names on the fly/manually through the names argument for orc files.
This isn't a high priority issue for us right now, but you could accomplish this yourself by doing something like:
bc.create_table("table_name", bc.sql("select _col0 as name1, _col1 as name2 from hive_table"))
You could also use our hive connection API. I believe that when we implemented this, we took into consideration the Hive issue you mention. https://docs.blazingsql.com/reference/python/tables/apache-hive.html
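For tables with many columns, writing the aliases by hand gets tedious, so the aliasing workaround could be generated instead. The following is only a rough sketch: build_rename_query is a hypothetical helper (not part of BlazingSQL), and it assumes the ORC columns really are named _col0, _col1, ... in file order, as described in the issue.

```python
def build_rename_query(table_name, new_names, prefix="_col"):
    """Build a SELECT that aliases positional columns to readable names.

    Hypothetical helper, not part of BlazingSQL. Assumes the source
    columns are named <prefix>0, <prefix>1, ... in order.
    """
    if not new_names:
        raise ValueError("new_names must contain at least one column name")
    # Pair each positional column (_col0, _col1, ...) with its new name.
    aliases = ", ".join(
        f"{prefix}{i} as {name}" for i, name in enumerate(new_names)
    )
    return f"select {aliases} from {table_name}"


query = build_rename_query("hive_table", ["name1", "name2"])
print(query)  # select _col0 as name1, _col1 as name2 from hive_table

# With an existing BlazingContext `bc` (not run here), the renamed result
# can be registered as a new table, as in the workaround above:
# bc.create_table("table_name", bc.sql(query))
```

The helper only builds the SQL string, so it can be checked without a GPU; the create_table call is left commented out because it needs a live BlazingContext.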
Thanks for the swift replies as always! Regarding the hive connection, have you seen any performance difference between using the HDFS connector vs the Hive connector? @williamBlazing
@wmalpica could you please follow up on this?