shc
Support for wide tables with dynamic column names?
If I had a very wide HBase table with dynamically named columns, is there a way to use such a table (for both reads and writes) with shc? Enumerating the columns in a standard catalog wouldn't be feasible, since there could be, say, a million of them.
Thanks.
In the catalog, you only need to define the columns you want to read from or write to the HBase table, no matter how many columns the table has (you can ignore the columns you don't care about).
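For reference, here is a minimal sketch of what that looks like, using the catalog format from the shc README; the table name `wide_table`, the column family `cf`, and the column names are hypothetical, and only the columns of interest appear:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("shc-catalog-example").getOrCreate()

// Only the columns we care about are listed; any other columns in the
// (possibly very wide) HBase table are simply ignored by shc.
val catalog = s"""{
  |"table":{"namespace":"default", "name":"wide_table"},
  |"rowkey":"key",
  |"columns":{
    |"key":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf", "col":"col1", "type":"string"}
  |}
|}""".stripMargin

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
```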
OK, but with wide tables one generally needs to be concerned with all of the columns, and the overall schema (that is, the set of columns) can vary at any given time depending on the data. For example, suppose you are doing an in-depth analysis of files, with each row keyed by SHA-1 and tens of thousands of columns, dynamically named to uniquely identify them, storing the various data about those files. It doesn't seem feasible to specify a catalog for such a table.
Wide tables are a fairly common use case in HBase, and this seems like a serious limitation to using wide HBase tables with shc at all. Do Hortonworks and/or Bloomberg use wide tables in conjunction with shc? If so, how is that done in practice? Are there plans to add such dynamic support in the future? I think it would be quite valuable.
Thanks.
I'm also voting for this issue; we need to solve a similar problem. Ideally we could read dynamic columns the way Hive does — see the section "Hive MAP to HBase Column Prefix" at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration. I created a prototype solution for reading such columns, but it is a bit hacky, and unfortunately I don't have an idea of how to do it properly so that such a table can be both read and written. UserDefinedType could be helpful, but it has been a closed API since Spark 2.0.
I'm offering my help to resolve this issue, but I first need to work out an approach.
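For illustration, here is a rough sketch of the kind of schema-less read the prototype is aiming at, bypassing the catalog entirely and using the plain HBase `TableInputFormat`; the table name `files` and column family `cf` are hypothetical:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dynamic-column-read").getOrCreate()

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "files") // hypothetical table name

val rdd = spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Each HBase row becomes (rowKey, Map[qualifier -> value]), so dynamically
// named qualifiers come through without any pre-defined schema.
val rows = rdd.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  val cells = result
    .getFamilyMap(Bytes.toBytes("cf")) // hypothetical column family
    .asScala
    .map { case (q, v) => Bytes.toString(q) -> Bytes.toString(v) }
    .toMap
  (rowKey, cells)
}
```

This is roughly what the Hive `MAP` mapping does: the set of qualifiers is discovered per row at read time rather than declared up front.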
@btomala PRs are welcome for this issue. Thanks.
OK, but do you have any idea how to do it?
Here is the PR I raised with what I did to read dynamic columns: https://github.com/hortonworks-spark/shc/pull/197
What about dynamic columns for writing? I mean using unique identifiers as column names. That isn't supported by the catalog definition, where you have to pre-define the schema...
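Until shc supports this, one workaround sketch is to bypass the catalog on the write side as well and emit `Put`s directly with the HBase client API. This assumes `rows` has the same `(rowKey, Map[qualifier -> value])` shape as in the read sketch above; the table name `files` and family `cf` remain hypothetical:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// One HBase connection per partition; each record's map supplies its own
// dynamically named qualifiers, so no schema is declared anywhere.
rows.foreachPartition { partition =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("files")) // hypothetical table
  try {
    partition.foreach { case (rowKey, cells) =>
      val put = new Put(Bytes.toBytes(rowKey))
      cells.foreach { case (qualifier, value) =>
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(qualifier), Bytes.toBytes(value))
      }
      table.put(put)
    }
  } finally {
    table.close()
    connection.close()
  }
}
```

The trade-off is that you lose shc's DataFrame integration for these writes, but you gain fully dynamic column names.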