
Bug when reading dynamic columns

Open · CZuegner opened this issue on Oct 25 '18 · 8 comments

Hi SHC Team,

I encounter the following bug when reading from a table with dynamic columns, using the latest commit. Writing into the table does work, as the hbase shell shows the correct columns:

hbase(main):011:0> scan 'map_bug_test'
ROW    COLUMN+CELL
 one   column=cf:a, timestamp=1540453631696, value=?\x8C\xCC\xCD
 one   column=cf:b, timestamp=1540453631696, value=@S33
1 row(s)
Took 0.1304 seconds

while reading the same data back shows no result:

val readDf = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
  .show()

+------+----+
|rowkey|data|
+------+----+
+------+----+

readDf: Unit = ()
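
(Side note: .show() returns Unit, so readDf above is Unit rather than a DataFrame; the empty table is the actual query result. A minimal sketch that keeps the DataFrame around, assuming spark and the catalog string below are in scope as in the spark-shell session:)

```scala
import org.apache.spark.sql.execution.datasources.hbase._

// Load first, then show, so readDf stays a DataFrame.
val readDf = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
readDf.show() // still prints the empty table until the bug is fixed
```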


Full test code below:

import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.sql.types._
import spark.implicits._ // needed for .toDS() below (already in scope in spark-shell)

def catalog = s"""{
    |"table":{"namespace":"default", "name":"map_bug_test"},
    |"rowkey":"key",
    |"columns":{
    |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    |"data":{"cf":"cf", "col":"", "type":"map<string, float>"}
    |}
    |}""".stripMargin
// "data" maps the whole column family "cf": the empty "col" plus the map type
// turns every qualifier in cf into a key of the map (the "dynamic columns" feature).

val schema = new StructType().add("rowkey", StringType).add("data", MapType(StringType, FloatType))

val dataDS = Seq("""{"rowkey": "one", "data": {"a": 1.1, "b": 3.3}}""").toDS()

val df = spark.read.schema(schema).json(dataDS.rdd)

df.write
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseTableCatalog.newTable -> "5")) // "5": number of regions when shc creates the table
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

val readDf = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
  .show()

CZuegner · Oct 25 '18

Seems like MapType is not supported, but I am wondering how you are able to write. Check issue #207: "I consider blocking MapType for future use but it shouldn't be a problem because MapType isn't primitive one"

vivekjain123 · Oct 25 '18

Hi Vivek,

as you can see in the linked pull request, it was merged. We tested the master version: writing works fine, reading doesn't.

Regards, David

davidvoit · Oct 25 '18

Hi, I think we found the issue with the code: HBaseRelation.RESTRICTIVE -> HBaseRelation.Restrictive.none was missing from the read options. Now it seems to work.
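
For reference, a minimal sketch of the read with that option set (the comments on the modes are our reading of the code, not official documentation):

```scala
import org.apache.spark.sql.execution.datasources.hbase._

val readDf = spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    // Restrictive.none: don't restrict the scan to the columns declared in
    // the catalog, so the dynamic qualifiers under "cf" end up as keys of
    // the map<string, float> column. Restrictive.family appears to limit
    // the read to the declared columns/families (our interpretation).
    HBaseRelation.RESTRICTIVE -> HBaseRelation.Restrictive.none))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
readDf.show()
```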

One thing that would be great is some documentation about these RESTRICTIVE modes: what does family mean, what does none mean? But we think we get it now.

Thanks and Regards, David

davidvoit · Nov 02 '18

@davidvoit can you share how you set it in the reading code?

dfossouo · Nov 06 '18

spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.RESTRICTIVE -> HBaseRelation.Restrictive.none))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

CZuegner · Nov 07 '18

Hi @CZuegner,

Thanks for pointing that out.

I tried that, but it's not working: RESTRICTIVE is not recognized as a member of HBaseRelation. It seems the Maven repository doesn't have the latest version of the project.

Is there a process to integrate this new release into a Maven project?
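
(For anyone else stuck here: one workaround is to build shc from the master checkout with mvn clean install -DskipTests and depend on the locally installed artifact. A build.sbt sketch follows; the version string is hypothetical and must match the <version> in the checkout's pom.xml, and the Maven equivalent is a plain dependency on the same coordinates:)

```scala
// build.sbt — depend on a locally built shc-core; the version string below is
// hypothetical, copy the real one from shc-core/pom.xml in your checkout.
resolvers += Resolver.mavenLocal
libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.3-2.3-s_2.11-SNAPSHOT"
```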

dfossouo · Nov 07 '18

Any comments on @dfossouo's question? It seems the Maven repository doesn't have the latest version of the project.

We get 'RESTRICTIVE is not a member of HBaseRelation'. How can we get the latest version of the project reflected in the Maven repository?

jbigd · Feb 25 '19

It's been two years since @dfossouo's comment. Why is the latest build not being pushed to the Hortonworks repo? If it isn't meant to be available to the community, can you please mention in the README up to which version it is available and from which it is not?

If this repo is not going to be maintained, can the original authors mark it 'archived' so that people won't spend time considering it for their production-level applications?

injulkarnilesh · Oct 30 '20