Bug when reading dynamic columns
Hi SHC Team,
I encountered the following bug when reading from a table with dynamic columns using the latest commit. Writing into the table does work, as the hbase shell shows the correct columns:
hbase(main):011:0> scan 'map_bug_test'
ROW COLUMN+CELL
one column=cf:a, timestamp=1540453631696, value=?\x8C\xCC\xCD
one column=cf:b, timestamp=1540453631696, value=@S33
1 row(s)
Took 0.1304 seconds
while reading the same data shows no result:
val readDf = spark.read.options(Map(HBaseTableCatalog.tableCatalog->catalog)).format("org.apache.spark.sql.execution.datasources.hbase").load().show()
+------+----+
|rowkey|data|
+------+----+
+------+----+
readDf: Unit = ()
Full test code below:
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.sql.types._
import spark.implicits._  // needed for toDS() when not running in the spark-shell

def catalog = s"""{
                 |"table":{"namespace":"default", "name":"map_bug_test"},
                 |"rowkey":"key",
                 |"columns":{
                 |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
                 |"data":{"cf":"cf", "col":"", "type":"map<string, float>"}
                 |}
                 |}""".stripMargin
val schema = new StructType().add("rowkey", StringType).add("data", MapType(StringType, FloatType))
val dataDS = Seq("""{"rowkey": "one", "data": {"a": 1.1, "b": 3.3}}""").toDS()
val df = spark.read.schema(schema).json(dataDS.rdd)
df.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.spark.sql.execution.datasources.hbase").save()
val readDf = spark.read.options(Map(HBaseTableCatalog.tableCatalog->catalog)).format("org.apache.spark.sql.execution.datasources.hbase").load().show()
It seems like MapType is not supported, but I wonder how you are able to write at all. See issue #207: "I consider blocking MapType for future use but it shouldn't be a problem because MapType isn't primitive one"
Hi Vivek,
As you can see in the linked pull request, it was merged. We tested the master version: writing works fine, but reading doesn't.
Regards, David
Hi, I think we found the issue with our code: the option HBaseRelation.RESTRICTIVE -> HBaseRelation.Restrictive.none was missing in the reading code. Now it seems to work.
One thing that would be great is some documentation about these RESTRICTIVE modes: what does family mean, what does none mean? But we think we get it now.
Thanks and Regards, David
@davidvoit can you share how you set it in the reading code?
spark.read.options(Map(HBaseTableCatalog.tableCatalog->catalog, HBaseRelation.RESTRICTIVE -> HBaseRelation.Restrictive.none)).format("org.apache.spark.sql.execution.datasources.hbase").load()
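The same call spread over multiple lines, with the import it needs (a sketch of our setup; note that HBaseRelation.RESTRICTIVE and the Restrictive values only exist on the master branch, not in the published artifacts):
import org.apache.spark.sql.execution.datasources.hbase._

// Same catalog string as in the original post. Without the RESTRICTIVE
// option the dynamic "data" map column came back empty for us; with
// Restrictive.none the read returns the written row.
val readDf = spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.RESTRICTIVE -> HBaseRelation.Restrictive.none))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
readDf.show()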
Hi @CZuegner,
Thanks for pointing that out.
I tried that, but it is not working: RESTRICTIVE is not recognized as a member of HBaseRelation. It seems the Maven repository doesn't have the latest version of the project.
Is there a process to integrate this new version into a Maven project?
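In the meantime, building from source should work; a sketch of what I'd expect, assuming the standard Maven build of this repo:
# Build the master branch and install it into the local ~/.m2 repository,
# since the published artifacts predate the RESTRICTIVE option.
git clone https://github.com/hortonworks-spark/shc.git
cd shc
mvn clean install -DskipTests
Afterwards the locally installed shc-core artifact can be referenced from a Maven project with the version declared in the repo's pom.xml.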
Any comments on @dfossouo's question? It seems the Maven repository doesn't have the latest version of the project.
I get 'RESTRICTIVE is not a member of HBaseRelation'. How can we get the latest version of the project into the Maven repository?
It's been 2 years since @dfossouo's comment. Why is the latest build not being pushed to the Hortonworks repo? If it isn't meant to be available to the community, can you please mention in the README up to which point it is available and after which it is not?
If this repo is no longer maintained, can the original authors mark it 'archived' so that people won't spend time considering it for their production-level applications?