Protobuf GA4GH objects serializable - for use in RDD?
I'm getting a confusing compile-time "overriding method has incompatible type" error:
[ERROR] /home/paschalj/mango8/mango/mango-core/src/main/scala/org/bdgenomics/mango/models/VariantContextMaterializationGA4GH.scala:104: error: overriding method stringify in class LazyMaterialization of type (rdd: org.apache.spark.rdd.RDD[(String, ga4gh.Variants.Variant)])Map[String,String];
[ERROR] method stringify has incompatible type
[ERROR] def stringify(data: RDD[(String, ga4gh.Variants.Variant)]): Map[String, String] = {
[ERROR] ^
[ERROR] one error found
when I try to override this abstract method:
https://github.com/bigdatagenomics/mango/blob/master/mango-core/src/main/scala/org/bdgenomics/mango/models/LazyMaterialization.scala#L78
def stringify(rdd: RDD[(String, T)]): Map[String, String]
with one where the RDD holds a GA4GH protobuf-defined type:
def stringify(data: RDD[(String, ga4gh.Variants.Variant)]): Map[String, String]
inside a subclass of LazyMaterialization where T = ga4gh.Variants.Variant.
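For reference, here is a minimal sketch of the override requirement, using a hypothetical simplified base class rather than the real LazyMaterialization: the types in the override have to match exactly the T fixed in the extends clause, and scalac reports exactly this "has incompatible type" error when they don't.

import org.apache.spark.rdd.RDD

// Simplified stand-in for the abstract base class; only the relevant method is shown.
abstract class SimpleMaterialization[T] {
  def stringify(rdd: RDD[(String, T)]): Map[String, String]
}

// Compiles: T is fixed to ga4gh.Variants.Variant and the override uses the same type.
class VariantSketch extends SimpleMaterialization[ga4gh.Variants.Variant] {
  override def stringify(data: RDD[(String, ga4gh.Variants.Variant)]): Map[String, String] =
    Map.empty
}

// Would not compile: T is fixed to some other type, so scalac reports
// "method stringify has incompatible type" for the override below.
// class BrokenSketch extends SimpleMaterialization[SomeOtherType] {
//   override def stringify(data: RDD[(String, ga4gh.Variants.Variant)]): Map[String, String] =
//     Map.empty
// }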
The error does not mention serialization, but I suspect the real problem is that the protobuf-defined objects are not serializable. That made me realize that I've indeed read that protobuf, at least protobuf 2, doesn't implement Serializable, and may not be appropriate for use directly in an RDD anyhow.
I did some testing on a single machine with --master local[4], and surprisingly I could construct an RDD of type ga4gh.Variants.Variant and repartition it - so it seems like Spark can deal with it.
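Roughly what that test might look like (a sketch only; the default protobuf instance stands in for real data, and populating fields through the generated builder is omitted):

import org.apache.spark.{SparkConf, SparkContext}

object ProtobufRddTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("protobuf-rdd-test").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Empty default messages are enough to exercise serialization.
    val variants = Seq.fill(100)(ga4gh.Variants.Variant.getDefaultInstance)
    val rdd = sc.parallelize(variants)

    // repartition forces a shuffle, so the messages have to survive Spark's serializer.
    println(rdd.repartition(8).count())

    sc.stop()
  }
}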
I really thought that this would fail....
Anyhow - just wanted to see if anyone has a quick comment on the serializability of protobuf objects and their suitability for being used directly in an RDD.
Other recommendations/discussion, like here: https://forums.databricks.com/questions/129/how-do-i-use-sparksql-with-protocol-buffers.html suggest using the JSON String representation of the protobuf within Spark.
That's what I'll proceed to do, as inelegant as it is, unless someone has another suggestion.
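A rough sketch of what I have in mind, assuming protobuf 3's JsonFormat utility (protobuf-java-util) is on the classpath: keep plain Strings in the RDD and only build protobuf objects at the edges.

import com.google.protobuf.util.JsonFormat
import org.apache.spark.rdd.RDD

object VariantJson {
  // Serialize each Variant to its JSON string representation; one printer per partition.
  def toJsonRdd(rdd: RDD[(String, ga4gh.Variants.Variant)]): RDD[(String, String)] =
    rdd.mapPartitions { iter =>
      val printer = JsonFormat.printer()
      iter.map { case (key, variant) => (key, printer.print(variant)) }
    }

  // Rebuild a Variant from its JSON string only when it is actually needed.
  def fromJson(json: String): ga4gh.Variants.Variant = {
    val builder = ga4gh.Variants.Variant.newBuilder()
    JsonFormat.parser().merge(json, builder)
    builder.build()
  }
}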
I just realized the link you sent wasn't to your branch. Which branch is this on?
I hadn't pushed it, but now it's the ga4gh4 branch in my repo.
It gives the above incompatible type error when compiling.
https://github.com/jpdna/mango/blob/ga4gh4/mango-core/src/main/scala/org/bdgenomics/mango/models/VariantContextMaterializationGA4GH.scala#L104
https://github.com/jpdna/mango/blob/ga4gh4/mango-core/src/main/scala/org/bdgenomics/mango/models/VariantContextMaterializationGA4GH.scala#L47
@akmorrow13 - in that ga4gh4 branch I believe that VariantContextMaterializationGA4GH.scala does use the correct T, but I'm still not sure why the method type for stringify is incompatible. Also - despite my test, I still question whether ga4gh.Variants.Variant is suitable to be in an RDD, or if I ought to be storing its JSON string representation in the RDD instead.
Personal opinion here, but I don't think you should use them this way. The bdgenomics formats do a better job of managing the data representation, whereas the protobuf is meant for interchange only. For example, deserializing a message with no value assumes false for a boolean. I think this lack of nullability for some fields (strings are "", etc.) will be confusing.
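To illustrate with the message type in question (a sketch of standard proto3 behavior; field names taken from the GA4GH variants schema):

// A Variant that was never populated - or whose fields were absent on the wire -
// comes back with zero values, indistinguishable from explicitly set ones.
val empty = ga4gh.Variants.Variant.getDefaultInstance
println(empty.getReferenceName)  // "" rather than null
println(empty.getStart)          // 0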
Although you could write an RDD over protobuf, Spark is already doing a far better job than protobuf could at de/serializing its messages between your Spark nodes. I hope I've understood correctly what you're trying to do here!
Thanks @david4096. We agree, and had a discussion along the same lines today - we are going to push protobuf usage back out to the interchange/output layer, as you suggest.