mango
factor key String out of LazyMaterialization RDD?
Looking at the String that is the first element in the Value 2-tuple in:
var intRDD: IntervalRDD[ReferenceRegion, (String, T)] = null
https://github.com/bigdatagenomics/mango/blob/master/mango-core/src/main/scala/org/bdgenomics/mango/models/LazyMaterialization.scala#L64
This String appears to always be a constant derived only from the name of the input file. For example, in the example data for variants it is: ALL_chr17_7500000-7515000_phase3_shapeit2_mvncall_integrated_v5a_20130502_genotypes_vcf
(Note that the filename itself is ALL.chr17.7500000-7515000.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf; the key contains no actual genome range information.)
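Comparing the two strings above, the key looks like the filename with every `.` replaced by `_`. A minimal Scala sketch of that relationship, assuming this is really how the key is built (`FileKey.fromFilename` is a hypothetical helper written for illustration, not a Mango function):

```scala
// Hypothetical helper for illustration only; not taken from the Mango code base.
object FileKey {
  // Build the key by replacing every '.' in the file name with '_'.
  def fromFilename(filename: String): String = filename.replace('.', '_')
}
```

If that assumption holds, the key can be recomputed from the file name whenever it is needed, since it depends on nothing in the individual records.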
So, for records loaded from a single file, the first member of the tuple stored in the IntervalRDD is always the same. This value could carry further information if multiple files (say, one for each chr) were loaded into Mango and thus into the same intervalRDD. Even in that case, though, the value doesn't seem useful: the intervalRDD is already accessed via its interval region key, and by the time results are returned within VizReads this key value seems to be either thrown away or already known because it appeared in the query URL.
What am I missing about the use of this String in:
var intRDD: IntervalRDD[ReferenceRegion, (String, T)] = null
Does it need to be taking up space in the RDD?
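To make the question concrete, here is a rough sketch of what factoring the String out could look like, using plain Scala collections as a stand-in for IntervalRDD. `Region`, `KeyFactoring`, `factorOut`, and `query` are illustrative names I made up, not Mango or ADAM APIs, and the half-open overlap test is an assumption:

```scala
// Minimal stand-in for a genomic region; not the ADAM ReferenceRegion class.
final case class Region(referenceName: String, start: Long, end: Long) {
  // Half-open interval overlap on the same reference (assumed semantics).
  def overlaps(other: Region): Boolean =
    referenceName == other.referenceName && start < other.end && other.start < end
}

object KeyFactoring {
  // Current shape: every record carries its own copy of the file key.
  type Keyed[T] = Seq[(Region, (String, T))]

  // Factored shape: the key appears once per file, records hold only the value.
  def factorOut[T](records: Keyed[T]): Map[String, Seq[(Region, T)]] =
    records
      .groupBy { case (_, (key, _)) => key }
      .map { case (key, recs) =>
        key -> recs.map { case (region, (_, value)) => (region, value) }
      }

  // Look up by file key first, then filter by region overlap.
  def query[T](factored: Map[String, Seq[(Region, T)]], key: String, region: Region): Seq[T] =
    factored.getOrElse(key, Seq.empty).collect {
      case (r, value) if r.overlaps(region) => value
    }
}
```

With this shape the per-record tuple shrinks from `(String, T)` to `T`, and the key is still available to a caller that already knows which file it asked for.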
We could factor these out for many of the structures right away:
Reads
- using record group name or sample.
Features
- some combination of feature types (not sure about this)
Variants
- There is no good place to put this in the current bdg-formats. @jpdna, we could easily slip this into GenotypeJson, although we were planning on getting rid of it.
Coverage
- This doesn't have a good way to do this. We could, however, get rid of this structure completely and move it to FeatureMaterialization.
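As a rough illustration of the record group name or sample idea, the key could be computed from each record on demand instead of being stored next to it. This is only a sketch: `Read` is a hypothetical stand-in with made-up field names, not the bdg-formats schema, and preferring the record group name over the sample is my assumption:

```scala
// Hypothetical record type; the fields are illustrative, not the bdg-formats schema.
final case class Read(recordGroupName: Option[String], sample: Option[String], sequence: String)

object DerivedKey {
  // Prefer the record group name and fall back to the sample (assumed precedence).
  def keyOf(read: Read): Option[String] = read.recordGroupName.orElse(read.sample)
}
```

The RDD value type would then be just the record, with `DerivedKey.keyOf` called wherever the key is currently read from the tuple.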
@jpdna I think the best thing to do is wait until https://github.com/bigdatagenomics/mango/pull/293 is in. We can then do it directly.