results of "genotype count" are different - depending on used programs
Hello,
we are using Adam to count the total number of Genotypes for the input file “ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf” from 1000 Genome project stored on s3
With Adam we run these 2 steps:
Step 1: convert to Adam with vcf2adam :
adam-submit --master yarn-client --driver-memory 8g
--num-executors $TOTAL_EXECUTORS
--executor-cores $CORES_PER_EXECUTOR
--executor-memory $MEMORY_PER_EXECUTOR
--
vcf2adam
-parquet_compression_codec SNAPPY
hdfs:///user/ec2-user/1kg/chr1.vcf \
hdfs:///user/ec2-user/1kg/chr1.adam
then we start Step2: Adam-Shell
adam-shell --master yarn-client --driver-memory 8g
--num-executors $TOTAL_EXECUTORS
--executor-cores $CORES_PER_EXECUTOR
--executor-memory $MEMORY_PER_EXECUTOR
scala> import org.bdgenomics.adam.rdd.ADAMContext
scala> import org.bdgenomics.formats.avro._
scala> val ac = new ADAMContext(sc)
scala> val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr1.adam")
scala> genotypes.count
the result : 16277357168
As a comparision we are also using the HTSJDK with this script:
public static void main(String[] args) throws InterruptedException,
SQLException, IOException {
int position;
String uuid, chromosome;
VariantContext ctx;
//HashMap<String,Boolean> gc = new HashMap<String, Boolean>();
long count = 0;
final VCFCodec codec = new VCFCodec();
final FeatureReader<VariantContext> vcfr = AbstractFeatureReader
.getFeatureReader("c:/tmp/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", codec, false);
final CloseableTribbleIterator<VariantContext> i = vcfr.iterator();
while (i.hasNext()) {
ctx = i.next();
GenotypesContext g = ctx.getGenotypes();
count += g.size();
}
}
And this shows a different result:
16196107376
Any idea ? Are we using the wrong commands ?
greetings -Jerry
I would guess this has to do with how ADAM handles multiallelic variants. You'll notice that the ADAM Variant schema only supports a single alternate allele:
https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L430-L438
If we encounter variants with multiple alternates, we emit a Variant object for each one. Is this right, @fnothaft?
@laserson that is correct!