refactor 2-way PCA command line options
There seems to be a minor issue with the order of command line options. The two invocations below differ only in the relative order of --num-reduce-partitions and --output-path, yet only the first one parses.
Works fine:
spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
--conf spark.shuffle.spill=true \
--master spark://hadoop-m:7077 \
googlegenomics-spark-examples-assembly-1.0.jar \
--client-secrets client_secrets.json \
--variant-set-id 10473108253681171589 3049512673186936334 \
--references 1:1:249881990 chr1:1:250226910 \
--output-path output/two-way-chr1-pca.tsv \
--num-reduce-partitions 15
Yields a parse error:
spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
--conf spark.shuffle.spill=true \
--master spark://hadoop-m:7077 \
googlegenomics-spark-examples-assembly-1.0.jar \
--client-secrets client_secrets.json \
--variant-set-id 10473108253681171589 3049512673186936334 \
--references 1:1:249881990 chr1:1:250226910 \
--num-reduce-partitions 15 \
--output-path output/two-way-chr1-pca.tsv
[scallop] Error: Failed to parse the trailing argument list: 'output/two-way-chr1-pca.tsv'
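For anyone trying to narrow this down, a small harness like the sketch below reproduces the parsing path outside of spark-submit. The option declarations here are assumptions about how VariantsPcaDriver defines its flags (in particular which ones are list-valued), written against a recent Scallop release; they are not the actual driver code:

import org.rogach.scallop._

// Hypothetical stand-in for the VariantsPcaDriver option declarations,
// used only to exercise Scallop's parsing with the argument orders above.
class PcaConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val clientSecrets       = opt[String]("client-secrets")
  val variantSetId        = opt[List[String]]("variant-set-id")   // takes two values above
  val references          = opt[List[String]]("references")       // takes two values above
  val outputPath          = opt[String]("output-path")
  val numReducePartitions = opt[Int]("num-reduce-partitions")
  verify()
}

object ParseCheck extends App {
  // Pass the flags in both orders and compare what Scallop actually parsed.
  val conf = new PcaConf(args.toSeq)
  println(conf.summary)
}

If a list-valued option or the trailing-argument handling interacts badly with whichever flag comes last, this should make it reproducible in isolation.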
@deflaux Give it a try with underscores like this, so that the minus signs do not get interpreted:
--output-path output/two_way_chr1_pca.tsv
Let us know if that works.
Thanks, ~p
In general, it would be better to have specific command line options for the variant set and its associated references on the case and control sides of the comparison.
The --min-allele-frequency option for the control variant set will only work on variant sets that include the AF field, such as the 1000 Genomes phase 1 and phase 3 variants.
Perhaps something like:
--control-variant-set-id 10473108253681171589 \
--control-references 1:1:249881990 \
--case-variant-set-id 3049512673186936334 \
--case-references chr1:1:250226910
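For concreteness, the corresponding Scallop declarations could look roughly like this; the option and val names are hypothetical, mirroring the flags sketched above rather than any existing code:

import org.rogach.scallop._

// Hypothetical declarations for the proposed case/control option split.
class TwoWayPcaConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val controlVariantSetId = opt[String]("control-variant-set-id", required = true, noshort = true)
  val controlReferences   = opt[List[String]]("control-references", noshort = true)
  val caseVariantSetId    = opt[String]("case-variant-set-id", required = true, noshort = true)
  val caseReferences      = opt[List[String]]("case-references", noshort = true)
  // Only meaningful for variant sets that carry the AF field (e.g. 1000 Genomes phase 1/3).
  val minAlleleFrequency  = opt[Double]("min-allele-frequency", noshort = true)
  verify()
}

Splitting the flags this way also avoids relying on the positional pairing of the two values currently passed to --variant-set-id and --references.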