refactor 2-way PCA command line options
There seems to be a minor issue with the order of command line options. The two invocations below differ only in the relative order of --num-reduce-partitions and --output-path, yet only the first one parses.
Works fine:
spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
--conf spark.shuffle.spill=true \
--master spark://hadoop-m:7077 \
googlegenomics-spark-examples-assembly-1.0.jar \
--client-secrets client_secrets.json \
--variant-set-id 10473108253681171589 3049512673186936334 \
--references 1:1:249881990 chr1:1:250226910 \
--output-path output/two-way-chr1-pca.tsv \
--num-reduce-partitions 15
Yields a parse error:
spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
--conf spark.shuffle.spill=true \
--master spark://hadoop-m:7077 \
googlegenomics-spark-examples-assembly-1.0.jar \
--client-secrets client_secrets.json \
--variant-set-id 10473108253681171589 3049512673186936334 \
--references 1:1:249881990 chr1:1:250226910 \
--num-reduce-partitions 15 \
--output-path output/two-way-chr1-pca.tsv
[scallop] Error: Failed to parse the trailing argument list: 'output/two-way-chr1-pca.tsv'
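For anyone trying to narrow this down, a small harness like the sketch below reproduces the parsing path outside of spark-submit. The option declarations here are assumptions about how VariantsPcaDriver defines its flags (in particular which ones are list-valued), written against a recent Scallop release; they are not the actual driver code:

import org.rogach.scallop._

// Hypothetical stand-in for the VariantsPcaDriver option declarations,
// used only to exercise Scallop's parsing with the argument orders above.
class PcaConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val clientSecrets       = opt[String]("client-secrets")
  val variantSetId        = opt[List[String]]("variant-set-id")   // takes two values above
  val references          = opt[List[String]]("references")       // takes two values above
  val outputPath          = opt[String]("output-path")
  val numReducePartitions = opt[Int]("num-reduce-partitions")
  verify()
}

object ParseCheck extends App {
  // Pass the flags in both orders and compare what Scallop actually parsed.
  val conf = new PcaConf(args.toSeq)
  println(conf.summary)
}

If a list-valued option or the trailing-argument handling interacts badly with whichever flag comes last, this should make it reproducible in isolation.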
@deflaux Give it a try with underscores like this, so that the minus signs do not get interpreted:
--output-path output/two_way_chr1_pca.tsv
Let us know if that works.
Thanks, ~p
In general, it would be better to have specific command line options for the variant set and its associated references on the case and control sides of the comparison.
The --min-allele-frequency option for the control variant set will only work on variant sets that include the AF field, such as the 1000 Genomes phase 1 and phase 3 variants.
Perhaps something like:
--control-variant-set-id 10473108253681171589 \
--control-references 1:1:249881990 \
--case-variant-set-id 3049512673186936334 \
--case-references chr1:1:250226910
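For concreteness, the corresponding Scallop declarations could look roughly like this; the option and val names are hypothetical, mirroring the flags sketched above rather than any existing code:

import org.rogach.scallop._

// Hypothetical declarations for the proposed case/control option split.
class TwoWayPcaConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val controlVariantSetId = opt[String]("control-variant-set-id", required = true, noshort = true)
  val controlReferences   = opt[List[String]]("control-references", noshort = true)
  val caseVariantSetId    = opt[String]("case-variant-set-id", required = true, noshort = true)
  val caseReferences      = opt[List[String]]("case-references", noshort = true)
  // Only meaningful for variant sets that carry the AF field (e.g. 1000 Genomes phase 1/3).
  val minAlleleFrequency  = opt[Double]("min-allele-frequency", noshort = true)
  verify()
}

Splitting the flags this way also avoids relying on the positional pairing of the two values currently passed to --variant-set-id and --references.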