Nirvana
Nirvana copied to clipboard
JSON parsing. Am I on crazy pills?
R
library(tidyverse)
library(RJSONIO)
anno <- RJSONIO::fromJSON('/Users/mcgaugheyd/Desktop/sample717.json.gz')
anno_tib <- anno$positions %>% bind_rows() %>% as_tibble()
anno_tib %>% dim()
This returns 152,392 variants.
bash
zcat sample717_SmallVariants.genome.vcf.gz | grep PASS | wc -l
This returns 1,811,351 variants
Hmm. The vcf is full of 0/0
and 1/.
Perhaps I should remove those?
bash
zcat sample717__SmallVariants.genome.vcf.gz | grep PASS | grep -v "0/0" | grep -v "0/." | wc -l
This returns 490 variants.
OK, let's look at the json file again.
R
anno_tib %>% filter(filters == 'PASS') %>% dim()
This returns 1288 variants.
Can you give some pointers on how to line the json and vcf up?
Is the partial answer that intergenic variants don't get annotated? When I get around to figuring out how to left_join these together I'll figure this out - but right now I'm not 100% confident the json is holding all the variants.
Hi David,
Nirvana will skip the reference call variants from the genome VCF file. Then reason is mainly to control the size of the JSON output.
Is there some way you recommend pulling out the format fields?
Since you didn't put them into the "outside" of the file (like a vcf would have in the header. You did this for some stuff like the gene info) the only way I can figure out how to extract (for example, the 'cosmic' fields) is to do a kind of computationally expensive extraction where I flatten the entire column and run a grep.
Example:
#R
anno_tib %>% pull(variants) %>% unlist() %>% names() %>% grep('cosmic', ., ignore.case = TRUE, value = TRUE) %>% unique() %>% sort()
@davemcg - did you find a solution for this that worked for you? I've given the two Jupyter Notebooks that are part of the docs a try but didn't really get proper output that I could use downstream.