Nirvana icon indicating copy to clipboard operation
Nirvana copied to clipboard

JSON parsing. Am I on crazy pills?

Open davemcg opened this issue 4 years ago • 4 comments

R

library(tidyverse)
library(RJSONIO)
anno <- RJSONIO::fromJSON('/Users/mcgaugheyd/Desktop/sample717.json.gz')

anno_tib <- anno$positions %>% bind_rows() %>% as_tibble()

anno_tib %>% dim()

This returns 152,392 variants.

bash

zcat sample717_SmallVariants.genome.vcf.gz | grep PASS | wc -l

This returns 1,811,351 variants

Hmm. The vcf is full of 0/0 and 1/. Perhaps I should remove those?

bash

zcat sample717__SmallVariants.genome.vcf.gz | grep PASS | grep -v "0/0" | grep -v "0/." | wc -l

This returns 490 variants.

OK, let's look at the json file again.

R

anno_tib %>% filter(filters == 'PASS') %>% dim()

This returns 1288 variants.


Can you give some pointers on how to line the json and vcf up?

davemcg avatar Aug 04 '20 14:08 davemcg

Is the partial answer that intergenic variants don't get annotated? When I get around to figuring out how to left_join these together I'll figure this out - but right now I'm not 100% confident the json is holding all the variants.

davemcg avatar Aug 04 '20 15:08 davemcg

Hi David,

Nirvana will skip the reference call variants from the genome VCF file. Then reason is mainly to control the size of the JSON output.

haochenl avatar Aug 04 '20 21:08 haochenl

Is there some way you recommend pulling out the format fields?

Since you didn't put them into the "outside" of the file (like a vcf would have in the header. You did this for some stuff like the gene info) the only way I can figure out how to extract (for example, the 'cosmic' fields) is to do a kind of computationally expensive extraction where I flatten the entire column and run a grep.

Example:

#R

anno_tib %>% pull(variants) %>% unlist() %>% names() %>% grep('cosmic', ., ignore.case = TRUE, value = TRUE) %>% unique() %>% sort()

davemcg avatar Aug 06 '20 19:08 davemcg

@davemcg - did you find a solution for this that worked for you? I've given the two Jupyter Notebooks that are part of the docs a try but didn't really get proper output that I could use downstream.

apeltzer avatar Jul 11 '22 14:07 apeltzer