superFreq Explanation of fields in rivers files

Open kmavrommatis opened this issue 3 years ago • 1 comments

Hi, I am trying to parse the text output files from superFreq in order to get the VAF of specific mutations for each clone. I think the files created under the rivers directory should contain this information, but I am having difficulty extracting the VAF of a mutation from these files. For example, for a mutation at chr1:9737640, the normal is 100% A and the tumor is 28% G. superFreq reports (in file -river.tsv):

chr	start	end	name	clone	sample	sample.1	severity	type	AApos	AAbefore	AAafter	isCosmicCensus
1	9737640	9737640	CLSTN1 (1) intron	2	0.595856955671464	0.112332882535099	22	intron	FALSE

This variant is assigned to clone 2, which has an abundance of 51%. Assuming this position is heterozygous, the expected VAF is ~25%, which is consistent with the observed value. How can I confirm the VAF of this mutation, or rather find out whether it is homozygous or heterozygous in this clone? What do the values under sample and sample.1 mean?

Is this logic valid or am I missing something? Thanks in advance for your help

Apr 03 '21 06:04 kmavrommatis

Hey!

The logic is valid. I think sample and sample.1 are the clonalities (note: sample cell fraction, not cancer cell fraction) of the variant in the samples.
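As a rough illustration of reading those two columns, here is a minimal Python sketch that parses the row quoted in the question. It is not part of superFreq: the helper name `read_clonalities` is made up for this example, and the three amino-acid fields are assumed to be empty for this intronic variant, since the pasted row has fewer values than the header.

```python
import csv
import io

# Header and row copied from the question. ASSUMPTION: the AApos, AAbefore and
# AAafter fields are empty for this intronic variant (the pasted row has fewer
# values than the header, so the empty fields are a guess).
RIVER_TSV = (
    "chr\tstart\tend\tname\tclone\tsample\tsample.1\tseverity\ttype\t"
    "AApos\tAAbefore\tAAafter\tisCosmicCensus\n"
    "1\t9737640\t9737640\tCLSTN1 (1) intron\t2\t0.595856955671464\t"
    "0.112332882535099\t22\tintron\t\t\t\tFALSE\n"
)


def read_clonalities(handle, sample_columns):
    """Hypothetical helper: return one dict per variant with its clonalities."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "chr": record["chr"],
            "start": int(record["start"]),
            "clone": record["clone"],
            # One clonality (sample cell fraction) per sample column.
            "clonality": {col: float(record[col]) for col in sample_columns},
        })
    return rows


variants = read_clonalities(io.StringIO(RIVER_TSV), ["sample", "sample.1"])
for v in variants:
    print(v["chr"], v["start"], "clone", v["clone"], v["clonality"])
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle, and the list of sample columns would need to match that file's header.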

The river output deals with clonalities as opposed to VAFs, so you won't find VAF information there, although you can reverse-calculate it by matching against local CNA as you suggest. It's probably better to look at somaticVariants.xls (or .csv), but that file only has information in samples where the variant is called, not across all samples. Look at multisample as well, which is across all samples but VAF information. I haven't touched the multisample output in a few years though, so that might be a bit dated. The scatter plots, especially clones.png, can otherwise be a good visualisation of the VAFs in different clones, but maybe you want the numbers.
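The reverse calculation could look something like the Python sketch below. It uses the standard expected-VAF relation and is not superFreq's own code. The function name `expected_vaf` and its parameters are invented for illustration. It assumes that every cell in the clone carries the mutation, that no cell outside the clone does, and that cells outside the clone have two copies of the locus.

```python
def expected_vaf(clone_fraction, mutated_copies, clone_copies, other_copies=2):
    """Expected VAF of a variant carried by every cell of one clone.

    clone_fraction: clonality as sample cell fraction (0 to 1)
    mutated_copies: copies of the locus carrying the variant in clone cells
    clone_copies:   total copies of the locus in clone cells (local CNA)
    other_copies:   total copies in all other cells (ASSUMED diploid)
    """
    total_alleles = (clone_fraction * clone_copies
                     + (1 - clone_fraction) * other_copies)
    return clone_fraction * mutated_copies / total_alleles


# Values from the question: clone abundance 51%, observed tumor VAF 28%,
# and no copy number change assumed at this locus.
observed = 0.28
het = expected_vaf(0.51, mutated_copies=1, clone_copies=2)
hom = expected_vaf(0.51, mutated_copies=2, clone_copies=2)
print(f"heterozygous: {het:.3f}, homozygous: {hom:.3f}, observed: {observed:.3f}")
```

With these inputs the observed VAF lies much closer to the heterozygous expectation than to the homozygous one, which matches the reasoning in the question. If the locus has a copy number change in the clone, `clone_copies` and `mutated_copies` would need to reflect the local CNA call.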

The best way to access the raw data otherwise is from the R output in Rdirectory/myIndividual/allVariants.Rdata, which is a nested list where allVariants$variants$variants$mySample is a data frame with all information about all variants that are present in the VCF from any sample in that individual.

Apr 09 '21 01:04 ChristofferFlensburg
