SigProfilerMatrixGenerator icon indicating copy to clipboard operation
SigProfilerMatrixGenerator copied to clipboard

Non-integer copy numbers

Open hylkedonker opened this issue 2 years ago • 2 comments

Hi,

I would like to use the generateCNVMatrix function on copy-number files generated by PURPLE. When I look through:

  • The example file all.breast.ascat.summary.sample.tsv,
  • the code of generateCNVMatrix,

it seems that the total copy number value is always assumed to be integer valued (e.g., Tumour TCN=5 in the example file).

However, PURPLE generates floating point copy numbers. So it would have fields with the values copyNumber=2.8189, MinorAlleleCopyNumber=0.8165 and majorAlleleCopyNumber=2.0076.

What would be a consistent way to "round" the floating points to integers? For example,

0.01 -> 1
1.4 -> 2
2.4 -> 2
2.5 -> 3

?

If there is a consistent way to round, then I would be more than happy to make a PR to implement support for the PURPLE format in generateCNVMatrix.

Thanks,

Hylke

hylkedonker avatar Jun 22 '22 09:06 hylkedonker

Hi, thanks for bringing up this point! In the past, I have rounded to the nearest whole number (for A and B alleles) and then TCN = A + B. I am happy to add support for the PURPLE format, and it would be great if you could send me an example PURPLE file.

azhark2 avatar Jun 22 '22 18:06 azhark2

Thanks for your prompt response. I've created a pull request which should address both points: https://github.com/AlexandrovLab/SigProfilerMatrixGenerator/pull/81 Thanks! Hylke

hylkedonker avatar Jun 23 '22 11:06 hylkedonker

Thanks for reaching out, @azhark2 added support for PURPLE file format in v1.2.9 release.

mdbarnesUCSD avatar Sep 12 '22 20:09 mdbarnesUCSD

Hi, I am commenting on this old thread because I have tested matrix generation for PURPLE files and came across some limitations of the rounding.

First, be aware that because of rounding error, nothing ensures that round(A) + round(B) = round(A+B), for instance we can have A=0.55, B=0.55, ploidy = 1.1, and then the current solution leads to a spurious diploid region with round(A) = round(B) =1 and round(A) + round(B) = 2. It might be more prudent, if the goal is to obtain TCN and BCN, to rather use round(copyNumber) (the actual total copy number from PURPLE) and round(minorAlleleCopyNumber) in the code to avoid adding up rounding errors to compute the total copy number, although there would be cases again where this is different from round(copyNumber) - round(majorAlleleCopyNumber)...

Second, because of the rounding, some neighboring segments with close but non-identical decimal copy number values end up as being different segments although they have the exact same minor and major copy number after rounding, for example if one has copyNumber = 2.1 and the other 1.9... In my case this led to something that was attributed to artifactual signature CN23, although the decomposition was quite imperfect because I did have the excess of 1-10Mb segments characteristic of CN23 because of hypersegmentation, but no HD segments. I solved the issue by merging neighboring identical segments before running matrixGenerator, which made CN23 disappear and the corresponding segments to be correctly interpreted as part of CN1 instead.

I don't know what you want to do with this information, adapt a bit the code to import PURPLE outputs, or just put a warning in the documentation for people using PURPLE that they have to be careful and probably clean the input files themselves beforehand. I guess it is the consequence of a more general issue with PURPLE output interpretation : decimal copy number indicate subclonality, so rounding sort of focuses on clonal segments and ignores subclonal ones, but what is the correct threshold to say that something is subclonal is unclear.

Thanks again for the great tools!

Cheers,

Nicolas

nalcala avatar Apr 16 '23 14:04 nalcala

Hi Nicolas, thanks for the info, this is very interesting information. For your first point, I agree that round(copyNumber) may be better than round(A) + round(B), as shown by your example. I will work on changing that for PURPLE. As for your 2nd point, this is getting more into the specifics of PURPLE output and how the tool deals with the issue of clonality. The merging of segments is something that should be done prior to matrix generation, preferably automatically by the CN caller. Different callers will deal with clonality in different ways, for example, Battenberg will output an integer CN for each subclone, and in this case we will add these up (so a channel will consist of counts from all observations, including those coming from different clones). I'll consider adding a section to the documentation about the issue of clonality, which only really pops up for a few WGS callers.

azhark2 avatar Apr 20 '23 16:04 azhark2

I will work on changing that for PURPLE.

Has this issue for PURPLE been finalised? HMFTools has Sigs. But it is only for single nucleotide variants.

DarioS avatar May 09 '24 20:05 DarioS

Hi Dario, Thanks for following up on this. To my knowledge nothing has changed in HMFTools, so what we do is first round the total and minor copy numbers from the purple segmentation file as I explained above, then convert to the standard format with columns "sample", "Chromosome", "Tumour TCN", "Tumour BCN", and then run this little R script to connect neighbouring segments:

library(tidyverse)

CNVs = read_tsv("purple.cnv.somatic_sigprofilerformat.tsv")

CNVs.reseg = CNVs[1,]
j=2
for(i in 2:nrow(CNVs)){
  if( all(CNVs[i,c("sample","Chromosome","Tumour TCN","Tumour BCN")] == CNVs[i-1,c("sample","Chromosome","Tumour TCN","Tumour BCN")]) & 
      (CNVs$`End Position`[i-1]+1)==CNVs$`Start Position`[i] ){
    CNVs.reseg$`End Position`[j-1] = CNVs$`End Position`[i]
    print(c("merged segments:") )
    print(CNVs[(i-1):i,] )
  }else{
    CNVs.reseg = bind_rows(CNVs.reseg,CNVs[i,])
    j=j+1
  }
}

write_tsv(CNVs.reseg,"purple.cnv.somatic_sigprofilerformat_resegmented.tsv")

nalcala avatar May 10 '24 11:05 nalcala