ichorCNA

Negative log2ratios are called GAIN

Open lbeltrame opened this issue 6 years ago • 4 comments

Looking through a rather "unusual" sample (a cell line mixture, used to do preliminary evaluation of this program) I noticed that many negative ratios are called GAIN:

6       28500001        29000000        3       GAIN    -8.7995 0

This is inconsistent with the actual log2ratio value.
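A check like this can be automated. The following is a minimal sketch, assuming whitespace-separated rows laid out like the one above, with the fourth column read as the copy number, the fifth as the call, and the sixth as the log2 ratio. That column interpretation, the `Segment` type, and both helper functions are illustrative assumptions and are not part of ichorCNA.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    """One row of a segment table (column meanings are assumed, see above)."""
    chrom: str
    start: int
    end: int
    copy_number: int
    call: str
    log2_ratio: float


def parse_segment(line: str) -> Segment:
    """Parse one whitespace-separated row; any trailing columns are ignored."""
    fields = line.split()
    return Segment(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        copy_number=int(fields[3]),
        call=fields[4],
        log2_ratio=float(fields[5]),
    )


def find_negative_gains(
    lines: Iterable[str], gain_labels: Iterable[str] = ("GAIN",)
) -> List[Segment]:
    """Return segments labelled as a gain whose log2 ratio is negative."""
    labels = set(gain_labels)
    segments = (parse_segment(line) for line in lines if line.strip())
    return [s for s in segments if s.call in labels and s.log2_ratio < 0]


if __name__ == "__main__":
    rows = ["6       28500001        29000000        3       GAIN    -8.7995 0"]
    for seg in find_negative_gains(rows):
        print(f"{seg.chrom}:{seg.start}-{seg.end} {seg.call} {seg.log2_ratio}")
```

Only the `GAIN` label is checked by default because it is the only one shown in the row above; other labels can be passed through `gain_labels`.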

lbeltrame, Apr 19 '18 11:04

Hi Luca,

This is likely due to an incompatible or incorrect solution. Data points that have extreme values in the negative direction may sometimes be label-swapped in the prediction because these points do not fall within expected (or estimated) CN levels. Without knowing any further details about your situation, I need to follow up with several questions and possible suggestions based on your scenario.

  1. Is this the optimal solution? Are there other solutions with higher estimated tumor fractions that do not have this issue?

  2. What are your initial values of the --normal argument? Looking at the extreme magnitude of your deletion signal (negative log ratio), it appears you have a high tumor fraction sample (also since you noted this is a cell line). You may consider initializing --normal c(0.1, 0.2, 0.3, ... , 0.7, 0.8, 0.9). The 0.1 initialized solution may better account for your expected high tumor fraction.

  3. This is likely a homozygous deletion. Have you used the option --includeHOMD? This opens up another state, expecting log ratios even lower than the single copy deletion DEL.

  4. What is your bin size? I believe you are using 500kb bins, which is fine. If you decide to use smaller bins without a matched normal, be aware of germline events, which are smaller.
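To make the reasoning in points 2 and 3 concrete, here is a minimal sketch of the log2 ratio one would expect for a given copy number under a simple two-component mixture of normal (diploid) and tumor DNA. This is a simplified textbook model used for illustration, not necessarily the exact model ichorCNA fits; the function name and the default ploidy of 2 are assumptions.

```python
import math


def expected_log2_ratio(
    copy_number: float, normal_fraction: float, ploidy: float = 2.0
) -> float:
    """Expected log2 ratio of a tumor segment under a simple mixture model.

    The normal component contributes 2 copies weighted by normal_fraction;
    the tumor component contributes copy_number weighted by the remainder.
    The baseline uses the average tumor ploidy in place of copy_number.
    """
    if not 0.0 <= normal_fraction <= 1.0:
        raise ValueError("normal_fraction must be between 0 and 1")
    tumor_fraction = 1.0 - normal_fraction
    observed = 2.0 * normal_fraction + copy_number * tumor_fraction
    baseline = 2.0 * normal_fraction + ploidy * tumor_fraction
    if observed <= 0.0:
        # Homozygous deletion in a pure tumor sample: no reads expected.
        return float("-inf")
    return math.log2(observed / baseline)


if __name__ == "__main__":
    # Compare expected levels for several --normal style starting values.
    for normal_fraction in (0.1, 0.5, 0.9):
        levels = {
            cn: round(expected_log2_ratio(cn, normal_fraction), 3)
            for cn in (0, 1, 2, 3)
        }
        print(f"normal={normal_fraction}: {levels}")
```

Under this model, a homozygous deletion with a normal fraction of 0.1 has an expected log2 ratio of log2(0.1), roughly -3.32. A bin at -8.7995 lies far below even that level, so it does not sit near any modelled state unless the normal fraction is extremely small, which is consistent with the point being treated as an outlier.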

This is what I can think of for now.

Best, Gavin

gavinha, Apr 19 '18 13:04

Hello Gavin,

First of all I forgot to add that this same extreme data point is found also by QDNAseq with the same binning strategy, so it is either real or a data artifact from the start.

The samples are cell lines from which the supernatant was taken and DNA extracted to simulate ctDNA concentrations without risking precious samples. Of these I have 100% cells of one type, or mixtures of two (50-50, 90-10, 95-5). Depth is about 0.2X.

> Is this the optimal solution? Are there other solutions with higher estimated tumor fractions that do not have this issue?

Among the estimated solutions, those with lower tumor fractions don't have this issue. But given what you wrote below, it may be worth testing again first.

> You may consider initializing --normal c(0.1, 0.2, 0.3, ... , 0.7, 0.8, 0.9). The 0.1 initialized solution may better account for your expected high tumor fraction.

I will do so, thanks. Should this also be done for samples for which I know the percentage of "normal" (the other cell line)? I thought I could use "c(0.5, 0.9, 0.95)" in this case, as I know the exact relative amounts.

> What is your bin size? I believe you are using 500kb bins which is fine.

We're using 500kb because I saw that smaller bin sizes just increased the noise and raised the MAD.

lbeltrame, Apr 19 '18 14:04

For the record, it looks like the solution with 0.9 is the one picked when using the range specified here. I still see the region called as a gain in one sample, but I'm also seeing the drop in ratio in others. It may well be an artifact, but I want to make sure.

EDIT: I didn't say it before, but I'm not including HOMD events (--includeHOMD False).

lbeltrame, Apr 19 '18 14:04

Would be good to hear about your results after trying out a few of these parameter changes.

You'll want to use --includeHOMD True if you do anticipate HOMD events such as this outlier datapoint.

gavinha, Apr 19 '18 18:04