cnvkit Call function return infinite number without log2 and abs CN.

Got an issue when I used the "call" function to calculate the absolute copy number with CI filter, tumor purity, and mode "clonal". I applied segmetrix with CI to generate CNS file. The cnvkit version is 0.9.8. The command I uses is here:

cnvkit.py segment -p 30 --drop-low-coverage -o CNS_file CNR_file cnvkit.py segmetrics CNR_file -s CNS_file --drop-low-coverage -o Segmetrics_CNS --ci --bootstrap 100 cnvkit.py call Segmetrics_CNS --filter ci -m clonal --purity 0.6 --drop-low-coverage

Here is the output example and I also attached the whole chr7 file for testing:

Part of CNR file chromosome start end gene depth log2 weight chr7 55142130 55142501 EGFR 270.456 0.708584 0.948729 chr7 55143245 55143516 EGFR 224.114 0.42487 0.930992 chr7 55146501 55146797 EGFR 213.872 0.90014 0.902366 chr7 55151199 55151407 EGFR 255.567 0.896409 0.930861 chr7 55152412 55152490 EGFR 297.462 1.14185 0.886307 chr7 55152492 55152694 EGFR 356.391 0.961422 0.92668 chr7 55153921 55154196 EGFR 206.844 0.446968 0.60015 chr7 55155739 55155982 EGFR 14177.8 5.93676 0.898039 chr7 55156458 55156538 EGFR 9044.49 6.28239 0.802126 chr7 55156556 55156631 EGFR 11817.4 6.30456 0.825121 chr7 55156658 55156867 EGFR 12254 6.28782 0.92162 chr7 55157596 55157772 EGFR 12223.5 6.41487 0.844637 chr7 55160097 55160289 EGFR 13647.7 6.61096 0.940684 chr7 55160291 55160391 EGFR 14064.7 6.54615 0.907539 chr7 55161417 55161665 EGFR 10694.9 6.21642 0.612706 chr7 55163677 55163829 EGFR 6728.44 6.22588 0.830213 chr7 55165247 55165364 EGFR 11537 6.62981 0.902731 chr7 55165371 55165513 EGFR 14056.2 6.7279 0.89137 chr7 55166188 55166353 EGFR 11421.7 6.25913 0.801735 chr7 55168413 55168609 EGFR 9115.29 6.2084 0.893011 chr7 55170198 55170401 EGFR 10596.3 6.21157 0.850611 chr7 55170423 55170631 EGFR 10044.1 6.20809 0.697185 chr7 55171092 55171272 EGFR 9704.37 6.42278 0.826387 chr7 55172882 55173156 EGFR 11007.5 6.12229 0.891947 chr7 55173779 55174097 EGFR 12955.6 6.50511 0.85122 chr7 55174683 55174860 EGFR 11168.8 6.42879 0.803283 chr7 55181183 55181542 EGFR-AS1 10416.9 6.36304 0.715062 chr7 55191654 55191774 EGFR 11614 6.53411 0.854608 chr7 55191797 55191940 EGFR 13835.6 6.5693 0.862205 chr7 55192693 55192884 EGFR 10807.4 6.33958 0.886439 chr7 55198649 55198933 EGFR 8280.11 6.07231 0.865077 chr7 55200232 55200463 EGFR 16582.2 6.07382 0.838151 chr7 55201129 55201298 EGFR 8808.38 6.1591 0.877357 chr7 55201650 55201844 EGFR 13407.9 6.29408 0.905556 chr7 55202531 55202837 EGFR 13588.1 6.31954 0.785208 chr7 55205187 55205435 EGFR 13262.6 6.47704 0.898544 chr7 55205439 55205513 EGFR 13652.4 6.38079 0.802199 chr7 55205559 55205673 EGFR 14608.6 6.46327 0.823021 chr7 55205719 55205896 EGFR 20892.3 6.29991 0.904913

Part of CNS chromosome start end gene log2 depth probes weight chr7 55142130 55154196 EGFR 0.797524 263.491 7 6.12608 chr7 55155739 55431638 EGFR;EGFR-AS 6.31171 11736.8 45 38.5273

Part of segmetrix CNS chromosome start end gene log2 depth probes weight ci_lo ci_hi chr7 55142130 55154196 EGFR 0.797524 263.491 7 6.12608 0.604312 0.954002 chr7 55155739 55431638 EGFR;EGFR-AS 6.31171 11736.8 45 38.5273 6.25559 6.36341

Part of call CNS chromosome start end gene log2 cn depth probes weight chr7 44034105 72278898 EGFR;EGFR-AS -9223372036854775808 1748 1422.59

example.cnr.txt example.cns.txt example.segmetrics.call.cns.txt example.segmetrics.cns.txt

Jul 13 '21 05:07 zhuhenan

Hi @zhuhenan ,

Not an author of CNVkit, but thanks a bunch for your very detailed issue and shared files (very helpful to reproduce) !

Precisions about your issue

Could reproduce from call step on my own machine and this happens with CNVkit v0.9.9 too (and v0.9.10.dev0)
Minimal command to reproduce is: cnvkit.py call -o /dev/stdout --filter ci -m clonal --purity 0.6 example.segmetrics.cns.txt (then to catch problematic segment: | grep -E "^chrom|EGFR" | cut -f4 --complement)
Having only segments intersecting "chr7:44034105-72278898" is enough to reproduce (attached, this sub-cns file with simplified gene annotation)
Not any error message produced

Detailed tracking

Starting at this line, we are sent to segfilters module (and ci function)
As every consecutive segment is of type "gain", calling squash_by_groups(segarr, pd.Series(levels)) returns a single squashed segarr => But above all (and heart of the problem IMHO): this squashed segarr has "log2=nan" (and also "depth=nan")
Then we go back to call module and as we are in -m clonal --purity, this part is executed: https://github.com/etal/cnvkit/blob/25d91cdb962639c98d041c3b05c7de659c41dc70/cnvlib/call.py#L28-L35
Result of absolute_clonal() call is np.array([np.nan]) (i.e.: 1-sized array with a NaN inside) => Which causes observed "infinite-like" CN value of "-9223372036854775808", during rounding operation: https://github.com/etal/cnvkit/blob/25d91cdb962639c98d041c3b05c7de659c41dc70/cnvlib/call.py#L53

Remark; Result is exactly the same with simply -m clonal (no purity specified), but with absolute_pure() function

Possible origin

Looking more carefully at shared files, I noticed within "chr7:44034105-72278898" there are 2 low-coverage segments with no weight value (and more in whole "example.segmetrics.cns.txt") => Can be found back in shared ".cnr", always corresponding to low-coverage bins (depth==0) => Filling these with dummy weight values within "segmetrics.cns" is enough to stop having "infinite-like" CN after call -m clonal
Also grouping operation causing "log2=depth=NaN", even though "NaN" values within original segarr were at "weight" column and not "log2" nor "depth" ones, can be explained by: => np.average(x, weight=y) returning NaN when any NaN in provided weights (see SO topic) => And this is done twice within squash_region(): for log2 and for depth columns => Maybe a sanity check should be added somewhere upstream in call process ?

Other proposition

I found "squashing" procedure very disturbing at 1st sight => But this is actually well described in this section of documentation (last paragraph) => But maybe this could be a bit more highlighted ? (as a "Note" for example) => Especially as --filter ci is made automatically within batch to produce ".call.cns"

Hope this helps. Have a nice day. Felix.

Jul 13 '21 14:07 tetedange13

@zhuhenan , to sum up my above answer: looks like you have a few segments with no (empty) "weight" values in your ".cns" files => They are found also within your shared ".cnr" (as they have been "propagated")

(could be an expected behaviour with low-coverage bins, not 100% sure)

Could you please give us details about how you obtained this ".cnr" ? => Some info about your dataset => CNVkit commands you ran with their parameters

Thanks again. Kind regards. Felix.

Jul 13 '21 14:07 tetedange13

Hi @tetedange13,

Thank you for the investigation. I used "batch" method to generate cnr file. When I checked the cnr file, only antitarget regions didn't have weight values. This also happened when I checked the full cnr file, it looks like all my cnr antitarget regions didn't have weight values.

Here is the command I used: cnvkit.py batch [all_tumor_bam] -n [all_normal_bam] --drop-low-coverage -p 30 -f [hg38.fasta] -t [SeqCab liftover hg38 captured target bed] --annotate [UCSC hg38 reflat bed] -g [5k access file] --output-reference [pool normal output]

Best, Henan

Jul 14 '21 17:07 zhuhenan

Hi @zhuhenan, thanks for your feedback ! A few more questions if you do not bother:

About your current files

Regarding ".cnr" you shared, all "Antitarget" with empty "weights" have also depth == 0 => Is it also the case for rest of your ".cnr" ? => Because it should not, you should also have some "Antitarget" with depth != 0 (even if very low depth)
Also as you are running on "[all_tumor_bam]", could you tell us if you are experiencing this "empty-weights" issue on all your "Sample-tumor.cnr"? Or only on several? Or only on a single sample?
Could you check for any empty values among each of your "sample.antitargetcoverage.cnn" files? And also among "Antitarget" regions from your "[pool normal output].cnn" ? (like in "depth" or "log2" columns)

Keep aside your current files and then

If you can, please start by updating to CNVkit v0.9.9, to be sure to exlude something fixed since v0.9.8 => Then maybe re-run same batch command and see if produced ".cnr" files still have these empty weight values ? => And if this does not take forever, could be ideal to run with batch -p 1 (some errors are hard to catch when parallelized)
If all your ".cnn" files are sane (no empty values somewhere), it is most likely something coming from fix step => Try to simply run: cnvkit.py fix sample.targetcoverage.cnn sample.antitargetcoverage.cnn [pool normal output].cnn => With a sample that showed empty-weights before and check produced ".cnr" for empty-weights at "Antitargets" ?

Thanks again ! Best, Felix.

Jul 16 '21 13:07 tetedange13

Hi @tetedange13,

I figure out why my antitarget regions only showed 0 in all cnr file. Somehow, my pipeline only printed reads in the target regions into the bam file during the base quality score recalibration.

Thank you so much for answering my post.

Best, Henan

Jul 19 '21 19:07 zhuhenan

Hi @zhuhenan,

Thanks for your feedback! However I am not sure I well understand your answer

Where are the "0" in all your ".cnr" files? In the "depth" column? You mean that all your "Antitarget" regions had depth == 0 ?
The problem was in upstream step you used to generate your BAM ? Your BAM had absolutely 0 reads overlapping "Antitarget" regions?
Is the fact of correcting your BAM allowed you to have proper CNVkit results?

Thanks again for your answer. Have a nice day. Felix.

Jul 20 '21 07:07 tetedange13

Hi @tetedange13,

Sorry for the late reply. I just finished the rerun of all my bam files. Now I have proper CNVkit results.

I used GATK best practices pipeline for the bam file. I applied the sequencing target file as an interval list as all my samples are WES. So during the base quality score recalibration step, the pipeline only print reads on the interval list. This causes no coverage of the anti-target region issue. However, I don't know why some of my samples have anti-target region coverage but some of them do not. But I don't have time to diagnose further now.

Thank you for helping me with this issue, you can close this post now.

Best, Henan

Sep 16 '21 18:09 zhuhenan

Hi @zhuhenan,

Thanks for your feedback, I am glad it helped !

To conclude: This was due to an upstream step (BQSR) that caused exclusion of any read not overlapping your "panel.bed". Then:

These "depth==0" at antitarget regions became "NaN" values in weight and log2 (this part is still a bit unclear)
"log2==NaN" situation further caused absolute_*() functions to return an infinite-like value during rounding operation, without any error

Maybe someone can try to reproduce this, running your batch command above on a BAM with reads overlapping antitargets filtered out first (something like samtools view -L vendor.bed sample.bam > for_CNVkit_DEBUG.bam)
Hope this helps. Have a nice day. Felix.

Sep 17 '21 10:09 tetedange13