Integer overflow in metadata info file

Open gringer opened this issue 3 years ago • 2 comments

Using salmon alevin v1.9.0, I noticed that my total reads were less than the deduplicated UMI count when I combined all three libraries together (from a NovaSeq run):

{
    "total_reads": 284216343,
    "reads_with_N": 165542,
    "noisy_cb_reads": 1240522569,
    "noisy_umi_reads": 6297,
    "used_reads": 3338489231,
    "mapping_rate": 52.32744469106451,
    "reads_in_eqclasses": 2396169786,
    "total_cbs": 48399818,
    "used_cbs": 867051,
    "initial_whitelist": 49593,
    "low_conf_cbs": 1000,
    "num_features": 5,
    "final_num_cbs": 40432,
    "deduplicated_umis": 359865640,
    "mean_umis_per_cell": 8900,
    "mean_genes_per_cell": 2814
}

I suspect this has happened due to an integer overflow: 284216343 + 2^32 = 4579183639, which matches the total count that I get when I add the total reads from each barcoded sample together:

==> salmon_1.9_OG_2022-Oct-13_S1/aux_info/alevin_meta_info.json <==
{
    "total_reads": 1550672340,
    "reads_with_N": 56210,
    "noisy_cb_reads": 465287865,
    "noisy_umi_reads": 3313,
    "used_reads": 1085324952,
    "mapping_rate": 52.507052779441469,
    "reads_in_eqclasses": 814212344,
    "total_cbs": 32341973,
    "used_cbs": 1550909,
    "initial_whitelist": 28000,
    "low_conf_cbs": 991,
    "num_features": 5,
    "final_num_cbs": 18888,
    "deduplicated_umis": 113155025,
    "mean_umis_per_cell": 5990,
    "mean_genes_per_cell": 2035
}

==> salmon_1.9_OG_2022-Oct-13_S2/aux_info/alevin_meta_info.json <==
{
    "total_reads": 1371374162,
    "reads_with_N": 50003,
    "noisy_cb_reads": 389036191,
    "noisy_umi_reads": 3005,
    "used_reads": 982284963,
    "mapping_rate": 54.0580725189425,
    "reads_in_eqclasses": 741338439,
    "total_cbs": 30332499,
    "used_cbs": 1470602,
    "initial_whitelist": 28000,
    "low_conf_cbs": 997,
    "num_features": 5,
    "final_num_cbs": 19134,
    "deduplicated_umis": 127624221,
    "mean_umis_per_cell": 6670,
    "mean_genes_per_cell": 2229
}

==> salmon_1.9_OG_2022-Oct-13_S3/aux_info/alevin_meta_info.json <==
{
    "total_reads": 1657137137,
    "reads_with_N": 59329,
    "noisy_cb_reads": 447471964,
    "noisy_umi_reads": 3629,
    "used_reads": 1209602215,
    "mapping_rate": 55.061293216313938,
    "reads_in_eqclasses": 912441138,
    "total_cbs": 33411349,
    "used_cbs": 1567701,
    "initial_whitelist": 28000,
    "low_conf_cbs": 997,
    "num_features": 5,
    "final_num_cbs": 18395,
    "deduplicated_umis": 125889439,
    "mean_umis_per_cell": 6843,
    "mean_genes_per_cell": 2248
}
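
The arithmetic behind this suspicion can be checked with a short Python sketch. The per-sample values are copied from the three files above. The assumption that the counter is truncated to an unsigned 32-bit integer is a guess based on the numbers, not something confirmed from the salmon source.

```python
# "total_reads" values copied from the three per-sample alevin_meta_info.json files.
per_sample_total_reads = {
    "S1": 1550672340,
    "S2": 1371374162,
    "S3": 1657137137,
}

# "total_reads" as reported by the combined run.
reported_combined = 284216343

# Assumption: the value wraps as an unsigned 32-bit integer would.
UINT32_MODULUS = 2**32

true_total = sum(per_sample_total_reads.values())
wrapped_total = true_total % UINT32_MODULUS

print(true_total)                          # 4579183639
print(wrapped_total)                       # 284216343
print(wrapped_total == reported_combined)  # True
```

The sum of the per-sample totals, reduced modulo 2^32, reproduces the combined value exactly, which is consistent with a single wrap-around.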

To Reproduce

Steps and data to reproduce the behavior:

  1. Run salmon alevin on more than 2^32 sequenced reads

Specifically, please provide at least the following information:

  • Which version of salmon was used? v1.9.0
  • How was salmon installed (compiled, downloaded executable, through bioconda)? binary download from github
  • Which reference (e.g. transcriptome) was used? Gencode Human v41 + CHM13 v2.0 assembly
  • Which read files were used? BD Rhapsody + NovaSeq
  • Which program options were used?
[cell barcodes were pre-corrected and merged using my own [custom script](https://gitlab.com/gringer/bioinfscripts/-/blob/master/synthSquish.pl)]
salmon alevin -l ISR \
  -1 $(ls demultiplexed/squished_${machineID}*_R1_001.fastq.gz | sort) \
  -2 $(ls demultiplexed/${machineID}*_R2_001.fastq.gz | sort) \
  -i ${indexDir}/${indexName} --expectCells ${expectCellCount} \
  -p 10 -o salmon_1.9_cbc_${projectID}_combined --tgMap ${indexDir}/txp2gene_${targetName}.txt \
  --umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'

Expected behavior

{
    "total_reads": 4579183639,
    "reads_with_N": 165542,
    "noisy_cb_reads": 1240522569,
    "noisy_umi_reads": 6297,
    "used_reads": 3338489231,
    "mapping_rate": 52.32744469106451,
    "reads_in_eqclasses": 2396169786,
    "total_cbs": 48399818,
    "used_cbs": 867051,
    "initial_whitelist": 49593,
    "low_conf_cbs": 1000,
    "num_features": 5,
    "final_num_cbs": 40432,
    "deduplicated_umis": 359865640,
    "mean_umis_per_cell": 8900,
    "mean_genes_per_cell": 2814
}

Desktop (please complete the following information):

  • OS/Version: Linux musculus 5.18.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.18.16-1 (2022-08-10) x86_64 GNU/Linux

gringer (Oct 17 '22 09:10)

Thanks for this bug report @gringer! I have pushed a change to develop that should address this. Would you need me to produce an executable to test this out?

rob-p (Oct 27 '22 18:10)

No, it's fine. I understand the error, it doesn't affect any of my workflow, and I can easily compensate for it.

gringer (Oct 27 '22 20:10)
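
For anyone else who needs to compensate on an affected version, here is a minimal Python sketch of one possible approach. It relies on an observation from the four JSON files in this thread, not on documented behaviour: in each of them, used_reads + noisy_cb_reads + reads_with_N + noisy_umi_reads equals the true total_reads. The helper name `corrected_total_reads` is hypothetical, and the sketch assumes the component counters have not wrapped themselves, which may not hold for even larger runs.

```python
# Assumption: the counter wraps as an unsigned 32-bit integer would.
UINT32_MODULUS = 2**32

# Fields that sum to total_reads in the files shown in this thread
# (an empirical observation, not a documented invariant).
COMPONENT_FIELDS = ("used_reads", "noisy_cb_reads", "reads_with_N", "noisy_umi_reads")


def corrected_total_reads(meta):
    """Return total_reads, repaired if it looks like it wrapped past 2^32."""
    component_sum = sum(meta[field] for field in COMPONENT_FIELDS)
    reported = meta["total_reads"]
    if reported == component_sum:
        return reported  # nothing to repair
    difference = component_sum - reported
    if difference > 0 and difference % UINT32_MODULUS == 0:
        return component_sum  # reported value is short by a multiple of 2^32
    raise ValueError("total_reads does not match the component counts")


# Relevant fields from the combined run reported in this issue.
combined = {
    "total_reads": 284216343,
    "reads_with_N": 165542,
    "noisy_cb_reads": 1240522569,
    "noisy_umi_reads": 6297,
    "used_reads": 3338489231,
}

print(corrected_total_reads(combined))  # 4579183639
```

The function raises instead of guessing when the difference is not a whole multiple of 2^32, so a file that is inconsistent for some other reason is not silently "repaired".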