All `salmon.merged.gene_tpm*.tsv` files contain exactly the same values
Description of the bug
The salmon_tximport.r script produces three files with gene-level TPM values:
- salmon.merged.gene_tpm.tsv: stores the raw abundances produced by salmon at transcript level and summarized at gene level.
- salmon.merged.gene_tpm_scaled.tsv: stores the same abundances as in salmon.merged.gene_tpm.tsv but normalized by library size.
- salmon.merged.gene_tpm_length_scaled.tsv: stores the same abundances as in salmon.merged.gene_tpm.tsv but normalized by library size AND average transcript length.
As far as I understand, these three files should provide different values, but they are all identical.
$ cut -f2-8 salmon.merged.gene_tpm.tsv | head
gene_name X04S X04T X10S X10T X11S.FIDIS X11T.FIDIS
TSPAN6 24.132338 2.917993 54.918612 5.67429 19.277248 2.545522
TNMD 0.207567 0 0.105188 0 0 0
DPM1 18.345738 25.45351 18.982674 15.297568 23.218046 33.745858
SCYL3 3.834507 2.412431 4.602108 1.664488 3.152804 3.360443
C1orf112 1.694871 2.447275 1.708524 1.110848 1.143803 2.322585
FGR 2.92266 7.166522 0.787274 1.529363 0.639491 1.055823
CFH 24.313129 20.593207 24.559955 24.067773 62.227481 7.129656
FUCA2 6.561708 8.532212 5.228672 8.413182 8.060727 5.354021
GCLC 59.804869 31.302783 14.608898 10.564397 81.377122 12.485694
$ cut -f2-8 salmon.merged.gene_tpm_scaled.tsv | head
gene_name X04S X04T X10S X10T X11S.FIDIS X11T.FIDIS
TSPAN6 24.132338 2.917993 54.918612 5.67429 19.277248 2.545522
TNMD 0.207567 0 0.105188 0 0 0
DPM1 18.345738 25.45351 18.982674 15.297568 23.218046 33.745858
SCYL3 3.834507 2.412431 4.602108 1.664488 3.152804 3.360443
C1orf112 1.694871 2.447275 1.708524 1.110848 1.143803 2.322585
FGR 2.92266 7.166522 0.787274 1.529363 0.639491 1.055823
CFH 24.313129 20.593207 24.559955 24.067773 62.227481 7.129656
FUCA2 6.561708 8.532212 5.228672 8.413182 8.060727 5.354021
GCLC 59.804869 31.302783 14.608898 10.564397 81.377122 12.485694
$ cut -f2-8 salmon.merged.gene_tpm_length_scaled.tsv | head
gene_name X04S X04T X10S X10T X11S.FIDIS X11T.FIDIS
TSPAN6 24.132338 2.917993 54.918612 5.67429 19.277248 2.545522
TNMD 0.207567 0 0.105188 0 0 0
DPM1 18.345738 25.45351 18.982674 15.297568 23.218046 33.745858
SCYL3 3.834507 2.412431 4.602108 1.664488 3.152804 3.360443
C1orf112 1.694871 2.447275 1.708524 1.110848 1.143803 2.322585
FGR 2.92266 7.166522 0.787274 1.529363 0.639491 1.055823
CFH 24.313129 20.593207 24.559955 24.067773 62.227481 7.129656
FUCA2 6.561708 8.532212 5.228672 8.413182 8.060727 5.354021
GCLC 59.804869 31.302783 14.608898 10.564397 81.377122 12.485694
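The `head` output above only covers the first rows. A minimal Python sketch for checking whether the three tables agree over every row and column; the helper names `read_table` and `tables_identical` are hypothetical, and the files are assumed to be in the current directory:

```python
import csv
from pathlib import Path


def read_table(path):
    """Read a tab-separated file into a list of rows (each a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def tables_identical(paths):
    """Return True if every file holds exactly the same rows as the first one."""
    first = read_table(paths[0])
    return all(read_table(path) == first for path in paths[1:])


if __name__ == "__main__":
    # File names as reported above; assumed to sit in the current directory.
    files = [
        "salmon.merged.gene_tpm.tsv",
        "salmon.merged.gene_tpm_scaled.tsv",
        "salmon.merged.gene_tpm_length_scaled.tsv",
    ]
    if all(Path(name).exists() for name in files):
        print("identical:", tables_identical(files))
    else:
        print("TPM files not found in the current directory")
```

The comparison is on the text of each cell, so it reports a difference even when two files differ only in number formatting.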
The values in the salmon.merged.gene_counts*.tsv files, however, do differ:
$ cut -f2-8 salmon.merged.gene_counts.tsv | head
gene_name X04S X04T X10S X10T X11S.FIDIS X11T.FIDIS
TSPAN6 993 122 2761 398 1504 221
TNMD 3 0 2 0 0 0
DPM1 267.001 423.999 325 340 496.999 708.001
SCYL3 158 112.999 250 149 243 232
C1orf112 65 89.001 59 61.999 65 143
FGR 55.999 228.001 27 91 38 38.001
CFH 1233.141 1183.906 1511.945 2007.725 4892.099 547.579
FUCA2 199 310 217 469 422 278
GCLC 1471.001 839 477.999 394.001 3622 350.999
$ cut -f2-8 salmon.merged.gene_counts_scaled.tsv | head
gene_name X04S X04T X10S X10T X11S.FIDIS X11T.FIDIS
TSPAN6 226.442607359473 27.325455413908 680.982329517792 64.2363448394888 254.038754538895 33.225507361704
TNMD 1.94767753881881 0 1.30431499756981 0 0 0
DPM1 172.144810281281 238.358609027664 235.382233640515 173.17758754902 305.971240742805 440.468891412456
SCYL3 35.9805901532686 22.5911356640093 57.0654303231929 18.8429962425592 41.5481712672496 43.862289791676
C1orf112 15.9035982496995 22.9174312268986 21.1854344308093 12.5754614572495 15.0732246406671 30.31561503522
FGR 27.4243942226085 67.110674146165 9.76207633377169 17.3133007041859 8.42731790237031 13.781163493836
CFH 228.139035837605 192.844451548677 304.539658954821 272.461535442591 820.044011019246 93.060063088992
FUCA2 61.5709206564035 79.8996360128384 64.8347274116187 95.2422347376289 106.225590279175 69.883530431172
GCLC 561.171092048839 293.13394555701 181.148084946645 119.595270723432 1072.40113945931 162.969920477208
$ cut -f2-8 salmon.merged.gene_counts_length_scaled.tsv | head
gene_name X04S X04T X10S X10T X11S.FIDIS X11T.FIDIS
TSPAN6 1009.89538613071 145.674915501123 3036.13600233514 425.158554183428 1412.3339507741 179.663001697246
TNMD 2.8139289243869 0 1.88384728463668 0 0 0
DPM1 234.747151082247 388.540561304232 320.883372483089 350.469267458502 520.123153074086 728.266655861836
SCYL3 166.298457316012 124.81222537306 263.669668052293 129.247383249636 239.381748242847 245.798951903383
C1orf112 51.9171282120011 89.4294020564928 69.1384720045292 60.9243711627601 61.339554864032 119.991487594531
FGR 74.6100360692731 218.248331939976 26.550306790444 69.9023760823351 28.5804305459881 45.4584769004731
CFH 1111.04633324021 1122.63568145619 1482.66680176865 1969.19909647884 4978.38996909827 549.496208248179
FUCA2 200.825423561577 311.520742893486 211.406273589949 461.025004293187 431.908131185053 276.367308907589
GCLC 1470.36420604633 918.11032098626 474.493771440154 465.0455540331 3502.72531314 517.733030792211
Therefore, either I misunderstand what salmon does (most likely) or there is a bug.
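One way to relate the TPM and count files: for a given sample, dividing each value in salmon.merged.gene_counts_scaled.tsv by the corresponding TPM gives nearly the same factor for every gene (roughly 9.38 for X04S). That would be consistent with the scaled counts being the TPM values rescaled to the sample's total count, though this is an inference from the numbers, not something confirmed here. A small Python sketch of that arithmetic, using four X04S values copied from the output above (the helper names are hypothetical):

```python
# TPM and scaled counts for sample X04S, copied from the tables above.
tpm_x04s = {
    "TSPAN6": 24.132338,
    "TNMD": 0.207567,
    "DPM1": 18.345738,
    "SCYL3": 3.834507,
}
scaled_counts_x04s = {
    "TSPAN6": 226.442607359473,
    "TNMD": 1.94767753881881,
    "DPM1": 172.144810281281,
    "SCYL3": 35.9805901532686,
}


def ratios(scaled, tpm):
    """Scaled count divided by TPM, for every gene with a non-zero TPM."""
    return {gene: scaled[gene] / tpm[gene] for gene in tpm if tpm[gene] > 0}


def relative_spread(values):
    """(max - min) / min; close to zero when all values are nearly equal."""
    values = list(values)
    return (max(values) - min(values)) / min(values)


x04s_ratios = ratios(scaled_counts_x04s, tpm_x04s)
for gene, ratio in x04s_ratios.items():
    print(f"{gene}: {ratio:.4f}")
print(f"relative spread: {relative_spread(x04s_ratios.values()):.2e}")
```

The small remaining spread comes from the TPM values being printed with only six decimal places.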
Command used and terminal output
$ nextflow -Dnxf.pool.type=sync -log nf.log run -with-timeline -with-trace -with-report -w work -dump-hashes nf-core/rnaseq -r 3.8.1 \
-profile singularity \
--outdir results \
--input "data/sample_data.csv" \
--fasta $(realpath data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz) \
--gtf $(realpath data/Homo_sapiens.GRCh38.106.gtf.gz) \
--remove_ribo_rna \
--save_reference \
-process.cache='lenient' \
--ribo_database_manifest $(realpath data/rrna_db.txt)
Relevant files
No response
System information
- Nextflow version: 22.04.3
- Hardware: HPC
- Executor: Slurm
- Container engine: Singularity
- OS: CentOS 7.9
- Version of nf-core/rnaseq: 3.8.1
Hi @rob-p ! Would love your input on this please?
pinging @mikelove, who might have some insight on the tximport behavior here (or if the issue likely resides elsewhere).
tximport doesn’t modify abundances. So when you write out those matrices, they are (as far as tximport is concerned) unchanged from the Salmon TPM column.
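To illustrate the point above, here is a minimal Python sketch of the rescaling that tximport's `countsFromAbundance` options `scaledTPM` and `lengthScaledTPM` are documented to perform: counts are rebuilt from the abundances (optionally weighted by the average length across samples) and then scaled so each sample keeps its original total count. This is a sketch written from that description, not tximport's actual code; the function names and the toy matrices (3 genes by 2 samples) are hypothetical.

```python
def column_sums(matrix):
    """Sum each column of a genes x samples list of lists."""
    return [sum(column) for column in zip(*matrix)]


def counts_from_abundance(counts, abundance, lengths, method):
    """Rebuild counts from abundances; the abundance matrix is never modified."""
    if method == "scaledTPM":
        # Start from the abundances as they are.
        new_counts = [row[:] for row in abundance]
    elif method == "lengthScaledTPM":
        # Weight each gene's abundances by its average length across samples.
        new_counts = [
            [value * (sum(gene_lengths) / len(gene_lengths)) for value in gene_abundance]
            for gene_abundance, gene_lengths in zip(abundance, lengths)
        ]
    else:
        raise ValueError(f"unsupported method: {method}")
    # Scale every sample (column) back to its original total count.
    targets = column_sums(counts)
    currents = column_sums(new_counts)
    return [
        [value * target / current for value, target, current in zip(row, targets, currents)]
        for row in new_counts
    ]


# Toy inputs (hypothetical values, not taken from the report above).
abundance = [[600000.0, 500000.0], [300000.0, 400000.0], [100000.0, 100000.0]]
counts = [[900.0, 1000.0], [600.0, 700.0], [500.0, 300.0]]
lengths = [[1000.0, 1200.0], [2000.0, 2000.0], [500.0, 700.0]]

scaled = counts_from_abundance(counts, abundance, lengths, "scaledTPM")
length_scaled = counts_from_abundance(counts, abundance, lengths, "lengthScaledTPM")
print("scaledTPM counts:      ", scaled)
print("lengthScaledTPM counts:", length_scaled)
print("abundance (unchanged): ", abundance)
```

In this sketch only the count matrices differ between the two methods; the abundance matrix that would be written to a TPM file is the same in every case, which matches the identical gene_tpm*.tsv files reported above.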