Different results for v1.2.0 and v1.2.2 and v1.3.0
Hello, thanks for updates for your tool.
We're trying to update our docker containers with your tool but registered different results for v1.2.0 and v1.2.2:
Run commands:
/soft/msisensor-pro-1.2.0/bin/msisensor-pro msi -d /ref/human_g1k_v37.list -n /inputs/normal.bam -t /inputs/tumor.bam -b 32 -o /outputs/output/test_pair
/soft/msisensor-pro-1.2.2/bin/msisensor-pro msi -d /ref/human_g1k_v37.list -n /inputs/normal.bam -t /inputs/tumor.bam -b 32 -p 8 -o /outputs/output/test_pair
First of all - number of sites and windows changed:
v1.2.0:
Total loading windows: 5804
Total loading homopolymer and microsatellites: 1786895
...
*** Summary information ***
Number of total sites: 1786895
Number of sites with enough coverage: 58
Number of MSI sites: 0
v1.2.2:
Total loading windows: 5696
Total loading homopolymer and microsatellites: 363133
...
*** Summary information ***
Number of total sites: 363133
Number of sites with enough coverage: 0
Number of MSI sites: 0
And more important - we have empty germline result now:
v1.2.0:
chromosome location left_flank_bases repeat_times repeat_unit_bases right_flank_bases genotype difference P_value FDR
19 2008302 CACTC 13 T CTAGG 13|12 0.21678 0.54982 1.063
19 2011861 TTTCC 10 T GAGAC 10|10 0 1 1.3488
19 2017120 CGTCT 6 CAAA ACAAA 6|6 0.08 0.28932 1.1986
19 2017151 AAAAC 10 A CTAGG 10|10 0.076167 0.57701 1.0458
19 2018642 TGGAC 12 T AAAGA 12|12 0.10526 0.33506 0.97168
19 2018961 ACAGG 6 CA CGGGA 6|6 0.076923 0.37455 0.90517
19 2019980 CTTTT 6 TTCC TTCTT 6|6 0 1 1.0175
19 2022596 TTTCC 11 T CCCTG 11|11 0.082596 0.81839 1.1577
19 2025188 AATAA 12 T GAGAC 12|12 0.42105 0.037295 2.1631
19 2025522 CCTTC 5 CT CCTTC 5|5 0 1 1.234
19 2027127 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.1688
19 2027144 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.1988
19 2027161 ACCAG 5 AC AGGGC 5|5 0 1 1.3182
19 2027495 GAGCC 10 AG AAAGA 10|10 0.21053 0.30538 1.1808
19 2028158 CCTCT 11 A TTACA 11|11 0 1 1.2609
19 2031306 ATCTC 12 A TAAAA 11|11 0.40209 0.1242 1.2006
19 2032700 GTGCC 5 TTCA ATCTG 5|5 0 1 1.2083
19 2036371 AAAGG 12 A TACAA 12|12 0.34862 0.48119 1.0734
19 2039434 ACAGG 10 A CCCAA 11|10 0.13397 0.46656 1.0824
19 2046939 ACTCA 5 GT GGATT 5|5 0 1 1.16
19 2049907 CAGAG 7 AC AAACA 7|7 0.097561 0.21243 1.369
19 2049923 CACAA 5 AC ATACA 5|5 0 1 1.0943
19 2052598 TAAGA 7 GT GCACG 7|7 0.071429 0.22779 1.101
19 2058828 GGCTT 5 TTTG TTTTG 4|5 0.28571 0.21417 1.1293
19 2068016 CGTTT 12 A TGTGT 12|12 0.19048 0.23128 1.0319
19 2069060 CCTTT 11 A TGGCA 11|11 0.09482 0.59809 1.0203
19 2069767 CTCAG 5 AACA ACTAG 5|5 0 1 1.0741
19 2074245 GAGAT 5 CAAA AAAAC 5|5 0 1 1.0545
19 2074718 GCTAA 11 T CCCCT 11|11 0 1 1.0357
19 2075031 TTTTC 15 T GAGAC 15|15 0.6255 0.12035 1.7451
19 2077400 GTCTC 11 A GAAAA 11|11 0.38462 0.18621 1.35
19 2077787 AATCC 13 T GAGAC 13|13 0.16667 0.51553 1.0679
19 2080383 ACCTC 11 T GAGGC 11|11 0 1 1
19 2083003 ATTTT 5 TTG TATTT 5|4 0.35897 0.071532 2.0744
19 2084890 CCTAG 10 T ATTTA 10|10 0 1 1.1373
19 2090669 CGTTA 11 T AAGAC 11|11 0.0625 0.30973 1.1228
19 2092000 AACAG 10 A GAGAA 10|10 0.076923 0.33953 0.93774
19 2093160 GGGAA 13 T CGAGA 13|13 0.24561 0.60897 1.0092
19 2095176 CTCAA 7 AAAC AAAAA 7|7 0.08 0.6604 1.0352
19 2095756 TGTTC 12 T CCCCC 12|12 0.094845 0.57166 1.0696
19 2103219 TTATT 11 A TGGGA 11|11 0.25 0.10777 2.0835
19 2125834 GCTAG 11 A CAAAA 12|12 0.27338 0.37271 0.93987
19 2133510 CCAGC 12 T GAGAC 12|12 0.086957 0.61346 0.98836
19 2137313 ATGTG 13 T CTTGG 13|13 0.13333 0.31937 1.0291
19 2140479 TTGAC 13 T GAGAC 13|14 0.32558 0.32124 0.98061
19 2142754 TTCTG 12 T CTTTT 11|11 0.10526 0.5009 1.076
19 2143763 GCCTC 10 A TAAAA 10|10 0.094708 0.57759 1.0152
19 2144904 TCCAA 5 AAAC AAAAC 5|5 0 1 1.1154
19 2145758 ACCTC 6 CA CCAAG 6|6 0 1 1.381
19 2147882 GTCTC 11 A GAAAA 11|11 0.13421 0.52212 1.0442
19 2152539 CGTCT 10 A GAGAG 10|10 0 1 1.1837
19 2157097 CCCAT 5 CATC CACCC 5|5 0.086957 0.67485 1.03
19 2158999 TATAA 11 T AATGG 11|10 0.27869 0.31511 1.0751
19 2159210 TTCCA 5 TC TTTTT 5|5 0.30769 0.12186 1.4136
19 2166076 ATTTA 5 TATGT TGTTA 5|5 0 1 1.2889
19 2172506 CTTTC 13 T GAAGA 13|13 0.2439 0.35051 0.92408
19 2177298 AACAA 12 T GGGAG 12|12 0.16 0.17247 1.4291
19 2183629 AGTGC 14 T AAGGC 14|14 0.26087 0.21288 1.2347
v1.2.2:
chromosome location left_flank_bases repeat_times repeat_unit_bases right_flank_bases genotype difference P_value FDR
Is this intended? Is this due to MinMicrosate and MaxMicrosate change in commit https://github.com/xjtu-omics/msisensor-pro/commit/3d3d7b8105168e522619b86b240623599dabd9bd ? Looks like now zero sites have enough coverage but covCutoff is same (15).
We've found reason for empty result - we used /ref/human_g1k_v37.list from v1.2.0 scan command for v1.2.2 (we don't know that they will change after update).
==> human_g1k_v37.list_v1.2.0 (71504K size) <==
chromosome location repeat_unit_length repeat_times repeat_unit_bases left_flank_bases right_flank_bases
1 10108 6 6 AACCCT AACCC AACCC
1 10147 6 5 CCCTAA CTAAC CCTAA
1 10177 6 9 CCTAAC CCTAA CCCTA
1 10255 6 5 AACCCT CCCTA AACCC
1 10285 6 6 AACCCC ACCCT AACCC
1 10352 6 6 ACCCTA ACCCT ACCCC
1 10397 6 7 CCCTAA CTAAC CCCCT
1 26453 2 6 GT TGTTA TTGCT
1 28588 1 15 T GGTGG GCATC
==> human_g1k_v37.list_v1.2.2 (54029K size) <==
chromosome location repeat_unit_length repeat_unit_binary repeat_times left_flank_binary right_flank_binary repeat_unit_bases left_flank_bases right_flank_bases
1 26453 2 11 6 956 999 GT TGTTA TTGCT
1 28588 1 3 15 698 589 T GGTGG GCATC
1 30867 2 7 12 885 319 CT TCTCC CATTT
1 30910 2 13 6 847 627 TC TCATT GCTAT
1 30935 2 13 6 255 1015 TC ATTTT TTTCT
1 31719 1 0 14 466 711 A CTCAG GTACT
1 31807 2 1 5 51 275 AC AATAT CACAT
1 33449 1 0 15 335 645 A CCATT GGACC
1 33530 1 3 11 1021 204 T TTTTC ATATA
After rescan we got new results:
v1.2.2:
Total loading windows: 5804
Total loading homopolymer and microsatellites: 1789437
*** Summary information ***
Number of total sites: 1789437
Number of sites with enough coverage: 63
Number of MSI sites: 0
chromosome location left_flank_bases repeat_times repeat_unit_bases right_flank_bases genotype difference P_value FDR
19 2008302 CACTC 13 T CTAGG 13|12 0.21678 0.54982 0.96218
19 2011861 TTTCC 10 T GAGAC 10|10 0.16667 0.13451 1.6948
19 2017120 CGTCT 6 CAAA ACAAA 6|6 0.08 0.28932 1.0126
19 2017151 AAAAC 10 A CTAGG 10|10 0.1875 0.52981 0.95366
19 2018642 TGGAC 12 T AAAGA 12|12 0.10526 0.33506 0.84436
19 2018961 ACAGG 6 CA CGGGA 6|6 0.076923 0.37455 0.78656
19 2019980 CTTTT 6 TTCC TTCTT 6|6 0 1 1.125
19 2022596 TTTCC 11 T CCCTG 11|11 0.082596 0.81839 1.0312
19 2025188 AATAA 12 T GAGAC 12|12 0.42105 0.037295 2.3496
19 2025522 CCTTC 5 CT CCTTC 5|5 0 1 1.2353
19 2027127 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.0364
19 2027144 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.058
19 2027161 ACCAG 5 AC AGGGC 5|5 0 1 1.0678
19 2027495 GAGCC 10 AG AAAGA 10|10 0.21053 0.30538 0.96194
19 2028158 CCTCT 11 A TTACA 11|11 0 1 1.05
19 2031306 ATCTC 12 A TAAAA 11|11 0.40209 0.1242 2.6082
19 2032700 GTGCC 5 TTCA ATCTG 5|5 0 1 1.0862
19 2036371 AAAGG 12 A TACAA 12|12 0.36184 0.50028 0.95507
19 2039434 ACAGG 10 A CCCAA 11|10 0.15982 0.4243 0.86228
19 2046939 ACTCA 5 GT GGATT 5|5 0 1 1.0328
19 2049907 CAGAG 7 AC AAACA 7|7 0.097561 0.21243 1.1153
19 2049923 CACAA 5 AC ATACA 5|5 0 1 1.1455
19 2052598 TAAGA 7 GT GCACG 7|7 0.071429 0.22779 0.89691
19 2058828 GGCTT 5 TTTG TTTTG 4|5 0.28571 0.21417 0.96376
19 2068016 CGTTT 12 A TGTGT 12|12 0.19048 0.23128 0.85711
19 2069060 CCTTT 11 A TGGCA 11|11 0.09482 0.59809 0.942
19 2069767 CTCAG 5 AACA ACTAG 5|5 0 1 1.1667
19 2074245 GAGAT 5 CAAA AAAAC 5|5 0.069767 0.29665 0.98364
19 2074718 GCTAA 11 T CCCCT 11|11 0 1 1.1887
19 2075031 TTTTC 15 T GAGAC 15|15 0.64407 0.14965 1.3468
19 2077400 GTCTC 11 A GAAAA 11|11 0.46667 0.21991 0.92363
19 2077787 AATCC 13 T GAGAC 13|13 0.16667 0.51553 0.95525
19 2080383 ACCTC 11 T GAGGC 11|11 0.16667 0.19761 1.2449
19 2083003 ATTTT 5 TTG TATTT 5|4 0.37014 0.12715 2.0027
19 2084890 CCTAG 10 T ATTTA 10|10 0 1 1.2115
19 2090669 CGTTA 11 T AAGAC 11|11 0.0625 0.30973 0.9292
19 2092000 AACAG 10 A GAGAA 10|10 0.076923 0.33953 0.8227
19 2093160 GGGAA 13 T CGAGA 13|13 0.24561 0.60897 0.93574
19 2095176 CTCAA 7 AAAC AAAAA 7|7 0.14815 0.6567 0.88025
19 2095756 TGTTC 12 T CCCCC 12|12 0.14894 0.55354 0.94251
19 2103219 TTATT 11 A TGGGA 11|11 0.25 0.10777 3.3947
19 2125834 GCTAG 11 A CAAAA 12|12 0.27338 0.37271 0.80967
19 2133510 CCAGC 12 T GAGAC 12|12 0.086957 0.61346 0.92019
19 2137313 ATGTG 13 T CTTGG 13|13 0.13333 0.31937 0.8748
19 2140479 TTGAC 13 T GAGAC 13|14 0.32558 0.32124 0.84324
19 2142754 TTCTG 12 T CTTTT 11|11 0.13481 0.58299 0.94176
19 2143763 GCCTC 10 A TAAAA 10|10 0.25 0.19776 1.1326
19 2144904 TCCAA 5 AAAC AAAAC 5|5 0 1 1
19 2145758 ACCTC 6 CA CCAAG 6|6 0 1 1.0161
19 2147882 GTCTC 11 A GAAAA 11|11 0.17647 0.57594 0.95484
19 2152539 CGTCT 10 A GAGAG 10|10 0.045195 0.36771 0.82735
19 2157097 CCCAT 5 CATC CACCC 5|5 0.20134 0.45914 0.90394
19 2158999 TATAA 11 T AATGG 11|10 0.27869 0.31511 0.90236
19 2159210 TTCCA 5 TC TTTTT 5|5 0.34545 0.16105 1.2683
19 2162681 CACTC 10 A CCCCA 10|9 0.47898 0.13651 1.4333
19 2166076 ATTTA 5 TATGT TGTTA 5|5 0 1 1.1053
19 2172506 CTTTC 13 T GAAGA 13|13 0.2439 0.35051 0.81787
19 2177298 AACAA 12 T GGGAG 12|12 0.16 0.17247 1.2073
19 2183629 AGTGC 14 T AAGGC 14|14 0.26087 0.21288 1.0317
19 2467148 AAAAC 10 A CAAAA 7|7 0.31579 0.62109 0.85063
19 2467159 AAAAC 10 A CAAAA 7|7 0.31579 0.62109 0.86953
19 2532160 AAAAC 11 A CAAAA 7|7 0.31579 0.62109 0.88929
19 2567130 AAAAC 10 A CAAAA 7|7 0.31579 0.62109 0.90997
Visual comparison (v1.2.0 - left, v1.2.2 - right):
Looks like major differences are numerical, but we have 5 new lines (as you can see on attached screenshot).
So we have 2 questions:
- Are these differences expected?
- Should we rescan our reference genomes every new msisensor version?
Thank you very much for your report. In versions prior to v1.2.2, if the repeat count of a read at a given site was less than the minimum repeat count (which can be set via the command line), the read would be filtered out, which is an unreasonable operation. As shown in your screenshot, all these records added in v1.2.2 have genotypes with fewer than 10 repeats of A repeat.
@Stikus For each version, we recommend that you use the same scan and msi, pro commands. Additionally, the latest version of msisensor-pro is v1.3.0. This version will be a long-term support version, and we welcome you to test it.
We'll test v1.3.0 today, when we test it we got error about same genome for scan and msi commands and decided to downgrade to 1.2.2 first. We completely forgot about separate scan step - now I understand it.
Thanks for detailed answer - I'll post v1.3.0 results here if you don't mind.
We'll test v1.3.0 today, when we test it we got error about same genome for
scanandmsicommands and decided to downgrade to 1.2.2 first. We completely forgot about separatescanstep - now I understand it.Thanks for detailed answer - I'll post v1.3.0 results here if you don't mind.
I would be very grateful. Please feel free to post here.
Can you help me with new naming?
Before we have "${outputPrefix}_somatic" and "${outputPrefix}_germline" - now we have "${outputPrefix}_unstable" and "${outputPrefix}_all". Is this correct match?
According to https://github.com/xjtu-omics/msisensor-pro/blob/35c39bb122e579c8fe97d405990a3d126030b850/cpp/sample.cpp#L40 - looks like correct.
Can you help me with new naming?
Before we have
"${outputPrefix}_somatic"and"${outputPrefix}_germline"- now we have"${outputPrefix}_unstable"and"${outputPrefix}_all". Is this correct match?According to
https://github.com/xjtu-omics/msisensor-pro/blob/35c39bb122e579c8fe97d405990a3d126030b850/cpp/sample.cpp#L40
- looks like correct.
Yes, to align with the results produced by the pro command, the _all table includes all microsatellite sites with sufficient coverage, while the _unstable table indicates unstable sites, meaning sites where somatic mutations have occurred.
Here is a comparison (v1.3.0 - left, v1.2.2 - right):
Total loading windows: 5812
Total loading homopolymer and microsatellites: 2678803
*** Summary information ***
Number of total sites: 2678803
Number of sites with enough coverage: 116
Number of MSI sites: 0
As you can see - genotype field is missing (but header have leftover tab - here is the reason) - is this expected?
New output contain twice more lines - 116 instead of 63.
Here is a comparison (
v1.3.0- left,v1.2.2- right):Total loading windows: 5812 Total loading homopolymer and microsatellites: 2678803 *** Summary information *** Number of total sites: 2678803 Number of sites with enough coverage: 116 Number of MSI sites: 0As you can see -
genotypefield is missing (but header have leftover tab - here is the reason) - is this expected? New output contain twice more lines - 116 instead of 63.
Yes, we removed the display of genotypes because many sites have undergone somatic mutations, making the genotypes here inaccurate. Additionally, for those sites, we adjusted the default minimum repeat count for homopolymers from 10 to 8, as in our experience, homopolymers with repeat counts of 8 and 9 also have many somatic mutations. The tab in the header is not as expected; I will remove it.
Correct me if I'm wrong - all new lines (green on screen) have repeat count 8 or 9. But old lines have repeat counts 5,6,7 and 10,11,12. So - how we get 5,6,7 in results if default was lowered from 10 to 8?
Upd. Or low repeat counts can be only for repeats with length more than one? So resulting length is above threshold?
Correct me if I'm wrong - all new lines (green on screen) have repeat count 8 or 9. But old lines have repeat counts 5,6,7 and 10,11,12. So - how we get 5,6,7 in results if default was lowered from 10 to 8?
Upd. Or low repeat counts can be only for repeats with length more than one? So resulting length is above threshold?
In the scan command, the -l and -r parameters control the minimum repeat count for single nucleotide repeats and other sites (where the repeat unit length is greater than 2), respectively. The default values for -l and -r are 8 and 5, respectively.
Great thx for answers, we'll use latest version now. The issue can be closed, I think.
FYI - we are using your tool with latest samtools without any problems with this patch before make:
sed -i "s|\$(realpath ../vendor/htslib-1.11)|$SOFT/samtools-$SAMTOOLS_VERSION-src/htslib-$SAMTOOLS_VERSION|" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile"
Our full build part from Dockerfile:
ENV MSISENSOR_PRO_VERSION='1.3.0'
RUN cd "$SOFT" \
&& wget -q "https://github.com/xjtu-omics/msisensor-pro/archive/refs/tags/v${MSISENSOR_PRO_VERSION}.tar.gz" -O "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz" \
&& tar -xzf "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz" \
&& mv "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src" \
&& cd "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp" \
&& sed -i "s|\$(realpath ../vendor/htslib-1.11)|$SOFT/samtools-$SAMTOOLS_VERSION-src/htslib-$SAMTOOLS_VERSION|" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile" \
&& sed -ie '/export LD_LIBRARY_PATH/s/^/#/' "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile" \
&& make -j"$(($(nproc)+1))" \
&& mkdir -p "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}/bin" \
&& cp "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/msisensor-pro" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}/bin" \
&& cd "$SOFT" \
&& rm -r "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src" \
&& rm "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz"
Thank you very much for providing the Docker file and for testing it. I will keep this issue open for a while; if you have any other questions, please feel free to post.
