SCITE icon indicating copy to clipboard operation
SCITE copied to clipboard

SCITE does not validate matrix shape and fails silently

Open winni2k opened this issue 3 years ago • 2 comments

First of all, thank you scite authors for making scite available open source to the scientific community. This is a useful tool, and I hope I can help make it better.

I suspect I just lost three days of work spent troubleshooting scite to an off-by-one error in specifying -n. If scite validated the shape of the input matrix, then I would have not lost this time. Therefore, I am writing up this detailed bug report/feature request to help drive home how much I care about solving this issue. I also see that someone else appears to have an issue that would be solved by input matrix shape validation (https://github.com/cbg-ethz/SCITE/issues/3), although it looks like that person was lucky enough to get a segmentation fault instead of silent failure.


I suspect scite does not properly validate matrix shape. When I run scite with input matrix test.txt

1 0 0
0 1 0
0 0 1

and the following command (note the incorrect number 2 used for -n) I get the follwing output:

$ ./scite -i test.txt -seed 0 -m 3 -n 2 -r 1 -l 100 -fd 0.01 -ad 0.05 -p 1 -s
At mcmc repetition 1/1, step 0: best tree score -1.79769e+308 and best beta 0.05 and best overall score -1.79769e+308
best log score for tree:        -0.02456538848592614
#optimal steps after burn-in:   75
total #steps after burn-in:     75
%optimal steps after burn-in:   1
samples from posterior written to: test.samples
1 opt trees
Time elapsed: 1.0780000000000001 ms

and the following estimated MAP trees:

$  cat test_map0.newick
(1,2)3

However, the correct MAP tree I think should be (1,2,3)4. That is what I get when I run with -n 3 instead:

$ ./scite -i test.txt -seed 0 -m 3 -n 3 -r 1 -l 100 -fd 0.01 -ad 0.05 -p 1 -s
At mcmc repetition 1/1, step 0: best tree score -1.79769e+308 and best beta 0.05 and best overall score -1.79769e+308
best log score for tree:        -0.17961309251237217
#optimal steps after burn-in:   62
total #steps after burn-in:     75
%optimal steps after burn-in:   0.82666666666666666
samples from posterior written to: test.samples
1 opt trees
Time elapsed: 2.3730000000000002 ms

My c++ has gotten a bit rusty, but from looking at the code responsible for parsing the matrix, I suspect what is happening is that the file stream extractor treats spaces and new lines the same: (https://github.com/cbg-ethz/SCITE/blob/2016b31c2d22dd0b678972f8a33aeca17914a967/findBestTrees.cpp#L376-L380)

To test this hypothesis, I created the following input matrix test2.txt

1
0
0
0
1
0
0
0
1

This runs through just fine:

$ ./scite -i test2.txt -seed 0 -m 3 -n 3 -r 1 -l 100 -fd 0.01 -ad 0.05 -p 1 -s
At mcmc repetition 1/1, step 0: best tree score -1.79769e+308 and best beta 0.05 and best overall score -1.79769e+308
best log score for tree:        -0.17961309251237217
#optimal steps after burn-in:   62
total #steps after burn-in:     75
%optimal steps after burn-in:   0.82666666666666666
samples from posterior written to: test2.samples
1 opt trees
Time elapsed: 1.8580000000000001 ms

And I get the expected output:

$ cat test2_map0.newick
(1,2,3)4

I think the correct thing for scite to do is to check every character of the input matrix (including new lines, spaces, and unused characters) to make sure that the input matrix has the correct shape.

As this code uses an OSI-approved license I would be happy to make the fix if coding time is hard to come by for the original authors of scite. Please just let me know if this bug fix/feature request would be of interest.

winni2k avatar Dec 31 '20 15:12 winni2k