Testing environment
Current tests are end-to-end integration tests that make sure the scripts execute successfully. Much more testing could be done, including:
- [x] adding unit tests to the tests folder (lots can be copied from CMash/MinHash.py)
- [x] add additional unit tests as code coverage is rather low at this point
- [ ] test validity of Query module results
- [ ] automate script tests so manual checking is no longer required.
- [x] brute force calculate containment indices and check that results of all scripts are within acceptable error ranges.
- [ ] add tests for the multiple k-mer size features in test_scripts
- [ ] Set up Travis CI
Some unit tests are in. The MinHash module tests check the validity of results. The Query module tests only really check for code-breaking errors at this point, as there are a lot of FIXMEs and TODOs.
Will need to:
- [ ] test validity of Query module results
- [ ] automate script tests so manual checking is no longer required.
- [ ] brute force calculate containment indices and check that results of all scripts are within acceptable error ranges.
Will be tagging this as help wanted and assigning everyone, since all are welcome to contribute.
SOP: create new branch:
git checkout master
git pull origin master # make sure code is up to date
git checkout -b <some_feature_branch_name> # create a new branch implementing a new testing feature
# add your new feature
git commit -a # commit your contributions
git push origin <some_feature_branch_name> # push your changes to your feature branch
# then request a code review before merging to master
Note: while I assigned everyone, this is mainly a QOL (quality of life) issue: things that will make our future contributions easier but should not distract from the main projects, i.e., work on it as time permits.
@dkoslicki Make sure GroundTruth.py treats a k-mer and its reverse complement (rc-kmer) as the same k-mer, not counting them as distinct.
./run_small_tests.sh
,k=10,k=12,k=14,k=16,k=18,k=20
taxid_1192839_4_genomic.fna.gz,1.0,1.0,1.0,1.0,1.0,1.0
taxid_28901_877_genomic.fna.gz,1.0,0.786,0.416,0.332,0.294,0.274
Ground truth on server since takes a fair bit of memory:
import CMash.GroundTruth as G
query_file="/data/dmk333/repos/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
training_file="/data/dmk333/repos/CMash/tests/script_tests/TrainingDatabase.h5"
g = G.TrueContainment(training_file, "10-21-2")
df = g.return_containment_data_frame(query_file, -1, .1)
print(df)

| genome | k=10 | k=12 | k=14 | k=16 | k=18 | k=20 |
|---|---|---|---|---|---|---|
| taxid_1192839_4_genomic.fna.gz | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| taxid_28901_877_genomic.fna.gz | 0.970794 | 0.648166 | 0.404911 | 0.336364 | 0.303958 | 0.279067 |

Well that looks pretty nice to me!
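For anyone writing tests against the multiple k-mer size feature: the "10-21-2" argument above produced columns k=10 through k=20, and "8-22-2" elsewhere in this thread produces k=8 through k=22, so the string appears to mean start-end-step with an inclusive end. A minimal sketch of that reading (the helper name `parse_k_range` is hypothetical and the behaviour is inferred from the printed columns, not from the CMash source):

```python
def parse_k_range(spec):
    """Expand a 'start-end-step' string into a list of k values.

    Assumption: the end value is inclusive, as the column headers
    printed in this thread suggest.
    """
    start, end, step = (int(part) for part in spec.split("-"))
    return list(range(start, end + 1, step))


print(parse_k_range("10-21-2"))  # [10, 12, 14, 16, 18, 20]
print(parse_k_range("8-22-2"))   # [8, 10, 12, 14, 16, 18, 20, 22]
```

A test for test_scripts could compare the columns of the returned data frame against a list built this way.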
Switched to canonical k-mers to sanity check things, results basically unchanged:
Ground truth on server since takes a fair bit of memory

| genome | k=10 | k=12 | k=14 | k=16 | k=18 | k=20 |
|---|---|---|---|---|---|---|
| taxid_1192839_4_genomic.fna.gz | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| taxid_28901_877_genomic.fna.gz | 0.970735 | 0.64816 | 0.404912 | 0.336364 | 0.303959 | 0.279068 |

So we'll be sticking with canonical k-mers for the ground truth as it's much more straightforward to understand.
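For reference, a minimal sketch of what a brute-force containment computation on canonical k-mers can look like. The helper names (`canonical`, `canonical_kmers`, `containment`) are made up for illustration and are not the functions in GroundTruth.py, and the sketch assumes short in-memory strings over plain ACGT rather than the gzipped FASTA files used here.

```python
# Illustrative only: these helpers are not part of CMash.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq, k):
    """Collect the set of canonical k-mers of an (upper-cased) DNA string."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def containment(seq_a, seq_b, k):
    """Brute-force containment index |A ∩ B| / |A| over canonical k-mers."""
    set_a = canonical_kmers(seq_a, k)
    set_b = canonical_kmers(seq_b, k)
    return len(set_a & set_b) / len(set_a)


seq = "ACGTTGCAAC"
seq_rc = seq.translate(_COMPLEMENT)[::-1]
# A sequence and its reverse complement share every canonical k-mer.
print(containment(seq, seq_rc, 4))  # 1.0
```

With canonical k-mers a strand flip cannot change the result, which is part of why they are easier to reason about in a ground truth.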
Note to self @dkoslicki: something odd is happening at small k-mer sizes: using run_comparison_to_ground_truth.sh via GroundTruth.py, in __return_containment_index:
return len(set1.intersection(set2)) / float(len(set1))
seems correct, but
return len(set1.intersection(set2)) / float(len(set2))
returns accurate results at small k-mer sizes, e.g.:
import CMash.GroundTruth as G
training_database_file = "/home/dkoslicki/Desktop/CMash/tests/script_tests/TrainingDatabase.h5"
query_file1 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
query_file2 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_562_8705_genomic.fna.gz"
g = G.TrueContainment(training_database_file, "4-6-1")
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file1][4]))
1.0
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file2][4]))
0.3056179775280899
And the StreamingQueryDNADatabase.py is returning a 1 (not the 0.3056).
Clearly, at k=4 the k-mer set of query_file2 contains all of query_file1's and is roughly three times its size, but why are the results OK at larger k-mer sizes?
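One check the two numbers above allow, offered as a hedged observation rather than a diagnosis: 0.3056179775280899 is 136/445 in lowest terms, and 136 is exactly the number of canonical 4-mers over ACGT. Since the denominator must then be a multiple of 445, the second set would hold more distinct 4-mers than the 4^4 = 256 strings possible over ACGT, which would suggest (an assumption, not verified against the files) that it includes k-mers containing other characters such as N. The sketch below only reproduces the arithmetic; the `canonical` helper is illustrative, not CMash code.

```python
from fractions import Fraction
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


# Every canonical 4-mer over the plain ACGT alphabet.
canonical_4mers = {canonical("".join(p)) for p in product("ACGT", repeat=4)}
print(len(canonical_4mers))  # 136

# Recover the small-denominator fraction behind the reported ratio.
ratio = Fraction(0.3056179775280899).limit_denominator(1000)
print(ratio)  # 136/445
```

If that reading holds, set1 is saturated at k=4 (every canonical 4-mer is present), so a containment of 1.0 with set1 as the denominator carries little information at that k-mer size.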
Oh yeah, and StreamingQueryDNADatabase.py uses a heck of a lot of memory for small k-mer sizes. Probably khmer or screed's fault, but that's TBD.
Regarding the direction of containment, I think the committed way is best. With set1 as the denominator:
Total error per k-mer size:
k=8 0.043016
k=10 0.354925
k=12 2.485572
k=14 0.690597
k=16 0.161794
k=18 0.076439
k=20 0.035385
k=22 0.008816
dtype: float64
With set2 as the denominator:
Total error per k-mer size:
k=8 0.168598
k=10 2.173376
k=12 3.924027
k=14 0.832073
k=16 0.140583
k=18 0.047191
k=20 0.009990
k=22 0.018207
dtype: float64
But clearly something is up with k=12. Odd...
This is using run_comparison_to_ground_truth.sh with:
testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="8-${maxK}-2"
numHashes=10000
containmentThresh=0
locationOfThresh=-1
Absolute difference between the true and CMash containment values, |true - CMash|:
| genome | k=8 | k=10 | k=12 | k=14 | k=16 | k=18 | k=20 | k=22 |
|---|---|---|---|---|---|---|---|---|
| taxid_1192839_4_genomic.fna.gz | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 |
| taxid_1307_414_genomic.fna.gz | 0.000761 | 0.049504 | 0.257595 | 0.048064 | 0.003751 | 0.000108 | 2.923078e-07 | 3.920249e-05 |
| taxid_1311_236_genomic.fna.gz | 0.001250 | 0.050466 | 0.278542 | 0.050275 | 0.003800 | 0.000344 | 7.373714e-05 | 2.379639e-05 |
| taxid_1759312_genomic.fna.gz | 0.000639 | 0.034805 | 0.260666 | 0.058385 | 0.005820 | 0.000915 | 2.118953e-04 | 1.701001e-04 |
| taxid_2026799_87_genomic.fna.gz | 0.000761 | 0.045469 | 0.262687 | 0.055671 | 0.004839 | 0.000117 | 9.684609e-06 | 5.321341e-05 |
| taxid_2041488_genomic.fna.gz | 0.000067 | 0.024380 | 0.216208 | 0.039611 | 0.003973 | 0.000272 | 7.260736e-05 | 1.380767e-04 |
| taxid_28901_877_genomic.fna.gz | 0.000608 | 0.029265 | 0.257055 | 0.151288 | 0.081336 | 0.052041 | 2.663244e-02 | 6.756324e-03 |
| taxid_554168_genomic.fna.gz | 0.001172 | 0.043717 | 0.304057 | 0.059005 | 0.005468 | 0.000548 | 1.954086e-05 | 8.476430e-07 |
| taxid_562_8705_genomic.fna.gz | 0.027607 | 0.039054 | 0.315867 | 0.110230 | 0.026908 | 0.012463 | 4.603500e-03 | 1.080697e-03 |
| taxid_573_36_genomic.fna.gz | 0.010152 | 0.038264 | 0.332896 | 0.118068 | 0.025898 | 0.009632 | 3.761221e-03 | 5.540108e-04 |
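
The "Total error per k-mer size" figures reported earlier for set1 as the denominator appear to be the column sums of this table. A small sketch that checks this for two columns, using values copied from the table (plain Python here to stay dependency-free; with pandas data frames the equivalent would presumably be something like `(true_df - cmash_df).abs().sum()`, where `true_df` and `cmash_df` are hypothetical names):

```python
# Absolute errors copied from the k=8 and k=12 columns of the table above.
abs_error = {
    "k=8": [0.000000, 0.000761, 0.001250, 0.000639, 0.000761,
            0.000067, 0.000608, 0.001172, 0.027607, 0.010152],
    "k=12": [0.000000, 0.257595, 0.278542, 0.260666, 0.262687,
             0.216208, 0.257055, 0.304057, 0.315867, 0.332896],
}

# Total error per k-mer size is the sum over genomes of each column.
total_error = {k: sum(column) for k, column in abs_error.items()}

for k, total in total_error.items():
    print(f"{k} {total:.6f}")
```

The sums agree with the reported 0.043016 and 2.485572 up to the rounding of the displayed table entries, so an automated script test could assert that each column total stays below a chosen threshold.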

Now to test on a "real" metagenome...
And note, the problem appears to only be at k=12: with
testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="14-${maxK}-1"
numHashes=10000
containmentThresh=0
locationOfThresh=-1
we get

Will create new issue for ground truth containment computation so it will be easier to track progress on this.