CMash Testing environment

Current tests are end-to-end integration tests that make sure the scripts execute successfully. There is much more testing that could be done, including:

[x] adding unit tests to the tests folder (lots can be copied from CMash/MinHash.py)
[x] add additional unit tests as code coverage is rather low at this point
[ ] test validity of Query module results
[ ] automate script tests so manual checking is no longer required.
[x] brute force calculate containment indices and check that results of all scripts are within acceptable error ranges.
[ ] add tests for the multiple k-mer size features in test_scripts
[ ] Set up Travis CI
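
For the "automate script tests" item, a minimal sketch of what an automated check could look like is below. The helper names (`run_script`, `assert_script_succeeds`) and the stand-in command are hypothetical, not part of CMash; a real test would substitute the actual script invocation and its arguments.

```python
import subprocess
import sys


def run_script(cmd, timeout=600):
    """Run a command and capture its output (hypothetical helper, not in CMash)."""
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)


def assert_script_succeeds(cmd):
    """Fail loudly if the command exits non-zero; return its stdout otherwise."""
    result = run_script(cmd)
    assert result.returncode == 0, (
        f"{cmd} exited with {result.returncode}:\n{result.stderr}"
    )
    return result.stdout


# Stand-in command so the sketch runs anywhere; replace with the real script call.
output = assert_script_succeeds([sys.executable, "-c", "print('ok')"])
```

Checking only the exit code reproduces what the current integration tests do by hand; validating the contents of the output would be a separate step.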

Feb 06 '20 20:02 dkoslicki

Some unit tests are in. The MinHash module tests check validity of results. The Query module tests are only really checking for code-breaking errors at this point, as there are a lot of FIXMEs and TODOs.

Will need to:

[ ] test validity of Query module results
[ ] automate script tests so manual checking is no longer required.
[ ] brute force calculate containment indices and check that results of all scripts are within acceptable error ranges.
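
One possible shape for the "within acceptable error ranges" check, as a rough sketch: compare an estimate against the ground truth entry by entry and report anything outside a tolerance. The function name, the nested-dict layout, the file name, and the tolerance value are all assumptions made for illustration; the numbers are made up.

```python
def find_violations(truth, estimate, tolerance):
    """Return (genome, ksize, abs_error) for every entry outside the tolerance.

    Both inputs are assumed to be dicts of the form {genome: {ksize: containment}}.
    """
    violations = []
    for genome, row in truth.items():
        for ksize, true_val in row.items():
            err = abs(true_val - estimate[genome][ksize])
            if err > tolerance:
                violations.append((genome, ksize, err))
    return violations


# Toy values only, chosen to be exactly representable as floats.
truth = {"a.fna.gz": {10: 1.0, 12: 0.5}}
estimate = {"a.fna.gz": {10: 1.0, 12: 0.75}}
violations = find_violations(truth, estimate, tolerance=0.125)
```

A test would then assert that `violations` is empty, and print the list when it is not.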

Will be tagging this as help wanted and assigning everyone, since all are welcome to contribute.

SOP: create new branch:

git checkout master
git pull origin master  # make sure code is up to date
git checkout -b <some_feature_branch_name>  # create a new branch implementing a new testing feature
# add your new feature
git commit -a  # commit your contributions
git push origin <some_feature_branch_name>  # push your changes to your feature branch
# then request a code review before merging to master

Mar 23 '20 22:03 dkoslicki

Note: while I assigned everyone, this is mainly a QOL (quality of life) issue: these are things that will make future contributions easier, but they should not distract from main projects, i.e. work on them as time permits.

Mar 23 '20 22:03 dkoslicki

@dkoslicki Make sure GroundTruth.py is identifying kmers and rc-kmers, not counting them as distinct.
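
To make "identifying kmers and rc-kmers" concrete, here is a small sketch of canonical k-mers: each k-mer is represented by the lexicographically smaller of itself and its reverse complement. This is one common convention and not necessarily how GroundTruth.py does it; the function names are hypothetical, and the sketch assumes the sequence contains only A, C, G, T.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick the lexicographically smaller of a k-mer and its reverse complement."""
    kmer = kmer.upper()
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq, k):
    """Set of canonical k-mers in seq (assumes only A/C/G/T are present)."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
```

With this convention a sequence and its reverse complement give the same k-mer set, e.g. `canonical_kmers("ACGGT", 4) == canonical_kmers("ACCGT", 4)`.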

Mar 27 '20 05:03 dkoslicki

./run_small_tests.sh
,k=10,k=12,k=14,k=16,k=18,k=20
taxid_1192839_4_genomic.fna.gz,1.0,1.0,1.0,1.0,1.0,1.0
taxid_28901_877_genomic.fna.gz,1.0,0.786,0.416,0.332,0.294,0.274

Ground truth (run on the server, since it takes a fair bit of memory):

import CMash.GroundTruth as G
query_file="/data/dmk333/repos/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
training_file="/data/dmk333/repos/CMash/tests/script_tests/TrainingDatabase.h5"
g = G.TrueContainment(training_file, "10-21-2")
df = g.return_containment_data_frame(query_file, -1, .1)
print(df)

                                    k=10      k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970794  0.648166  0.404911  0.336364  0.303958  0.279067

Well that looks pretty nice to me!

Mar 27 '20 05:03 dkoslicki

Switched to canonical k-mers to sanity check things, results basically unchanged.

Ground truth (run on the server, since it takes a fair bit of memory):

                                    k=10     k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.00000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970735  0.64816  0.404912  0.336364  0.303959  0.279068

So we'll be sticking with canonical k-mers for the ground truth as it's much more straightforward to understand.

Mar 27 '20 15:03 dkoslicki

Note to self @dkoslicki: something odd is happening at small k-mer sizes: using run_comparison_to_ground_truth.sh via GroundTruth.py, in __return_containment_index:

return len(set1.intersection(set2)) / float(len(set1))

seems correct, but

return len(set1.intersection(set2)) / float(len(set2))

returns accurate small k-mer size results... eg.

import CMash.GroundTruth as G
training_database_file = "/home/dkoslicki/Desktop/CMash/tests/script_tests/TrainingDatabase.h5"
query_file1 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
query_file2 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_562_8705_genomic.fna.gz"
g = G.TrueContainment(training_database_file, "4-6-1")
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file1][4]))
1.0
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file2][4]))
0.3056179775280899

And StreamingQueryDNADatabase.py is returning a 1 (not the 0.3056). Clearly, query_file2 is basically three copies of query_file1 at k=4, but why are the results OK at higher k-mer sizes?
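
The 1.0 versus 0.3056 gap is what you would expect whenever one k-mer set is a subset of a larger one, because the containment index is not symmetric. A toy sketch (made-up k-mers, hypothetical function name, mirroring the set1-as-denominator return statement above):

```python
def containment_index(set1, set2):
    """|set1 & set2| / |set1|: the fraction of set1 that also appears in set2."""
    if not set1:
        raise ValueError("set1 must be non-empty")
    return len(set1 & set2) / float(len(set1))


# Toy data: `small` is fully contained in `large`, which is three times its size.
small = {"AAAA", "AAAC", "AAAG"}
large = small | {"AACA", "AACC", "AACG", "AAGA", "AAGC", "AAGG"}

forward = containment_index(small, large)   # every k-mer of small is in large
backward = containment_index(large, small)  # only a third of large is in small
```

Here `forward` is 1.0 and `backward` is one third, so swapping the denominator changes the question being asked rather than fixing an estimate.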

Oh yeah, and StreamingQueryDNADatabase.py uses a heck of a lot of memory for small k-mer sizes. Probably khmer or screed's fault, but that's TBD.

Mar 27 '20 20:03 dkoslicki

Regarding direction of containment, I think the committed way is best: set1 as denom

Total error per k-mer size:
k=8     0.043016
k=10    0.354925
k=12    2.485572
k=14    0.690597
k=16    0.161794
k=18    0.076439
k=20    0.035385
k=22    0.008816
dtype: float64

set2 as denom:

Total error per k-mer size:
k=8     0.168598
k=10    2.173376
k=12    3.924027
k=14    0.832073
k=16    0.140583
k=18    0.047191
k=20    0.009990
k=22    0.018207
dtype: float64

But clearly something is up with k=12. Odd... This is using run_comparison_to_ground_truth.sh with:

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="8-${maxK}-2"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

Absolute error |true - CMash| per genome and k-mer size:

genome	k=8	k=10	k=12	k=14	k=16	k=18	k=20	k=22
taxid_1192839_4_genomic.fna.gz	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000e+00	0.000000e+00
taxid_1307_414_genomic.fna.gz	0.000761	0.049504	0.257595	0.048064	0.003751	0.000108	2.923078e-07	3.920249e-05
taxid_1311_236_genomic.fna.gz	0.001250	0.050466	0.278542	0.050275	0.003800	0.000344	7.373714e-05	2.379639e-05
taxid_1759312_genomic.fna.gz	0.000639	0.034805	0.260666	0.058385	0.005820	0.000915	2.118953e-04	1.701001e-04
taxid_2026799_87_genomic.fna.gz	0.000761	0.045469	0.262687	0.055671	0.004839	0.000117	9.684609e-06	5.321341e-05
taxid_2041488_genomic.fna.gz	0.000067	0.024380	0.216208	0.039611	0.003973	0.000272	7.260736e-05	1.380767e-04
taxid_28901_877_genomic.fna.gz	0.000608	0.029265	0.257055	0.151288	0.081336	0.052041	2.663244e-02	6.756324e-03
taxid_554168_genomic.fna.gz	0.001172	0.043717	0.304057	0.059005	0.005468	0.000548	1.954086e-05	8.476430e-07
taxid_562_8705_genomic.fna.gz	0.027607	0.039054	0.315867	0.110230	0.026908	0.012463	4.603500e-03	1.080697e-03
taxid_573_36_genomic.fna.gz	0.010152	0.038264	0.332896	0.118068	0.025898	0.009632	3.761221e-03	5.540108e-04
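
The "Total error per k-mer size" figures above are consistent with column sums of this table (the k=12 column adds up to roughly 2.4856). A plain-Python sketch of that computation, with a hypothetical function name, made-up file names, and toy numbers rather than the real data:

```python
def total_error_per_ksize(truth, estimate):
    """Sum |truth - estimate| over all genomes, separately for each k-mer size.

    Both inputs are assumed to be dicts of the form {genome: {ksize_label: value}}.
    """
    totals = {}
    for genome, row in truth.items():
        for ksize, true_val in row.items():
            err = abs(true_val - estimate[genome][ksize])
            totals[ksize] = totals.get(ksize, 0.0) + err
    return totals


# Toy values only, chosen to be exactly representable as floats.
truth = {
    "g1.fna.gz": {"k=8": 1.0, "k=10": 1.0},
    "g2.fna.gz": {"k=8": 0.5, "k=10": 0.25},
}
estimate = {
    "g1.fna.gz": {"k=8": 1.0, "k=10": 1.0},
    "g2.fna.gz": {"k=8": 0.75, "k=10": 0.25},
}
totals = total_error_per_ksize(truth, estimate)
```

If both tables are pandas DataFrames with matching row and column labels, `(true_df - cmash_df).abs().sum()` should give the same per-column totals.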

[Image: Screenshot 2020-03-27 18 06 28]

Now to test on a "real" metagenome...

Mar 27 '20 21:03 dkoslicki

And note, the problem appears to only be at k=12: with

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="14-${maxK}-1"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

we get: [Image: Screenshot 2020-03-27 18 10 24]

Mar 27 '20 22:03 dkoslicki

Will create new issue for ground truth containment computation so it will be easier to track progress on this.

Apr 06 '20 19:04 dkoslicki