Too many sequences below identity

Open ksahlin opened this issue 7 years ago • 5 comments

Hi again,

I tried running MeShClust on 500 simulated sequences, each about 900 nucleotides long. Most of the sequences are highly similar (edit distances of 1-20 bp), but a small portion may have a high error rate, roughly PacBio's error rate of 10-15%. This is supposed to mimic PacBio Iso-Seq data. Any idea how I should run MeShClust on such a dataset? Is it suitable for such sequences?

Thanks for your help!

[ksahlin@desmond bin]$ ./meshclust /nfs/brubeck.bx.psu.edu/scratch6/ksahlin/IsoCon_paper_n_10000/pacbio_reads/MEMBER_EXPERIMENT/TSPY13P_8_exponential_0.0001_500_1.fa --output ~/tmp/MESHCLUST/TSPY.clstr
avg length: 915
Recommended K: 4
Reading in sequences [=================================================] 100 %
Using 8 bit histograms
Counting 4-mers [======================================================] 100 %
Splitting data
Point pairs: 38
Sorting data [=========================================================] 100 %
Warning: Alignment may be too large for sampling
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:10;exons:1,2,3,4,5,6:copy14_read_170_error_rate_0.010857763300760043_total_errors_10
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:10;exons:1,2,3,4,5,6:copy5_read_242_error_rate_0.001092896174863388_total_errors_1
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:8;exons:1,2,3,4,5,6:copy34_read_418_error_rate_0.003278688524590164_total_errors_3
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:8;exons:1,2,3,4,5,6:copy65_read_267_error_rate_0.002185792349726776_total_errors_2
Alignment [============================================================] 100 %
positive=0 negative=986
Identity value does not match sampled data: Too many sequences below identity

ksahlin, Mar 23 '18 00:03

Since the dataset is very small (500 sequences), an easy workaround for now may be to provide alignment scores via --align instead of using the classification, which doesn't seem to work for your case.

It seems highly peculiar that all the sequences with very high similarity are showing up as negatives. If you don't mind, can I have a sample of the data?

benjamin-james, Mar 23 '18 18:03

Ok, thanks! Attached is the full simulated dataset of 500 sequences. I'll try a larger simulation and let you know. TSPY_simulated_500.txt

ksahlin, Mar 23 '18 20:03

I get the same error message for datasets with the same simulation parameters but with 2500 and 12500 sequences as well.

When I try with the --align parameter, I get the following message:

avg length: 915
Recommended K: 4
Reading in sequences [=================================================] 100 %
Using 8 bit histograms
Counting 4-mers [======================================================] 100 %
Adding combo 1
new single feature 1
error: list not sorted                                                 ] 0 %
terminate called after throwing an instance of 'int'
Aborted

ksahlin, Mar 23 '18 21:03

Ok, I had success: I manually set the identity parameter to a value above 0.9 and it worked. The alignment error message is a known bug, and I have been working on it.
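As a rough sanity check on that threshold, the simulated read headers in the log above embed each read's error rate, so an approximate identity to the error-free copy can be read straight from the names. The sketch below is only an illustration under two assumptions: that every header follows the `error_rate_<x>_total_errors_<n>` naming seen in the log, and that identity is roughly `1 - error_rate` (a real alignment identity may differ). The function names and the regular expression are hypothetical helpers, not part of MeShClust.

```python
import re

# Assumption: headers follow the naming scheme visible in the log above.
HEADER_RE = re.compile(r"error_rate_(?P<rate>[0-9.]+)_total_errors_(?P<errors>\d+)")


def approx_identity(header: str) -> float:
    """Estimate identity to the error-free copy as 1 - simulated error rate."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError(f"no error rate found in header: {header!r}")
    return 1.0 - float(match.group("rate"))


def passes_threshold(header: str, threshold: float = 0.9) -> bool:
    """True if the estimated identity is at or above the chosen threshold."""
    return approx_identity(header) >= threshold


# Headers copied from the "Before Pair" lines in the log.
headers = [
    ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0",
    ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy14_read_170_error_rate_0.010857763300760043_total_errors_10",
    ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy5_read_242_error_rate_0.001092896174863388_total_errors_1",
    ">TSPY13P;member:8;exons:1,2,3,4,5,6:copy34_read_418_error_rate_0.003278688524590164_total_errors_3",
]

for header in headers:
    identity = approx_identity(header)
    print(f"{identity:.4f}  passes 0.9: {passes_threshold(header)}")
```

Under this approximation, the four reads shown in the log all sit well above 0.9, whereas a read simulated at a 10-15% error rate would fall at or below it.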

benjamin-james, Mar 23 '18 21:03

Great, that works. Thanks!

ksahlin, Mar 23 '18 21:03