Error between min and max read_len in log file
Hi
Thanks for developping this tools.
I just start using it, and I would like to share few suggestions :
- in the log file, there is a sentence like
[store_to_db:292] Stored Reads statistics to DB:
all_reads_count= 124368 all_reads_len= 18211881 min_read_len= 150 max_read_len= 25 total_aligned= 0 total_aligned_id= 0 total_aligned_cov= 0 total_aligned_id_cov= 0 total_denovo= 0 num_short= 0 reads_matched_per_db= TODO is_stats_calc= 0 is_total_reads_mapped_cov= 0
I guess that there is a minor code error where min_read_len and max_read_len are inverted.
And reads_matched_per_db= TODO would be effectively a good idea even if now you recommand only one reference database. This brings me to the second point
- Now that you share unique reference for all Kingdom and rRNA type (16S 18S 23S, 5S 5.8S), it is not possible to know from which Kingdom / rRNA type the rRNA reads came from. Reference taxonomies are kept in the description of reference fasta sequences so we can cross blast file and these taxonomies to see the Kingdom (Phylum ....) distribution but not the rRNA type. Do you think that it would be possible to keep rRNA type in the reference fasta description, and (with more programming effort), do you plan to provide a summary of Kingdom (Phylum ...) and rRNA type distribution automatically ? It would be very nice !!
Thanks again for the tool.
Maria
Hi @mariabernard , I will let @biocodz or @ekopylova answer you about the future of SortMeRNA.
I would just like to point out that, before, when we provided separate databases for each kingdom and rRNA type, it was really wrong to use the results as a classification. SortMeRNA is not designed to perform classification when provided with multiple databases. Instead, it would try very hard to match a read to the first database, then the second, etc...
This is one of the reasons we decided to provide merged databases.
However, now that the default usage should be to use a single merged database, you can approximate a classification (SMR is still not a classifier) by using the blast or SAM alignments to identify the best (or multiple best) match(es) for each read, which can give you an estimate classification.
Also, you can already deduce the rRNA type in the provided reference fasta description.
E.g: >SILVA_138_SSURef_NR99;LC109079.1.1742 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Trebouxiophyceae;uncultured eukaryote;size=1
This is Silva, SSU, Eukaryota, thus 18S rRNA
Another example: >RFAM_14.1_RF00001_5S_rRNA;M86547.1/142-262 is 5S rRNA
I hope this helps Pierre
Indeed, my purpose was not to classify reads, but to have an idea of efficiency of ribodepletion kit for a particular taxon and / or rRNA type.
Thanks to point me out the rRNA type based on db_name and Kingdom.
have a nice day
Maria
Thank you for the input. We'll fix the min/max readlen error in the next release.
Renaming this issue as the classification part of your suggestions has been split to #378.