sortmerna Error between min and max read

Hi

Thanks for developping this tools.

I just start using it, and I would like to share few suggestions :

in the log file, there is a sentence like

[store_to_db:292] Stored Reads statistics to DB:
     all_reads_count= 124368 all_reads_len= 18211881 min_read_len= 150 max_read_len= 25 total_aligned= 0 total_aligned_id= 0 total_aligned_cov= 0 total_aligned_id_cov= 0 total_denovo= 0 num_short= 0 reads_matched_per_db= TODO is_stats_calc= 0 is_total_reads_mapped_cov= 0

I guess that there is a minor code error where min_read_len and max_read_len are inverted. And reads_matched_per_db= TODO would be effectively a good idea even if now you recommand only one reference database. This brings me to the second point

Now that you share unique reference for all Kingdom and rRNA type (16S 18S 23S, 5S 5.8S), it is not possible to know from which Kingdom / rRNA type the rRNA reads came from. Reference taxonomies are kept in the description of reference fasta sequences so we can cross blast file and these taxonomies to see the Kingdom (Phylum ....) distribution but not the rRNA type. Do you think that it would be possible to keep rRNA type in the reference fasta description, and (with more programming effort), do you plan to provide a summary of Kingdom (Phylum ...) and rRNA type distribution automatically ? It would be very nice !!

Thanks again for the tool.

Maria

Oct 27 '21 09:10 mariabernard

Hi @mariabernard , I will let @biocodz or @ekopylova answer you about the future of SortMeRNA. I would just like to point out that, before, when we provided separate databases for each kingdom and rRNA type, it was really wrong to use the results as a classification. SortMeRNA is not designed to perform classification when provided with multiple databases. Instead, it would try very hard to match a read to the first database, then the second, etc... This is one of the reasons we decided to provide merged databases. However, now that the default usage should be to use a single merged database, you can approximate a classification (SMR is still not a classifier) by using the blast or SAM alignments to identify the best (or multiple best) match(es) for each read, which can give you an estimate classification. Also, you can already deduce the rRNA type in the provided reference fasta description. E.g: >SILVA_138_SSURef_NR99;LC109079.1.1742 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Trebouxiophyceae;uncultured eukaryote;size=1 This is Silva, SSU, Eukaryota, thus 18S rRNA Another example: >RFAM_14.1_RF00001_5S_rRNA;M86547.1/142-262 is 5S rRNA

I hope this helps Pierre

Oct 27 '21 09:10 ppericard

Indeed, my purpose was not to classify reads, but to have an idea of efficiency of ribodepletion kit for a particular taxon and / or rRNA type.

Thanks to point me out the rRNA type based on db_name and Kingdom.

have a nice day

Maria

Oct 27 '21 09:10 mariabernard

Thank you for the input. We'll fix the min/max readlen error in the next release.

Oct 28 '21 21:10 biocodz

Renaming this issue as the classification part of your suggestions has been split to #378.

Jun 22 '23 07:06 cmith44

Error between min and max read_len in log file