
New command 'make label': assign a label to a query draft assembly based on the best hits from COBS

Open jorgeavilacartes opened this issue 4 months ago • 0 comments

Hello,

In this pull request, I included:

  1. A modification of the Snakefile to support a larger number of input files. With ~400 query files the pipeline crashed because the name built by concatenating their filenames was too long, so I modified the get_filename_for_all_queries() function to return a fixed string instead. See here
  2. "fna" was added to the list of accepted extensions, since this is the default format of assemblies downloaded from NCBI (with ncbi-datasets).
  3. Scripts and files to assign a species-level label to a query draft assembly.
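
To illustrate why item 1 matters, here is a minimal sketch rather than the actual Snakefile code. The separator, the fixed string "all_queries", the extension list, and the helper names strip_known_extension() and joined_name() are all assumptions made for the example; only get_filename_for_all_queries() and the "fna" extension come from this pull request.

```python
from pathlib import Path

# Assumed list of accepted extensions; the real Snakefile may differ.
ACCEPTED_EXTENSIONS = (".fa", ".fasta", ".fq", ".fastq", ".fna")


def strip_known_extension(filename):
    """Return the base name without an optional .gz and a known extension."""
    name = Path(filename).name
    if name.endswith(".gz"):
        name = name[:-3]
    for ext in ACCEPTED_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def joined_name(query_files, sep="__"):
    """Assumed old behaviour: one name made of all query names joined together.

    The result grows with the number of inputs and soon exceeds the
    255-byte filename limit common on Linux filesystems.
    """
    return sep.join(strip_known_extension(f) for f in query_files)


def get_filename_for_all_queries(query_files):
    """Behaviour described in this PR: a fixed string, whatever the input size.

    The value "all_queries" is a placeholder chosen for this sketch.
    """
    return "all_queries"


if __name__ == "__main__":
    files = [f"queries/GCF_{i:09d}.fna" for i in range(400)]
    print(len(joined_name(files)))              # far above 255 characters
    print(get_filename_for_all_queries(files))  # constant length
```

The trade-off is that the output name no longer records which queries were used, so that information has to be recovered from the inputs themselves.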

How are labels assigned to a query draft assembly? Each contig in a draft assembly is treated as a separate query, so I parse the output file from intermediate/04_filter to collect all hits for each assembly (i.e. the collection of hits of its contigs).
Each hit (represented by the sampleID of an assembly) is mapped to its label, and the query assembly is assigned the most common label among its hits.
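
The majority vote described above can be sketched as follows. This is an illustration under assumptions, not the script from this pull request: it takes the hits as an already-parsed list of sampleIDs (parsing of the intermediate/04_filter output is not shown), and the function name assign_label(), the sample IDs, and the species names are made up for the example.

```python
from collections import Counter


def assign_label(hit_sample_ids, label_by_sample_id):
    """Return the most common label among the hits of one query assembly.

    hit_sample_ids: sampleIDs hit by any contig of the assembly
                    (repeats are allowed and count as separate votes).
    label_by_sample_id: dict mapping sampleID -> species label.
    Hits without a known label are ignored; None is returned when no
    labelled hit is left.
    """
    labels = [
        label_by_sample_id[sample_id]
        for sample_id in hit_sample_ids
        if sample_id in label_by_sample_id
    ]
    if not labels:
        return None
    # On a tie, Counter.most_common() keeps the label that was seen first.
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    label_by_sample_id = {
        "SAMPLE_A": "Escherichia coli",
        "SAMPLE_B": "Escherichia coli",
        "SAMPLE_C": "Klebsiella pneumoniae",
    }
    hits = ["SAMPLE_A", "SAMPLE_C", "SAMPLE_B"]
    print(assign_label(hits, label_by_sample_id))
```

Counting repeated hits as separate votes is a design choice of this sketch; deduplicating the sampleIDs first would weight each database assembly equally regardless of how many contigs hit it.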

The labels are taken from the second column of the Kraken/Bracken (most abundant species) file that was used to create the clusters. The file data/labels_krakenbracken_by_sampleid.txt has been added to the repository.

NOTE: these modifications do not interfere with the main pipeline, since make label is run as a separate step after make match. See the updated README.

jorgeavilacartes • Feb 28 '24 16:02