metaGEM icon indicating copy to clipboard operation
metaGEM copied to clipboard

improvement: use `wc` to count bp instead of summing contig header info for abundance

Open franciscozorrilla opened this issue 1 year ago • 1 comments

For abundance calculation we need to get bp of genome, currently we do this based on the contig headers info, although this can become problematic for genomes assembled by different tools (e.g. shovill renames them iirc). Replace this with a simple wc command to count bases in fasta

https://github.com/franciscozorrilla/metaGEM/blob/a4daea0be08743e277a3474f2b7967892ce6f1ec/workflow/Snakefile#L1118-L1124

franciscozorrilla avatar Apr 12 '23 10:04 franciscozorrilla

Temporary workaround to deal with megahit, shovil, and metaGEM outputs

# Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)
            if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]] || [[ $bin == *.s.fa ]] || [[ $bin == *.p.fa ]];then
                less $bin |grep ">"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            elif [[ $bin == *.shovill.fa ]];then
                less $bin |grep ">"|cut -d ' ' -f2|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            elif [[ $bin == *megahit.fa ]];then
                less $bin |grep ">"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            else
                less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            fi

Rather than keep extending this for each possible case, just use wc to count bp after removing newlines

genome.fa|grep -v '>'|tr -d '\n'|wc -m

franciscozorrilla avatar Apr 12 '23 12:04 franciscozorrilla