metaGEM
metaGEM copied to clipboard
improvement: use `wc` to count bp instead of summing contig header info for abundance
For abundance calculation we need to get bp of genome, currently we do this based on the contig headers info, although this can become problematic for genomes assembled by different tools (e.g. shovill renames them iirc). Replace this with a simple wc
command to count bases in fasta
https://github.com/franciscozorrilla/metaGEM/blob/a4daea0be08743e277a3474f2b7967892ce6f1ec/workflow/Snakefile#L1118-L1124
Temporary workaround to deal with megahit, shovil, and metaGEM outputs
# Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)
if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]] || [[ $bin == *.s.fa ]] || [[ $bin == *.p.fa ]];then
less $bin |grep ">"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
elif [[ $bin == *.shovill.fa ]];then
less $bin |grep ">"|cut -d ' ' -f2|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
elif [[ $bin == *megahit.fa ]];then
less $bin |grep ">"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
else
less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
fi
Rather than keep extending this for each possible case, just use wc
to count bp after removing newlines
genome.fa|grep -v '>'|tr -d '\n'|wc -m