EToKi
Need documentation on cgMLST
I am taking some notes on how I ran cgMLST, and I hope you can add documentation for it.
Create database: this took a very long time
# Downloaded the cgMLST scheme from enterobase FTP into Salmonella.cgMLSTv2.enterobase (undocumented)
\ls -f1 Salmonella.cgMLSTv2.enterobase/*.fasta | \
grep -v cgMLST_v2_ref.fasta `# ignore already-established reference file` | \
xargs -n 1 seqtk seq -l 0 `# convert each file to two-line fasta format; -n 1 because seqtk seq reads a single input file` | \
perl -lane '
# get the id (including the leading ">") and the seq on the next line since it is in a two-line fasta format
$id=$F[0];
$seq=<>;
chomp($seq);
# I do not think this will matter, but avoid any infinite loops by quitting if we see the same id twice.
# %seen is deliberately not declared with "my": under -n this code runs once per line, so a lexical would be reset each time.
if($seen{$id}++){print STDERR "Already seen $id. Done."; last;}
# Avoid deflines that might be problematic
if($id =~ /[^_>0-9a-zA-Z]/){
print STDERR "Skipping ".$id;
next;
}
print "$id\n$seq";
' > enterobase.filtered.fasta
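Before building the database from this file, it may be worth checking that the filter produced what was intended. Below is a minimal sketch, assuming the output file name used above. `check_two_line_fasta` is a hypothetical helper written for these notes and is not part of EToKi or seqtk. It counts records and fails if the file is not in strict two-line fasta format or if a defline id is repeated.

```shell
#!/usr/bin/env bash
# check_two_line_fasta: hypothetical helper, not part of EToKi or seqtk.
# Prints the record count; exits non-zero if the file is not strict
# two-line fasta or if the same defline id appears more than once.
check_two_line_fasta() {
  awk '
    NR % 2 == 1 {                      # odd lines must be deflines
      if ($0 !~ /^>/) { printf "line %d: expected defline\n", NR > "/dev/stderr"; bad = 1 }
      if (seen[$1]++) { printf "line %d: duplicate id %s\n", NR, $1 > "/dev/stderr"; bad = 1 }
      n++
      next
    }
    /^>/ || $0 == "" {                 # even lines must be non-empty sequence
      printf "line %d: expected sequence\n", NR > "/dev/stderr"; bad = 1
    }
    END {
      if (NR % 2 == 1) { print "file ends with a defline and no sequence" > "/dev/stderr"; bad = 1 }
      print n + 0, "records"
      exit bad
    }
  ' "$1"
}
```

After sourcing the function, it can be run as `check_two_line_fasta enterobase.filtered.fasta`. A non-zero exit status means the file should be inspected before it is passed on.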
I also need this documentation. I downloaded the cgMLST scheme for E. coli. When I tried to create the database, the command ran for 4 days but had used only about 1.2 hours of machine (CPU) time, and that figure almost stopped increasing once it got close to 1.2 hours. In the end I had to stop the command that creates the database.