Need documentation on cgMLST

Open: lskatz opened this issue 2 years ago • 1 comment

I am taking some notes on how I ran cgMLST, and I hope you can add documentation for it.

Create database: this took a very long time

# Downloaded the cgMLST scheme from enterobase FTP into Salmonella.cgMLSTv2.enterobase (undocumented)
\ls -f1 Salmonella.cgMLSTv2.enterobase/*.fasta | \
  grep -v cgMLST_v2_ref.fasta `# ignore already-established reference file` | \
  xargs -n 1 seqtk seq -l 0 `# one file per call since seqtk seq reads a single input file, output in two-line fasta format` | \
  perl -lane '
    # get the id (with its leading ">") and the seq on the next line, since this is two-line fasta format
    $id=$F[0];
    $seq=<>;
    chomp($seq);
    # This probably will not matter, but quit if the same id shows up twice to avoid any infinite loop.
    # %seen is a package global (no "my") so that it persists across input lines.
    if($seen{$id}++){print STDERR "Already seen $id. Done."; last;}

    # Avoid deflines that might be problematic
    if($id =~ /[^_>0-9a-zA-Z]/){
      print STDERR "Skipping ".$id; 
      next;
    } 
    print "$id\n$seq";
  ' > enterobase.filtered.fasta
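
A quick sanity check on the filtered file can be sketched like this. It is only a rough sketch: `check_two_line_fasta` is a hypothetical helper name I made up, not part of EToKi or seqtk, and it assumes the file is strictly two-line fasta as produced above. It counts records, lines that are out of place, and repeated ids.

```shell
# Hypothetical helper (not part of EToKi or seqtk): summarize a two-line fasta file.
# Assumes odd-numbered lines are deflines and even-numbered lines are sequences.
# Usage: check_two_line_fasta enterobase.filtered.fasta
check_two_line_fasta() {
  awk '
    NR % 2 == 1 {
      n++                              # every odd line is counted as one record
      if ($0 !~ /^>/)      bad++       # defline expected here
      else if (seen[$1]++) dup++       # same id already seen earlier
    }
    NR % 2 == 0 {
      if ($0 ~ /^>/) bad++             # sequence expected here, found a defline
    }
    END {
      printf "records=%d malformed=%d duplicate_ids=%d\n", n, bad + 0, dup + 0
    }
  ' "$1"
}
```

If `malformed` or `duplicate_ids` is not 0, it is probably worth looking at the filtered file before starting the long database step.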

lskatz, May 05 '22 14:05

I need this documentation too. I downloaded the cgMLST scheme for E. coli and tried to create the database. The command ran for 4 days, but its machine time was only about 1.2 hours, and the machine time almost stopped increasing once it got close to 1.2 hours. So I had to stop the database-creation command.

verylili, Jul 20 '23 07:07