MMseqs2
Error: Fasta entry is invalid in createdb
Expected Behavior
I'm running mmseqs createdb on a large fasta file. I have checked that every entry is "valid" (only valid AA characters, can be read by biopython, there are no spaces between > and the accession ids). I expect successful db creation or graceful error handling.
Is there a way to ignore invalid fasta entries in db creation?
Current Behavior
Fails with error: "Fasta entry <entry> is invalid"
Steps to Reproduce (for bugs)
mmseqs createdb
MMseqs Output (for bugs)
Fasta entry <entry> is invalid
Your Environment
MMseqs2 Version: 13.45111 installed with conda
Does this issue also happen in release 14?
@milot-mirdita yes, i installed the latest version (from https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz ) and see the same error.
I think this error can happen if there is a space (or possibly other whitespace) after the > and before the accession.
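A quick way to test for the whitespace case described above is to search the headers directly. This is a minimal sketch, assuming the problem is a header where > is followed by whitespace or by nothing at all; the function name find_suspicious_headers and the file name input.fasta are placeholders, not part of MMseqs2.

```shell
# List header lines where '>' is followed by whitespace or by nothing.
# Prints "line_number:header" for each suspicious header and prints
# nothing if no such header is found.
find_suspicious_headers() {
    # '|| true' keeps the function from failing when grep finds no match.
    grep -nE '^>([[:space:]]|$)' "$1" || true
}

# Usage (input.fasta is a placeholder):
# find_suspicious_headers input.fasta
```

An empty result only rules out this one cause; other problems, such as a truncated file, would not be detected this way.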
@milot-mirdita i've seen that error here: https://github.com/soedinglab/MMseqs2/issues/170 and have checked to make sure there are no spaces after >
Then I am quite confused :D
Would it be possible to share the FASTA file that causes this error? If not, can you try to "bisect" the file and convert each half until you identify which entry might be broken?
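One step of the bisecting approach could look like the sketch below. It splits by entry count rather than by line count, so no record is cut in the middle. The function name split_fasta_halves and the output naming scheme are assumptions made for illustration.

```shell
# Split a FASTA file into two halves by number of entries.
# Writes <prefix>.1.fasta and <prefix>.2.fasta; the second file is not
# created if the input has fewer than two entries.
split_fasta_halves() {
    input=$1
    prefix=$2
    # Count header lines; '|| true' tolerates a count of zero.
    total=$(grep -c '^>' "$input" || true)
    half=$(( (total + 1) / 2 ))
    awk -v half="$half" -v p="$prefix" '
        /^>/ { n++ }                       # a new entry starts here
        n > 0 { print > (n <= half ? p ".1.fasta" : p ".2.fasta") }
    ' "$input"
}

# Usage (file and database names are placeholders):
# split_fasta_halves input.fasta half
# mmseqs createdb half.1.fasta half1_db
# mmseqs createdb half.2.fasta half2_db
```

Whichever half still fails can be split again, repeating until a single entry remains.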
i'll try the bisecting method, thanks @milot-mirdita ! will report back if i figure it out.
Any updates on what caused the issue? I'm running into the same error with protein fastas I downloaded from ncbi's ftp for bacterial proteins: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.faa.gz
Do you happen to know in which of these FASTA files the error occurs? That would help me reproduce the issue.
I’m running createdb on 500+ fastas, so I’m not sure which one is causing the issue. However, I do get an invalid entry number (similar to how it was initially reported). Is there a way to trace the entry number back to the fasta causing the issue?
You could do a kind of binary search: take half of the FASTA files each time. It should take at most 9 createdb calls to figure it out.
However, I suspect that it's not actually a problem with individual files, but with the command line call becoming too long with the 500+ inputs.
You could do something like the following. Note that this loses the association with the source file, which matters if you want to use the qset,tset format-output columns in convertalis.
find . -name "*.fasta" -exec cat {} \; | mmseqs createdb stdin out_db
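The binary search over input files suggested above could be sketched as follows. This is not an MMseqs2 feature: bisect_fasta_list and try_createdb are hypothetical helper names, and the sketch assumes that exactly one file is bad, that paths contain no whitespace, that the list file ends with a newline, and that createdb exits with a non-zero status when it reports an invalid entry.

```shell
# Narrow a list of FASTA files (one path per line in "$1") down to one file
# for which the check command "$2" fails. The check command receives the
# candidate files as arguments and must return non-zero on failure.
bisect_fasta_list() {
    list=$1
    check=$2
    work=$(mktemp -d)
    cp "$list" "$work/cur"
    while [ "$(( $(wc -l < "$work/cur") ))" -gt 1 ]; do
        n=$(( $(wc -l < "$work/cur") ))
        half=$(( n / 2 ))
        head -n "$half" "$work/cur" > "$work/first"
        tail -n "+$(( half + 1 ))" "$work/cur" > "$work/second"
        # If the first half passes, the bad file must be in the second half.
        if "$check" $(cat "$work/first") > /dev/null 2>&1; then
            mv "$work/second" "$work/cur"
        else
            mv "$work/first" "$work/cur"
        fi
    done
    cat "$work/cur"
    rm -rf "$work"
}

# Hypothetical check: build a throwaway database from the given files.
try_createdb() {
    tmp=$(mktemp -d)
    if mmseqs createdb "$@" "$tmp/db"; then status=0; else status=1; fi
    rm -rf "$tmp"
    return "$status"
}

# Usage (fasta_list.txt is a placeholder):
# find . -name "*.fasta" > fasta_list.txt
# bisect_fasta_list fasta_list.txt try_createdb
```

Because each round halves the candidate list, about 500 files are narrowed to one in roughly 9 rounds, which matches the estimate above.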
Thank you! I found the fasta causing the issue, and it turns out that when I downloaded it from NCBI, the download did not complete. When I redownloaded, and ran createdb, I did not get the error.
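Since the cause here was an incomplete download of a gzip-compressed file, an integrity check before running createdb may catch this earlier. A minimal sketch, assuming the inputs are .gz files under one directory; find_corrupt_gz is a hypothetical helper name, and gzip -t is the standard gzip integrity test.

```shell
# Print the path of every .gz file under directory "$1" that fails gzip's
# integrity test, as a truncated download typically does.
find_corrupt_gz() {
    find "$1" -type f -name '*.gz' | while IFS= read -r f; do
        gzip -t "$f" 2> /dev/null || printf '%s\n' "$f"
    done
}

# Usage (the directory is a placeholder):
# find_corrupt_gz ./refseq_bacteria
```

Any file listed by this check would need to be downloaded again before building the database.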