Error: Fasta entry is invalid in createdb

Open ncfrey opened this issue 1 year ago • 11 comments

Expected Behavior

I'm running mmseqs createdb on a large fasta file. I have checked that every entry is "valid" (only valid AA characters, can be read by biopython, there are no spaces between > and the accession ids). I expect successful db creation or graceful error handling.

Is there a way to ignore invalid fasta entries in db creation?

Current Behavior

Fails with error: "Fasta entry is invalid"

Steps to Reproduce (for bugs)

mmseqs createdb

MMseqs Output (for bugs)

Fasta entry <entry> is invalid

Your Environment

MMseqs2 Version: 13.45111 installed with conda

ncfrey avatar Oct 12 '23 14:10 ncfrey

Does this issue also happen in release 14?

milot-mirdita avatar Oct 12 '23 14:10 milot-mirdita

@milot-mirdita yes, i installed the latest version (from https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz ) and see the same error.

ncfrey avatar Oct 12 '23 14:10 ncfrey

I think this error can happen if there is a space (or possibly other whitespace) after the > and before the accession.
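A quick way to test this hypothesis is to scan the headers directly. The sketch below is only an illustration: the function name `find_suspect_headers` is hypothetical, and it assumes that an empty header or whitespace right after `>` is the trigger, as suggested above. It does not reproduce the actual check inside createdb.

```python
def find_suspect_headers(lines):
    """Yield (line_number, header) for FASTA headers that are empty
    or have whitespace immediately after '>'.

    Assumption: this mirrors the suspected cause discussed in the thread,
    not the exact validation MMseqs2 performs.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            rest = line[1:].rstrip("\r\n")
            if not rest or rest[0].isspace():
                yield number, line.rstrip("\r\n")


def scan_fasta(path):
    """Convenience wrapper: scan a plain-text (uncompressed) FASTA file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return list(find_suspect_headers(handle))
```

Calling `scan_fasta("input.fasta")` (the file name is a placeholder) returns a list of line numbers and headers to inspect; an empty list means this particular cause can probably be ruled out.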

milot-mirdita avatar Oct 12 '23 15:10 milot-mirdita

@milot-mirdita i've seen that error here: https://github.com/soedinglab/MMseqs2/issues/170 and have checked to make sure there are no spaces after >

ncfrey avatar Oct 12 '23 15:10 ncfrey

Then I am quite confused :D

Would it be possible to share the FASTA file that causes this error? If not, can you try to "bisect" the file, converting each half until you identify which entry is broken?

milot-mirdita avatar Oct 12 '23 15:10 milot-mirdita

i'll try the bisecting method, thanks @milot-mirdita ! will report back if i figure it out.

ncfrey avatar Oct 12 '23 15:10 ncfrey

Any updates on what caused the issue? I'm running into the same error with protein fastas I downloaded from ncbi's ftp for bacterial proteins: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.faa.gz

mcn3159 avatar Apr 26 '24 18:04 mcn3159

Do you happen to know in which of these FASTA files the error occurs? That would help me reproduce the issue.

milot-mirdita avatar Apr 29 '24 09:04 milot-mirdita

I’m running createdb on 500+ fastas, so I’m not sure which one is causing the issue. However, I do get an invalid entry number (similar to how it’s initially reported). Is there a way to trace the entry number to the fasta causing the issue?

mcn3159 avatar Apr 29 '24 13:04 mcn3159

You could do a kind of binary search, just take half of the FASTA files each time. It should take at most 9 createdb calls to figure it out.

However, I suspect that it's not actually a problem with individual files, but with the command line call becoming too long with the 500+ inputs.

You could do something like the following. However, that would lose the association with the source file, which matters if you want to use the qset,tset format-output columns in convertalis.

find . -name "*.fasta" -exec cat {} \; | mmseqs createdb stdin out_db
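The binary search over input files mentioned earlier in this comment could be automated roughly as follows. This is a sketch under assumptions: `find_failing` and `createdb_fails` are hypothetical helper names, exactly one input file is assumed to be broken, and `mmseqs createdb` is assumed to exit with a non-zero status when it hits the invalid entry.

```python
import subprocess
import tempfile
from pathlib import Path


def createdb_fails(fasta_files):
    """Return True if `mmseqs createdb` exits non-zero for these inputs.

    Assumption: a failing createdb run is signalled by its exit status.
    The database is written to a throwaway temporary directory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        cmd = ["mmseqs", "createdb", *map(str, fasta_files), str(Path(tmp) / "db")]
        return subprocess.run(cmd, capture_output=True).returncode != 0


def find_failing(items, fails):
    """Binary search for the single item that makes `fails` return True.

    Assumes fails(items) is True for the full list and that a subset
    fails if and only if it contains the broken item.
    """
    items = list(items)
    while len(items) > 1:
        half = items[: len(items) // 2]
        # Keep whichever half still reproduces the failure.
        items = half if fails(half) else items[len(half):]
    return items[0]
```

For example, `find_failing(sorted(Path(".").glob("*.fasta")), createdb_fails)` would return the suspect file after roughly log2(n) createdb runs. If several files are broken, it finds only one of them, so it would need to be rerun after fixing each.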

milot-mirdita avatar May 01 '24 04:05 milot-mirdita

Thank you! I found the fasta causing the issue, and it turns out that when I downloaded it from NCBI, the download did not complete. When I redownloaded, and ran createdb, I did not get the error.
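For anyone hitting the same problem: if the downloads are still gzip-compressed (the NCBI files above end in `.faa.gz`), an incomplete download can usually be detected before running createdb by decompressing each file to the end. A minimal sketch, with `gzip_is_complete` as a hypothetical helper name; it cannot detect truncation in files that were already decompressed.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end.

    A truncated download typically raises EOFError when the stream ends
    before the end-of-stream marker; corrupt data raises OSError
    (including gzip.BadGzipFile) or zlib.error.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # Discard the data; only the integrity check matters.
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

Running this over every downloaded file and redownloading those that return False would have flagged the incomplete file without needing the binary search.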

mcn3159 avatar May 03 '24 16:05 mcn3159