
some sequences are missing in pyfastx.Fasta object

Open dawnmy opened this issue 2 years ago • 4 comments

I loaded a FASTA file containing 4542 sequences with an average length of 2.5 kb; however, only 4539 sequences were present in the pyfastx.Fasta object.

import pyfastx

fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540']  # raises KeyError

In addition, I can access a sequence, e.g. fa['contig_999'], the first time, but when I try to access it again I get a KeyError.

I am using pyfastx 0.8.4 with Python 3.7.

dawnmy, Mar 12 '22 22:03
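One way to double-check the expected record count independently of pyfastx is to scan the header lines directly. The sketch below is only a diagnostic, not part of pyfastx: the helper name summarize_fasta_headers is hypothetical, and it assumes that a sequence's key is the first whitespace-delimited token after ">". It also reports duplicated names, which is one possible (unconfirmed) reason for a count mismatch.

```python
from collections import Counter


def summarize_fasta_headers(lines):
    """Count FASTA records and list names that occur more than once.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: the record name is the first whitespace-delimited
    token after the leading '>'.
    """
    names = Counter()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            # An empty header (just '>') is counted under the empty name.
            names[parts[0] if parts else ""] += 1
    total = sum(names.values())
    duplicates = sorted(name for name, count in names.items() if count > 1)
    return total, duplicates


# Usage sketch (file name taken from the report above):
# with open("assembly.fasta") as fh:
#     total, duplicates = summarize_fasta_headers(fh)
# The total can then be compared with len(pyfastx.Fasta("assembly.fasta")).
```

If the total matches the expected 4542 and no duplicates are reported, the discrepancy is more likely to come from the index than from the file itself.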

Thank you for reporting this issue. I will check that. A new version will be released soon.

lmdu, Mar 15 '22 13:03

Any updates on this? I'm getting the same error. I'm loading a large FASTA file (~59M entries), and for some entries I get a "key does not exist" error, both when accessing by string key and when accessing by integer index. Reloading the file fixes the problem for the affected keys but shifts it to others. I'm using pyfastx 1.1.0.

floccinauc, Aug 31 '23 09:08

Thanks. Could you share your code and an HTTPS link to your data?

lmdu, Aug 31 '23 09:08

I'm using the decompressed version of this file: https://stringdb-downloads.org/download/protein.sequences.v12.0.fa.gz. As for my code, the simple snippet below does not seem to reproduce the error:

import pyfastx
from tqdm import tqdm

FILEPATH = "/dccstor/bmfmbio/datasets/STRING/all/protein.sequences.v12.0.fa"

loaded_fasta = pyfastx.Fasta(FILEPATH)
for idx in tqdm(range(int(5e7))):
    a = loaded_fasta[idx]

Maybe it has to do with multiple workers accessing the same fasta file? I'm afraid I cannot post the actual code I'm using at this point.

floccinauc, Aug 31 '23 11:08
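The multi-worker hypothesis above is not confirmed anywhere in this thread. If it does apply, one common precaution is to let every worker process open its own file handle instead of inheriting one that was created in the parent. The following is a minimal sketch of that pattern; PerProcessHandle is a hypothetical helper, not a pyfastx API, and it takes an arbitrary opener callable so it makes no assumptions about the library's internals.

```python
import os


class PerProcessHandle:
    """Lazily create one handle per process.

    `opener` is any zero-argument callable that returns a fresh handle.
    If the current process ID differs from the one that created the
    cached handle (for example after a fork), a new handle is opened.
    """

    def __init__(self, opener):
        self._opener = opener
        self._pid = None
        self._handle = None

    def get(self):
        pid = os.getpid()
        if self._handle is None or self._pid != pid:
            # First use in this process: open a private handle.
            self._handle = self._opener()
            self._pid = pid
        return self._handle


# Usage sketch (assumes pyfastx is installed and FILEPATH is defined):
# import pyfastx
# fasta = PerProcessHandle(lambda: pyfastx.Fasta(FILEPATH))
# record = fasta.get()[0]   # each worker calls fasta.get() before indexing
```

Whether this removes the intermittent key errors is untested here; it only rules out handle sharing between processes as a factor.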