Phylign

batch_align.py loads up the whole query fasta into RAM

Open leoisl opened this issue 1 year ago • 2 comments

See https://github.com/karel-brinda/mof-search/blob/e8e681b67538c3eadff2e577581a36183cd27303/scripts/batch_align.py#L150-L154

This clearly does not scale well when the query fasta is massive (e.g. read sets). One easy and quick way to save some RAM is to load only the queries that map to the given batch. Should I implement this @karel-brinda, as it is pretty quick to do? Of course, if all or most of the read set maps to the batch, we will still load a lot. The only way around this, I think, is to build a fasta index on the query fasta and load only the fasta IDs, with the sequences read from disk on demand...

This does not matter much if the mof-search use case does not involve mapping read sets, which is what I assumed from the beginning, but I know you've been mapping ONT datasets with it...
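
For the second idea (indexing the query fasta and reading sequences from disk on demand), here is a minimal sketch using Biopython's `SeqIO.index`. This is not the current `batch_align.py` code; the function name and the `batch_query_ids` input are made up for illustration:

```python
from Bio import SeqIO


def align_batch(query_fasta_path, batch_query_ids):
    """Process only the queries hitting this batch, reading sequences on demand.

    batch_query_ids is a hypothetical input: the IDs of the queries that
    map to the current batch.
    """
    # SeqIO.index keeps only record IDs and file offsets in memory;
    # each sequence is parsed from disk only when its record is accessed.
    query_index = SeqIO.index(query_fasta_path, "fasta")
    try:
        for query_id in batch_query_ids:
            record = query_index[query_id]  # read from disk only now
            # ... hand record.seq to the per-batch aligner here ...
            print(record.id, len(record.seq))
    finally:
        query_index.close()
```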

leoisl · Apr 03 '23 14:04