
Parallelize fixFasta

Open hermidalc opened this issue 1 year ago • 0 comments

The initial fixFasta step of Pufferfish indexing is single-threaded and takes a long time when the reference contains many sequences. From the outside, it looks like this step could be parallelized: split the input reference FASTA into parts, run fixFasta on each part, and concatenate the fixed parts into one file. The splitting could be done with, e.g., the SeqKit toolkit's fast split2 command, which accepts a gzipped or plain input FASTA and can write gzipped or plain split files (to save disk space, for example).
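To make the proposal concrete, here is a minimal Python sketch of the split / process / concatenate pattern. It is not Pufferfish code: `placeholder_fix` is a hypothetical stand-in for whatever fixFasta does to one chunk (here it only uppercases sequences), and the splitting is done in plain Python rather than with SeqKit. It also assumes that per-chunk results can be merged by simple concatenation, which may not hold if fixFasta relies on state shared across the whole reference (for example when detecting duplicate sequences).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA text, joining wrapped lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records: List[Record], n_parts: int) -> List[List[Record]]:
    """Split records into at most n_parts contiguous chunks, preserving order."""
    if n_parts < 1:
        raise ValueError("n_parts must be >= 1")
    if not records:
        return []
    size = -(-len(records) // n_parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def placeholder_fix(chunk: List[Record]) -> List[Record]:
    """Hypothetical stand-in for the per-chunk work; NOT the real fixFasta logic."""
    return [(header, seq.upper()) for header, seq in chunk]


def parallel_fix(text: str, n_parts: int,
                 fix: Callable[[List[Record]], List[Record]] = placeholder_fix) -> str:
    """Process chunks concurrently and concatenate the results in input order."""
    chunks = split_records(list(parse_fasta(text)), n_parts)
    # Threads keep the sketch simple; they are adequate when each worker
    # mostly waits on an external process handling its own split file.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        fixed_chunks = list(pool.map(fix, chunks))  # map() preserves chunk order
    return "".join(f">{h}\n{s}\n" for chunk in fixed_chunks for h, s in chunk)


if __name__ == "__main__":
    example = ">a\nacgt\nnn\n>b\nggcc\n>c\nttaa\n"
    print(parallel_fix(example, 2), end="")
```

Contiguous chunks are used so that concatenating the results keeps the original record order; a splitter that distributes records differently would need an extra reordering step if order matters to later indexing stages.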

hermidalc · Sep 11 '22 15:09