pyfaidx icon indicating copy to clipboard operation
pyfaidx copied to clipboard

Extend Faidx so it can be backed by memory mapped file

Open dhauge opened this issue 9 years ago • 1 comments

It would be nice if a Faidx instance could be constructed that accesses the fasta file via a memory mapped file rather than explicit file seek/read. This would have several benefits, including

  • Removing the need to protect reads against multiple threads
  • Avoiding the need to make system calls for each sequence read

I'd imagine adding a parameter 'file_mapped=True' to the 'Faidx' constructor, and leveraging the 'mutable' parameter to determine if the memory mapping is read-only. The 'Faidx.file' object should probably be wrapped in a small object that implements random access read/write of sequences, which could then work for both normal file access and memory mapped access.

If you would consider this an acceptable addition, and would accept a patch, I should be able to provide a potential implementation in the next few days.

Thanks, Doug

dhauge avatar Aug 04 '16 23:08 dhauge

Thanks for the suggestion! I actually implemented Faidx using a memory-mapped file in very early versions, but didn't see much performance benefit. I think that you only avoid a system call when the data resides in the OS buffer. Otherwise you generate a page fault which causes the OS to read the data from disk, which isn't really faster than reading from disk using a system call. Also I wanted to avoid any issues with mapping larger FASTA files under 32-bit OSes.

However, if you would like to submit a PR which adds the type of wrapper you describe then I'd be all for it, since there are definitely use cases where there could be a performance benefit. I'd prefer to leave the locks in place, unless there is a clear performance benefit to be had from removing them. Also, the locks around writing a new index file should stay in place no matter what.

Looking forward to seeing what you come up with!

mdshw5 avatar Aug 05 '16 15:08 mdshw5