genomelake

Migrate from pysam to pyfaidx

Open Avsecz opened this issue 6 years ago • 2 comments

Since the pysam dependency causes trouble (when the conda channels are misconfigured) and since chromosome name parsing is very fragile in pysam (https://github.com/kipoi/kipoi/issues/330#issuecomment-420054392), I propose migrating from pysam.FastaFile to pyfaidx.Fasta.
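For illustration, here is a minimal sketch of what the swap could look like for a single interval lookup. The helper name `fetch_sequence` is hypothetical and is not taken from genomelake or from the linked PR; it only assumes that both `pysam.FastaFile.fetch` and slicing a `pyfaidx.Fasta` record use 0-based, half-open coordinates.

```python
def fetch_sequence(fasta_path, chrom, start, end):
    """Return the sequence for the 0-based, half-open interval [start, end).

    Hypothetical helper, shown only to illustrate the proposed migration.
    """
    # Imported lazily so the module can be loaded without pyfaidx installed.
    from pyfaidx import Fasta

    # Rough pysam equivalent:
    #   pysam.FastaFile(fasta_path).fetch(chrom, start, end)
    fasta = Fasta(fasta_path)
    try:
        # Slicing a record yields a Sequence object; .seq is the plain string.
        return fasta[chrom][start:end].seq
    finally:
        fasta.close()
```

Note that pyfaidx builds a `.fai` index next to the FASTA file on first use if one does not already exist, so the directory needs to be writable in that case.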

Avsecz avatar Oct 01 '18 22:10 Avsecz

We're experiencing similar issues with pysam, and as the author of pyfaidx I'd be glad to help, so please let me know if there's anything I can do. It's worth noting that pyfaidx.FastaRecord implements a NumPy array interface, which could help with the efficiency (or at least the clarity) of your one-hot encoding. Since ML methods are something I'd like to support more directly, maybe it makes sense to provide this capability in a simpler and more efficient way.

mdshw5 avatar Feb 06 '20 17:02 mdshw5

Hi! Totally agree with it. Thanks for developing and supporting pyfaidx! I made a PR more than a year ago to solve this issue: https://github.com/kundajelab/genomelake/pull/19. @jisraeli @chrisprobert any plans to merge it?

Regarding the one-hot encoding, I think the main overhead is iterating through the Python string and populating the NumPy array. Genomelake uses Cython to do this efficiently. I think the best solution would be to write a separate library that does only one-hot encoding and then use it together with pyfaidx.
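As a rough illustration of avoiding the per-character Python loop, the sketch below uses a NumPy lookup table instead of Cython. This is not genomelake's implementation; the function name `one_hot_encode`, the A/C/G/T column order, and the choice to encode any other character (such as N) as all zeros are assumptions made for the example.

```python
import numpy as np

# Lookup table: one row per possible byte value, one column per base.
# Column order A, C, G, T is an assumption for this sketch.
_LOOKUP = np.zeros((256, 4), dtype=np.float32)
for _col, _base in enumerate("ACGT"):
    _LOOKUP[ord(_base), _col] = 1.0
    _LOOKUP[ord(_base.lower()), _col] = 1.0


def one_hot_encode(seq):
    """Encode a non-empty ASCII DNA string as a (len(seq), 4) float32 array.

    Characters other than A/C/G/T (either case) map to an all-zero row.
    """
    # View the sequence bytes as integer codes without a Python-level loop.
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    # Fancy indexing gathers one table row per base and returns a new array.
    return _LOOKUP[codes]
```

The string returned by pyfaidx (for example via `.seq` on a sliced record) could be passed straight to such a function. Whether this is fast enough compared with the Cython version would need to be benchmarked.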

Avsecz avatar Feb 07 '20 00:02 Avsecz