Enable pyfaidx to accept gcp paths for compressed fasta files

Open archanaraja opened this issue 5 years ago • 2 comments

Hi, I'm currently getting an error when loading compressed FASTA files from GCP paths. It would be great if this were supported, as it is in pandas. Are there any plans to add it in the near future? Thanks, Archana

archanaraja avatar May 21 '20 21:05 archanaraja

Hey @archanaraja, thanks for raising this issue, and apologies for the late response. I'm not familiar with the Google Cloud Storage APIs and was not planning to implement this. If you can describe your use case in a bit more detail I may be able to help. It looks like Google's Python package implements byte ranges (https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.download_blob_to_file), so assuming the FASTA index file is present I can imagine pyfaidx reading the FAI into memory and then making calls to GCP for the specific sequences we need. If the FASTA is not indexed, then pyfaidx would need to stream the entire FASTA and produce an index. That's not very efficient, and it also raises the question of what to do with the newly created FASTA index: do we store it somewhere for re-use, or do we rebuild the index from scratch the next time we initialize?
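As a rough illustration of the indexed case described above, here is a minimal sketch that parses an FAI file held in memory and converts a base range into the byte range to request. It is not pyfaidx's implementation. The names `FaiEntry`, `parse_fai`, `byte_range` and `fetch` are hypothetical, and `read_range` stands in for whatever ranged download the storage client offers. The sketch assumes an uncompressed FASTA, since plain gzip does not allow random access by byte offset.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class FaiEntry(NamedTuple):
    length: int     # number of bases in the sequence
    offset: int     # byte offset of the first base in the FASTA file
    linebases: int  # bases per full line
    linewidth: int  # bytes per full line, including the line terminator


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Parse the first five tab-separated columns of a .fai index."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        entries[name] = FaiEntry(
            int(length), int(offset), int(linebases), int(linewidth)
        )
    return entries


def byte_range(entry: FaiEntry, start: int, end: int) -> Tuple[int, int]:
    """Return inclusive (first, last) byte positions for bases [start, end)."""
    if not 0 <= start < end <= entry.length:
        raise ValueError("requested region is outside the sequence")

    def position(base: int) -> int:
        # Whole lines before this base, plus the column within its line.
        full_lines, column = divmod(base, entry.linebases)
        return entry.offset + full_lines * entry.linewidth + column

    return position(start), position(end - 1)


def fetch(
    entry: FaiEntry,
    start: int,
    end: int,
    read_range: Callable[[int, int], bytes],
) -> str:
    """Fetch bases [start, end) using a ranged read, dropping line breaks.

    read_range(first, last) must return the bytes first..last inclusive.
    Assumption to verify against the google-cloud-storage docs: for GCS it
    could wrap something like blob.download_as_bytes(start=first, end=last).
    """
    first, last = byte_range(entry, start, end)
    raw = read_range(first, last)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Keeping the ranged read behind a callable means the offset arithmetic can be checked against a local file or an in-memory bytes object before any cloud client is involved.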

mdshw5 avatar May 30 '20 00:05 mdshw5

Thank you for the prompt response. I was trying to use the kit to compute the length of FASTQs in a file as a QC step. Since that requires an index and the files are not indexed, it looks like there is no easy fix. All my data is on GCP, and the pandas package can read files directly from GCP. It looks like I should use byte ranges, as you suggested, to make faidx work.
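For a length-only QC on files that have no index, one option is to stream the compressed file once and count bases per record. This is a minimal sketch for gzip-compressed FASTA (FASTQ would need a different record parser). The function name `sequence_lengths` is hypothetical, and it takes any binary file-like object. Whether a GCS blob can supply one (for example via `blob.open("rb")` in google-cloud-storage) is an assumption to verify.

```python
import gzip
from typing import BinaryIO, Dict


def sequence_lengths(handle: BinaryIO) -> Dict[str, int]:
    """Stream a gzip-compressed FASTA and return {record name: length}."""
    lengths: Dict[str, int] = {}
    name = None
    # gzip.open accepts an existing binary file object as well as a path.
    with gzip.open(handle, "rt", encoding="ascii") as text:
        for line in text:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

This reads the whole file once and keeps only the running counts in memory, so it avoids building or storing an index at the cost of a full download per run.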

Thanks for the detailed explanation.

Archana

archanaraja avatar May 30 '20 03:05 archanaraja