cyvcf2

Streaming VCFs using smart_open?

Open dbrami opened this issue 2 years ago • 3 comments

Hi, I've seen the open issue #174 (Can't use cyvcf2 against AWS S3), and I'm assuming the intent there is to download VCFs locally so that cyvcf2 can open them. My question is: how easy or hard is it to use smart_open in conjunction with cyvcf2 to stream only the needed regions of a VCF from AWS S3, instead of downloading the whole VCF first?

dbrami, Jan 06 '23 20:01

Hi, you should be able to use cyvcf2 directly. However, if the handle you pass to VCF has a fileno method or is an integer, then it will be treated as a file descriptor. Note that htslib will handle AWS authentication for you if you set, for example, these environment variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
AWS_DEFAULT_REGION
AWS_DEFAULT_PROFILE
AWS_PROFILE
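
A minimal sketch of what this could look like, not tested against a real bucket. It assumes the htslib bundled with cyvcf2 was built with S3 (libcurl) support and that an index (.tbi or .csi) sits next to the VCF so region queries can seek instead of reading the whole file. The helper names `missing_aws_credentials` and `count_variants` are made up for illustration, and treating the key pair as the required minimum is an assumption of the sketch (a profile may be enough in some setups).

```python
import os

# Assumption: the access key pair is the minimum needed; htslib reads it from the environment.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of the credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in AWS_ENV_VARS if not env.get(name)]


def count_variants(url, region=None):
    """Open a local or remote VCF and count records, optionally within a region."""
    from cyvcf2 import VCF  # imported lazily so the check above works without cyvcf2

    vcf = VCF(url)  # pass the URL string itself, not a file handle
    records = vcf(region) if region else vcf  # vcf("chr:start-end") needs an index
    n = sum(1 for _ in records)
    vcf.close()
    return n
```

With a hypothetical bucket and key, a call such as `count_variants("s3://some-bucket/sample.vcf.gz", "chr1:1-1000000")` would then only fetch the blocks covering that region, provided the assumptions above hold.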

brentp, Jan 07 '23 16:01

Could you elaborate please? I would just:

  • set all the needed AWS OS environment variables you listed
  • use cyvcf2 like so:
from cyvcf2 import VCF

for variant in VCF('s3://abc-def-results/P-20230106-0003/folder1/sampe1.vcf.gz'):
    ...  # loop body omitted in the original post

dbrami, Jan 08 '23 18:01

Hi, the proposed solution does not seem to work. Did I miss something?

(base) ➜  AncestryML git:(main) ✗ export AWS_ACCESS_KEY_ID=ABCDE...
(base) ➜  AncestryML git:(main) ✗ export AWS_SECRET_ACCESS_KEY=DEFGH...

(ML) ➜  AncestryML git:(main) ✗ python VCF_to_hash.py -p P-20230109-1234 -v s3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
2023-01-18 16:57:55,894 - root - INFO - Logger initialized
2023-01-18 16:57:55,895 - root - INFO - Parsing command-line parameters
2023-01-18 16:57:55,898 - root - INFO - Parsing general config file cfg/GenomeHashConfig.yaml
2023-01-18 16:57:55,911 - root - INFO - Processing VCF:	/Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
[E::hts_open_format] Failed to open file "/Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz" : No such file or directory
Traceback (most recent call last):
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 352, in <module>
    main()
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 315, in main
    sample_variants_dict, samples_list, sample_project_dict = parse_vcf(vcf_list, df_var_signature,
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 218, in parse_vcf
    vcf = VCF(input_vcf)
  File "cyvcf2/cyvcf2.pyx", line 258, in cyvcf2.cyvcf2.VCF.__init__
  File "cyvcf2/cyvcf2.pyx", line 190, in cyvcf2.cyvcf2.HTSFile._open_htsfile
OSError: Error opening /Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz

dbrami, Jan 19 '23 01:01
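
The path in the error above starts with the local working directory and contains `s3:/` with a single slash. This suggests the S3 URL was turned into an absolute local path before it reached `VCF(...)`. On POSIX systems `os.path.abspath` does exactly this to a URL: it prepends the current directory and collapses `//` into `/`. VCF_to_hash.py is not shown, so this is only a plausible cause. Below is a sketch of a guard, assuming the script normalizes its input paths somewhere; `normalize_vcf_location` is a hypothetical helper name.

```python
import os


def normalize_vcf_location(location):
    """Make local paths absolute, but pass URLs such as s3://... through unchanged."""
    if "://" in location:
        # Remote URL: leave it untouched so htslib sees the scheme.
        return location
    return os.path.abspath(location)
```

The result of `normalize_vcf_location(input_vcf)` could then be handed to `VCF(...)` in place of the path the script currently builds.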