cyvcf2
Streaming VCFs using smart_open?
Hi, I've seen open issue #174 (Can't use cyvcf2 against AWS S3), and I'm assuming the intent is for VCFs to be downloaded locally before being opened by cyvcf2. My question: how easy or hard would it be to use smart_open together with cyvcf2 to stream only the needed regions of a VCF from AWS S3, instead of downloading the whole VCF first?
Hi,
You should be able to use cyvcf2 directly. Note, though, that if the handle you pass to VCF has a fileno method or is an integer, then it will be treated as a file descriptor.
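To make that rule concrete, here is a small sketch of the check as described in this reply. `looks_like_fd` is a hypothetical helper, not a cyvcf2 function, and cyvcf2's actual logic may differ in detail:

```python
import os

def looks_like_fd(handle):
    """True if `handle` is an integer or exposes a fileno method (hypothetical helper)."""
    return isinstance(handle, int) or hasattr(handle, "fileno")

# An open file object has a fileno method.
with open(os.devnull) as fh:
    opened_file_is_fd = looks_like_fd(fh)

# A plain string such as an S3 URL is neither an int nor a file-like object.
url_is_fd = looks_like_fd("s3://bucket/key.vcf.gz")
```

Under this rule a plain string is not taken as a descriptor, so a URL passed as a string would be handed on as a path or URL rather than as an already-open file.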
Note that htslib will handle AWS authentication for you if you set, for example, the following environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
AWS_DEFAULT_REGION
AWS_DEFAULT_PROFILE
AWS_PROFILE
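Before opening an S3 URL, it can help to confirm which of these variables the Python process actually sees. A minimal sketch, assuming only the variable names listed above; `aws_env_status` is a hypothetical helper, and which variables are actually required depends on your credentials setup:

```python
import os

# Variable names copied from the list above.
AWS_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
    "AWS_DEFAULT_PROFILE",
    "AWS_PROFILE",
]

def aws_env_status(env=None):
    """Map each variable name to True if it is set to a non-empty value."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name)) for name in AWS_VARS}

if __name__ == "__main__":
    # Report presence only, never the secret values themselves.
    for name, is_set in aws_env_status().items():
        print(f"{name}: {'set' if is_set else 'not set'}")
```

Running this in the same shell session as the script shows whether the exports reached the interpreter.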
Could you elaborate please? I would just:
- set all the needed AWS OS environment variables you listed
- use cyvcf2 like so:
from cyvcf2 import VCF
for variant in VCF('s3://abc-def-results/P-20230106-0003/folder1/sampe1.vcf.gz'):
    print(variant.CHROM, variant.POS)  # placeholder loop body
Hi, the proposed solution does not seem to work. Did I miss something?
(base) ➜ AncestryML git:(main) ✗ export AWS_ACCESS_KEY_ID=ABCDE...
(base) ➜ AncestryML git:(main) ✗ export AWS_SECRET_ACCESS_KEY=DEFGH...
(ML) ➜ AncestryML git:(main) ✗ python VCF_to_hash.py -p P-20230109-1234 -v s3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
2023-01-18 16:57:55,894 - root - INFO - Logger initialized
2023-01-18 16:57:55,895 - root - INFO - Parsing command-line parameters
2023-01-18 16:57:55,898 - root - INFO - Parsing general config file cfg/GenomeHashConfig.yaml
2023-01-18 16:57:55,911 - root - INFO - Processing VCF: /Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
[E::hts_open_format] Failed to open file "/Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz" : No such file or directory
Traceback (most recent call last):
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 352, in <module>
    main()
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 315, in main
    sample_variants_dict, samples_list, sample_project_dict = parse_vcf(vcf_list, df_var_signature,
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 218, in parse_vcf
    vcf = VCF(input_vcf)
  File "cyvcf2/cyvcf2.pyx", line 258, in cyvcf2.cyvcf2.VCF.__init__
  File "cyvcf2/cyvcf2.pyx", line 190, in cyvcf2.cyvcf2.HTSFile._open_htsfile
OSError: Error opening /Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
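One thing the log shows: the path cyvcf2 tried to open is the working directory joined onto the S3 URL, with `s3://` collapsed to `s3:/`. That is what `os.path.abspath` does to a URL, so the script may be normalizing its `-v` argument before calling VCF. A minimal sketch, assuming that is the cause; `resolve_vcf_path` and the scheme list are hypothetical and not part of cyvcf2 or VCF_to_hash.py:

```python
import os
from urllib.parse import urlparse

# Hypothetical set of URL schemes to pass through untouched.
REMOTE_SCHEMES = {"s3", "http", "https", "ftp", "gs"}

def resolve_vcf_path(path):
    """Return remote URLs unchanged; make local paths absolute."""
    if urlparse(path).scheme.lower() in REMOTE_SCHEMES:
        return path
    return os.path.abspath(path)
```

With something like this, `VCF(resolve_vcf_path(input_vcf))` would receive the `s3://...` string exactly as typed on the command line, while local paths are still made absolute.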