bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Regions file parsing broken

Open bamyasi opened this issue 4 years ago • 7 comments

$ bcftools --version bcftools 1.10.2 Using htslib 1.10.2 Copyright (C) 2019 Genome Research Ltd.

$ cat regions_test.txt NC_000001.11 1

$ cat -A regions_test.txt NC_000001.11^I1$

$ bcftools view -H -R regions_test.txt freq.vcf.gz

No output produced at all but if I use -r option instead it works as expected:

$ bcftools view -H -r NC_000001.11 latest_release/freq.vcf.gz | head -2 NC_000001.11 10498 rs1338146081 G A,T . . . AN:AC 2072:0,0 6:0,0 2:0,0 76:0,0 0:0,00:0,0 2:0,0 4:0,0 26:0,0 82:0,0 4:0,0 2188:0,0 NC_000001.11 10509 rs1262211809 G A . . . AN:AC 2072:0 6:0 2:0 76:0 0:0 0:0 2:0 4:0 26:0 82:0 4:0 2188:0

Same issue if I use -T option, tab-separated file is not parsed properly. If I omit '1' start coordinate from the file then I get the following error:

$ bcftools view -H -R regions_test.txt latest_release/freq.vcf.gz [E::bcf_sr_regions_init] Could not parse the file regions_test.txt, using the columns 1,2[,-1] Failed to read the regions: regions_test.txt

The only way to make it work is to use arbitrary large dummy end coordinate in the file:

$ cat regions_test.txt NC_000001.11 1 999000000000

$ bcftools view -H -R regions_test.txt latest_release/freq.vcf.gz | head -2 NC_000001.11 10498 rs1338146081 G A,T . . . AN:AC 2072:0,0 6:0,0 2:0,0 76:0,0 0:0,00:0,0 2:0,0 4:0,0 26:0,0 82:0,0 4:0,0 2188:0,0 NC_000001.11 10509 rs1262211809 G A . . . AN:AC 2072:0 6:0 2:0 76:0 0:0 0:0 2:0 4:0 26:0 82:0 4:0 2188:0

Unfortunately, description of regions file format in the documentation is overly concise and not very helpful. It is possible I am doing something wrong but I have ran out of ideas to try.

bamyasi avatar May 04 '20 22:05 bamyasi

The columns of the regions file can contain either positions (two-column format) or intervals (three-column format): CHROM, POS, and, optionally, END. In your file you requested the position 1 on NC_000001.1 which is not present in your VCF.

pd3 avatar May 05 '20 07:05 pd3

Hi Petr,

On Tuesday, May 5, 2020 3:54:48 AM EDT Petr Danecek wrote:

The columns of the regions file can contain either positions (two-column format) or intervals (three-column format): CHROM, POS, and, optionally, END. In your file you requested the position 1 on NC_000001.1 which is not present in your VCF.

Thanks for the clarification. So, this means -r option vs -R FILE one work differently. This is kind of counter-intuitive. Also, would be great to be able to specify just a list of chromosome names in the file to retain/exclude whole chromosomes. Could you please re-open this issue as FT?

Best,

Ivan

-- Ivan Adzhubey, Ph.D. Instructor Division of Genetics, Dept of Medicine Brigham & Women's Hospital Harvard Medical School New Research Building, Room 0464C 77 Avenue Louis Pasteur Boston, MA 02115 tel.: (617) 525-4728 fax: (617) 525-4705 web: http://genetics.bwh.harvard.edu/wiki/sunyaevlab/

The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.

bamyasi avatar May 05 '20 14:05 bamyasi

Has this been implemented?

I am interested in obtaining a subset of loci from some scaffolds (=CHR) and not others, and to include all regions within them. Is there a way to use the have it read such that the POS column in the region file can be set for all regions to be retained?

It is counter intuitive to me that -r is able to handle just CHR but -R requires both CHR and POS.

elizeng avatar Oct 27 '21 04:10 elizeng

Hi, I am facing the following issue when trying with the bcftools. (base) bilvikakasturi@Bilvikas-MacBook-Air pythonProject1 % bcftools view -G -H -R CUD_ldreg_chr3.txt ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr3.recalibrated_variants.annotated.vcf.gz > CUD_1000g_chr3.vcf &

[1] 82165 (base) bilvikakasturi@Bilvikas-MacBook-Air pythonProject1 % [E::bcf_sr_regions_init] Could not parse 1-th line of file CUD_ldreg_chr3.txt, using the columns 1,2[,-1] Failed to read the regions: CUD_ldreg_chr3.txt.

Below is the part of my file:

% cat CUD_ldreg_chr3.txt Chromosome Start chr3 50188115 chr3 50188115 chr3 50188115 chr3 50194955 chr3 50194955 chr3 50195267 chr3 50196401 chr3 10242228 chr3 50220177 chr3 50253085 chr3 50256191 chr3 50256199 chr3 50256722 chr3 50256722 chr3 50256936 chr3 50256936 chr3 50257499 chr3 50257499 chr3 50257690 chr3 50257690 chr3 50257690 I even tried with 4 columns but faced the same issue as above. 4 columns: Chromosome Start End Name chr3 50188115 50189075 rs58648044 chr3 50188115 50189074 rs58648044 chr3 50188115 50189073 rs58648044 chr3 50194955 50194956 rs58648044 chr3 50194955 50194956 rs58648044 chr3 50195267 50196516 rs58648044 chr3 50196401 50196515 rs58648044 chr3 10242228 10243743 rs62681760 chr3 50220177 50221486 rs56048629 chr3 50253085 50253184 rs56048629 chr3 50256191 50256320 rs56048629 chr3 50256199 50256320 rs56048629 chr3 50256722 50256852 rs56048629 chr3 50256722 50256852 rs56048629 chr3 50256936 50257090 rs56048629 chr3 50256936 50257090 rs56048629 chr3 50257499 50257714 rs56048629 chr3 50257499 50257714 rs56048629 chr3 50257690 50257714 rs56048629

Bilvikakasturi avatar Apr 20 '22 18:04 Bilvikakasturi

@Bilvikakasturi The program does print the cause: it cannot parse the 1st line of the file (pardon the grammatical error of the error message, the code does not distinguish between 1st, 2nd, 3rd, 4th, ... etc). The string Chromosome Start it not a valid region, headers should be either removed or commented out as #Chromosome Start

pd3 avatar Apr 21 '22 06:04 pd3

@pd3 It works in terminal thank you so much. But when I am trying to execute in the python script using os.system(path) it is still trying to access 1.8 and I am receiving following error: dyld: Library not loaded: @rpath/libdeflate.so Referenced from: /Users/bilvikakasturi/opt/anaconda3/bin/bcftools Reason: image not found

Could you please let me know ho wto run in the python script?

I even created the bcftools_env with 1.15 but not able to access it.

os.chdir('/Users/bilvikakasturi/opt/anaconda3/envs/bcftools_env') os.system('bcftools') Out[23]: 6 dyld: Library not loaded: @rpath/libdeflate.so Referenced from: /Users/bilvikakasturi/opt/anaconda3/bin/bcftools Reason: image not found

Bilvikakasturi avatar Apr 21 '22 17:04 Bilvikakasturi

Hi @pd3 You can close this issue as I tried to replace 1.8 version in my system with 1.15 and it works fine. Thank you for the suggestions.

Bilvikakasturi avatar Apr 21 '22 19:04 Bilvikakasturi