bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools reheader -f/--fai does not preserve the order of the contigs (even when it could)

Open freeseek opened this issue 4 years ago • 4 comments

I have noticed the following behavior.

Suppose I have two files and a reference index:

echo -e "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=248956422>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t1\t.\tA\tC\t.\t.\t." | bcftools view --no-version -Ob -o chr1.bcf
echo -e "##fileformat=VCFv4.2\n##contig=<ID=chr2,length=242193529>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr2\t1\t.\tG\tT\t.\t.\t." | bcftools view --no-version -Ob -o chr2.bcf
echo -e "chr1\t248956422\t112\t70\t71\nchr2\t242193529\t252513167\t70\t71" > ref.fai

I then use the reference index to reheader the VCFs:

bcftools reheader -f ref.fai -o chr1.reheader.bcf chr1.bcf
bcftools reheader -f ref.fai -o chr2.reheader.bcf chr2.bcf

Unintuitively, the order of the contigs is not the same in the two files now:

$ bcftools view --no-version chr1.reheader.bcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	A	C	.	.	.

$ bcftools view --no-version chr2.reheader.bcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr1,length=248956422>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr2	1	.	G	T	.	.	.

This makes it impossible to concatenate files with the -n option:

$ bcftools concat -n chr1.reheader.bcf chr2.reheader.bcf
Checking the headers of 2 files.
Cannot use --naive, use --naive-force instead: different order the tag contig/chr1 in chr1.reheader.bcf vs chr2.reheader.bcf

And, of course, using the --naive-force option would cause inappropriate behavior:

$ bcftools concat -n --naive-force chr1.reheader.bcf chr2.reheader.bcf | bcftools view --no-version
Concatenating chr1.reheader.bcf	0.000709 seconds
Concatenating chr2.reheader.bcf	0.000503 seconds
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	A	C	.	.	.
chr1	1	.	G	T	.	.	.

However, I do understand why bcftools reheader -f would not want to reorder the contigs, as that would potentially cause the VCF file to become unsorted. At the same time I thought that the bcftools concat -n option was exactly supposed to work after the header of the VCF files were updated with the bcftools reheader -f option.

freeseek avatar Feb 20 '21 02:02 freeseek

The problem is caused by the fact that chromosome names are not stored in the BCF body, only pointers to the header. Therefore the header cannot be changed without also changing the body (in general case). The only solution is to provide both files with all contig names at the time they are created.

pd3 avatar Feb 24 '21 15:02 pd3

To give some context, this issue arises when using the IMPUTE5 software which does not copy the VCF header from the input VCF files (panel and target) and does not provide an option for modifying the header in the output VCF file. I do understand that the contig order cannot be modified changing the header only. The bcftools concat command can indeed reorder the contigs. Is there any other command in BCFtools that can do that? What would you do if you wanted to generate two VCF files that can be concatenated with bcftools concat -n if you started with two VCFs like the ones in this example? My goal is to provide an imputation pipeline outputting single chromosome VCFs and it would be a plus if users could quickly concatenate them if that's what they want to do.

freeseek avatar Feb 24 '21 15:02 freeseek

Hmm, this is tricky.

  • if there was an intermediate VCF file, I'd run reheader -f on those before converting to BCF. I think the order in VCFs should be preserved and if not, could be made so.
  • a new option could be added to bcftools sort to reorder contig blocks to reflect the order in a provided fasta file
  • concatenate could be run without the -n option. The advantage of speed would be lost, but then the above steps require more processing as well. But those can be done in parallel, so still worth considering.

pd3 avatar Feb 24 '21 15:02 pd3

Okay, thank you for the information. In this case IMPUTE5 directly generates BCFs (which is nice!). I suppose a possible hack would be then to concatenate and empty VCF file with the correct header, like this:

echo -e "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" > tmp.vcf
bcftools reheader -f ref.fai tmp.vcf > fai.vcf
bcftools concat --no-version -Ob -o chr1.reheader.bcf fai.vcf chr1.bcf
bcftools concat --no-version -Ob -o chr2.reheader.bcf fai.vcf chr2.bcf

This seems to actually work:

$ bcftools concat -n chr1.reheader.bcf chr2.reheader.bcf | bcftools view --no-version
Concatenating chr1.reheader.bcf	0.000709 seconds
Concatenating chr2.reheader.bcf	0.000503 seconds
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	A	C	.	.	.
chr2	1	.	G	T	.	.	.

freeseek avatar Feb 24 '21 16:02 freeseek