Chromosome data in VCF
Hi, My SLiM simulations include multiple chromosomes (recombination rates are set at 0.5 between them, as described here ) and I'd like to create a VCF that reflects this structure. Sadly, the write_vcf method currently only accepts strings as input for the "contig_id" parameter, thus forcing every value in the CHROM column of the output VCF to be the same. I can easily create a list of the values I want in this column (since I know the end position of each chromosome), which makes it possible to parse the VCF myself, and I guess that creating a VCF for each chromosome and fusing them afterward would also be an option, but VCF files are rather heavy and many simulations are gonna run, meaning that I'd rather directly give write_vcf the informations needed to create the right file. Do you plan to give the possibility of creating VCF with multiple chromosomes in the future ? Thank you for reading, have a nice day.
Hi @MleGhis!
Thanks for your suggestion, it is normal practice to have a single chromosome per tree sequence, so the API was initially designed to support this. But as the comment you linked suggests there may be reasons for doing this.
It seems write_vcf could be adapted to take an array, but this doesn't seem ideal as the contig id should come from site metadata ideally. However, it might make sense as a solution for now as we don't have full multi-chrom support.
Would you like to prepare a PR of how this would work for the community to review? There is a development guide in the docs.
Thanks @benjeffery for your prompt reply. Unfortunately, as an intern, I might not have the time to do this right now, and I'll use another (unoptimised) solution for now. But I'll keep that in mind, and I might try later.
There's a few ways we could enable this I think:
- Add options to specify the coordinate ranges required for output in the VCF, and an option to suppress header generation. You could then write the different chunks to the same file, efficiently. I guess there might be some trouble with exact VCF conformance, if you want the contigs to be defined in the header, though.
- We add support for
contig_idbeing some kind of mapping from interval -> string. I guess we should follow the conventions in RateMap to do this. Then we could put in the header in the correct form.