Jerome Kelleher
Can you dig in a bit more here @benjeffery, and maybe give us the output of ``partition_into_regions``? I wonder if the CSI index just isn't splitting up these small...
Ah, I'm just after hitting this issue now. CSI indexed VCFs with multiple contigs in the header are not being treated correctly.
Marking this as a bug, as there's a good chance of data-loss as a result of this.
Closing in favour of #1201 (I think this is an instance of that bug)
> I've looked at the other format parsing methods and they both (bgen and plink) load the data into cluster memory, instead of writing parts to disk as the VCF...
Raising this one again - it really would be helpful to get some debug output when doing large VCF conversions. I'm very much flying blind trying to make a large...
Can we do the scatter plot with matplotlib or something to avoid the problem?
Also useful to think about how we can get the number of alleles per site - filtering on biallelic sites is a basic thing people will want to do.
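As a rough sketch of what "number of alleles per site" could mean in practice, here is a plain-Python version that counts the distinct allele indexes observed in the calls at each site. The function name `count_alleles_per_site`, the nested-list layout (sites, samples, ploidy) and the `-1` missing marker are all assumptions for illustration, not an existing API. Note that it counts alleles actually observed in the genotypes, which may be fewer than the alleles declared for the site in the VCF.

```python
def count_alleles_per_site(genotypes, missing=-1):
    """Return the number of distinct alleles observed at each site.

    `genotypes` is a nested list shaped (sites, samples, ploidy) holding
    allele indexes; `missing` marks a missing call (assumed to be -1 here).
    """
    counts = []
    for site in genotypes:
        # Collect every non-missing allele index seen across all samples.
        observed = {allele for call in site for allele in call if allele != missing}
        counts.append(len(observed))
    return counts


# Hypothetical toy data: 3 sites, 3 diploid samples.
genotypes = [
    [[0, 0], [0, 1], [1, 1]],   # two alleles observed
    [[0, 0], [0, 0], [0, -1]],  # one allele observed, one missing call
    [[0, 1], [2, 1], [0, 2]],   # three alleles observed
]

num_alleles = count_alleles_per_site(genotypes)
# Boolean mask selecting sites with exactly two observed alleles.
biallelic_mask = [n == 2 for n in num_alleles]
```

A mask like `biallelic_mask` could then be used either to keep or to drop the biallelic sites, depending on which filter is wanted.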
Temporarily putting this into the 0.6.0 release in case it can be changed now without breaking people's code
Dropping this from 0.6.0 then