Docs: what is the cost of `read_chunk_length` in `vcf_to_zarr`?
The docs for `read_chunk_length` currently say:

> Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by using a value lower than chunk_length with a small cost in extra run time. The increase in runtime becomes higher as the ratio of read_chunk_length to Defaults to None, which means that a value equal to chunk_length is used.
What does "small" mean here? Also, the sentence beginning "The increase in runtime becomes higher..." appears to be incomplete.
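For context, my mental model of the mechanism, going by the docs, is roughly the sketch below. This is just a toy illustration, not sgkit's actual implementation; `copy_in_read_chunks` and the shapes are made up. It shows why a smaller `read_chunk_length` caps memory (only that many variants are held at once) at the cost of rewriting each on-disk chunk several times.

```python
import numpy as np
import zarr

def copy_in_read_chunks(src, dst, read_chunk_length):
    # Hold only read_chunk_length variants (rows) in memory at a time.
    n_variants = src.shape[0]
    for start in range(0, n_variants, read_chunk_length):
        stop = min(start + read_chunk_length, n_variants)
        dst[start:stop] = np.asarray(src[start:stop])

# dst is chunked on disk by chunk_length; when read_chunk_length is smaller
# than chunk_length, each on-disk chunk is written several times
# (read-modify-write), which is presumably the "small cost in extra run time".
dst = zarr.zeros((20_000, 1_000), chunks=(5_000, 1_000), dtype="i1")
src = np.zeros((20_000, 1_000), dtype="i1")
copy_in_read_chunks(src, dst, read_chunk_length=500)
```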
I'm trying this for a VCF with 1M samples:

```python
import sgkit.io.vcf

sgkit.io.vcf.vcf_to_zarr(infile, outfile,
    temp_chunk_length=10_000,
    read_chunk_length=100,
    tempdir="tmp/")
```
and it seems to still be working through the first 20 of 2040 tasks.
Yes, that sentence should read "The increase in runtime becomes higher as the ratio of `read_chunk_length` to `temp_chunk_length` gets higher." You're at a ratio of 100, which is much higher than the 2-10 range I've been using with UKB/GEL. What RAM usage do your workers have with 100? Can you go to:
```python
temp_chunk_length=500,
read_chunk_length=2000,
chunk_length=10_000
```
without hitting RAM issues?
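To put rough numbers on the RAM question: a back-of-envelope estimate for just the genotype buffer is `read_chunk_length * n_samples * ploidy` bytes, assuming int8 calls and ploidy 2, and ignoring all the other VCF fields and parsing overhead (so real usage will be higher):

```python
# Back-of-envelope only: int8 genotype calls, ploidy 2, genotypes only;
# other fields and parsing overhead mean real usage is higher than this.
n_samples = 1_000_000
ploidy = 2
for read_chunk_length in (100, 500, 2_000, 10_000):
    approx_gib = read_chunk_length * n_samples * ploidy / 2**30
    print(f"read_chunk_length={read_chunk_length:>6}: ~{approx_gib:.2f} GiB")
```

At 1M samples that is roughly 0.2 GiB per read chunk at 100 variants versus around 19 GiB at 10,000, which is why the read chunk length is the knob that matters for worker memory here.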
Thanks, that seems to be working quite well now.
Oops, I just realised I reversed read and temp in my last comment. `read_chunk_length` should be smaller than `temp_chunk_length`, which should be smaller than the final `chunk_length`. If it's working with read=4000 then you should be able to stick with that and increase the others.
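In case it saves anyone else the same mix-up, a quick sanity check of that ordering before kicking off a long run could look like this (hypothetical helper, not part of sgkit):

```python
def check_chunk_lengths(read_chunk_length, temp_chunk_length, chunk_length):
    # Hypothetical helper, not part of sgkit: enforce read <= temp <= final.
    if not (read_chunk_length <= temp_chunk_length <= chunk_length):
        raise ValueError(
            f"expected read_chunk_length ({read_chunk_length}) <= "
            f"temp_chunk_length ({temp_chunk_length}) <= "
            f"chunk_length ({chunk_length})"
        )

check_chunk_lengths(read_chunk_length=500, temp_chunk_length=2_000, chunk_length=10_000)
```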
It's actually chugging away pretty well, so I'm going to leave it.