sgkit Docs: what is the cost of ``read_chunk_length`` in vcf_to_zarr

The docs for read_chunk_length currently say:

Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by using a value lower than chunk_length with a small cost in extra run time. The increase in runtime becomes higher as the ratio of read_chunk_length to Defaults to None, which means that a value equal to chunk_length is used.

What's "small" here? I also think the sentence beginning "The increase in runtime becomes higher..." is incomplete.

I'm trying this for a VCF with 1M samples:

  import sgkit.io.vcf

  sgkit.io.vcf.vcf_to_zarr(infile, outfile,
            temp_chunk_length=10_000,
            read_chunk_length=100,
            tempdir="tmp/")

and it still seems to be working on the first 20 of 2040 tasks.

Sep 08 '23 08:09 jeromekelleher

Yes, that sentence should be "The increase in runtime becomes higher as the ratio of read_chunk_length to temp_chunk_length gets higher". You're at a ratio of 100, which is much higher than the 2-10 range I've been using with UKB/GEL. What RAM usage do your workers have with 100? Can you go to:

            temp_chunk_length=500,
            read_chunk_length=2000,
            chunk_length=10_000

without hitting RAM issues?

Sep 08 '23 09:09 benjeffery

Thanks, that seems to be working quite well now.

Sep 08 '23 11:09 jeromekelleher

Oops, just realised I reversed read and temp in my last comment. Read should be smaller than temp, which should be smaller than the final chunk. If it is working with read=4000 then you should be able to stick with that and increase the others.
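
For anyone landing here later, a minimal sketch of the ordering rule above (read smaller than temp, temp smaller than final chunk) together with the ratio discussed earlier in the thread. The helper `check_chunk_lengths` is hypothetical and not part of sgkit. It treats equal values as allowed, and it assumes the original call used a final chunk_length of 10_000 (the original post did not pass one) and that the corrected suggestion simply swaps the read and temp values.

```python
def check_chunk_lengths(read_chunk_length, temp_chunk_length, chunk_length):
    """Hypothetical helper (not an sgkit function).

    Checks read_chunk_length <= temp_chunk_length <= chunk_length and
    returns the temp/read ratio, which the thread suggests keeping
    in roughly the 2-10 range.
    """
    if not (0 < read_chunk_length <= temp_chunk_length <= chunk_length):
        raise ValueError(
            "expected 0 < read_chunk_length <= temp_chunk_length <= chunk_length, "
            f"got read={read_chunk_length}, temp={temp_chunk_length}, "
            f"chunk={chunk_length}"
        )
    return temp_chunk_length / read_chunk_length


# Settings from the original post; chunk_length=10_000 is an assumption.
print(check_chunk_lengths(100, 10_000, 10_000))  # 100.0

# The suggested settings with read and temp swapped back.
print(check_chunk_lengths(500, 2_000, 10_000))  # 4.0
```

Passing the values in the order originally posted (read=2000, temp=500) raises a ValueError, which is the mix-up corrected above.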

Sep 08 '23 11:09 benjeffery

It's actually chugging away pretty well, so I'm going to leave it.

Sep 08 '23 12:09 jeromekelleher
