TileDB-VCF

export with -m (merge) option

Open lynnjo opened this issue 1 year ago • 5 comments

Hello -

I am using tiledbvcf to create a dataset that I would later like to export as a merged VCF file. I can successfully load and export data from this dataset. What I would like to do is export to a multi-sample VCF file. It looks like export with the -m option should handle this, but it gives me memory errors. I added the -b flag to increase the memory budget, but still no luck. The command I am running:

tiledbvcf export --uri tiledb_datasets/gvcf_dataset  -m -b 65536 -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf

The error I get:

Exception: SubarrayPartitioner: Trying to partition a unary range because of memory budget, this will cause the query to run very slow. Increase `sm.memory_budget` and `sm.memory_budget_var` through the configuration settings to avoid this issue. To override and run the query with the same budget, set `sm.skip_unary_partitioning_budget_check` to `true`.

Is there another trick to running the tiledbvcf export command to create a merged vcf? Thank you

I am running tiledbvcf version:

(phgv2-conda) [lcj34@cbsubl01 phg_v2]$ tiledbvcf --version
TileDB-VCF version 0f72331-modified
TileDB version 2.16.3
htslib version 1.16

My machine runs Linux, with these specifics:

NAME="Rocky Linux"
VERSION="9.0 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"

lynnjo, Jan 22 '24 13:01

Hi @lynnjo,

Please try adding the following --tiledb-config options to your export command, which will increase sm.memory_budget to 10GiB, sm.memory_budget_var to 20GiB, and skip the memory budget check.

tiledbvcf export \
  --uri tiledb_datasets/gvcf_dataset  \
  -m -b 65536 \
  -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf \
  --tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true
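
For reference, the byte values above follow directly from the GiB figures. A minimal sketch of deriving them in the shell, so the budgets are easier to adjust; the helper gib_to_bytes and the variable names are hypothetical, not part of tiledbvcf:

```shell
# Hypothetical helper: convert GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

budget=$(gib_to_bytes 10)       # sm.memory_budget
budget_var=$(gib_to_bytes 20)   # sm.memory_budget_var

# Assemble the comma-separated value passed to --tiledb-config.
tiledb_config="sm.memory_budget=${budget},sm.memory_budget_var=${budget_var},sm.skip_unary_partitioning_budget_check=true"
echo "$tiledb_config"
```

The resulting string can be passed as --tiledb-config "$tiledb_config" in the command above.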

The export may be slow, as reported by the original error message, because we have not optimized the performance of exporting a merged VCF yet.

gspowley, Jan 22 '24 19:01

Thanks @gspowley - I will try the above.

Do I still keep the "-b 65536" flag while adding the last line you show?

One more question: we note that GATK can export a multi-sample VCF using "gatk GenotypeGVCFs -V gendb://<link to GenomicsDB created via the GenomicsDBImport option>", and that is relatively fast. I know tiledbvcf originated as GenomicsDB. Is the reason this works from GATK that GATK does some of the work to merge the files?

lynnjo, Jan 22 '24 20:01

Yes, keeping the -b 65536 option will improve the export performance, assuming your system has enough memory. The memory budget parameters may need some tuning based on your dataset and system resources.
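
As a rough starting point for that tuning, one option is to derive the -b value from the memory currently available on the machine. This is only a sketch: it assumes -b is expressed in MB, the helper budget_mb_from_kb is hypothetical, and the 50% fraction is an arbitrary choice rather than a TileDB-VCF recommendation.

```shell
# Hypothetical helper: take available memory in kB and return half of it in MB.
# The 50% fraction is an assumption; adjust it for your dataset and workload.
budget_mb_from_kb() {
  echo $(( $1 / 1024 / 2 ))
}

# On Linux, /proc/meminfo reports MemAvailable in kB.
if [ -r /proc/meminfo ]; then
  avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  if [ -n "$avail_kb" ]; then
    echo "Suggested -b value: $(budget_mb_from_kb "$avail_kb")"
  fi
fi
```

For example, a machine reporting 134217728 kB (128 GiB) available would yield 65536, the value used in the command above.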

gspowley, Jan 22 '24 20:01