TileDB-VCF
export with -m (merge) option
Hello -
I am using tiledbvcf to create a dataset that I would later like to export as a merged VCF file. I can successfully load and export data from this dataset. What I would like to do is export to a multi-sample VCF file. It looks like export with the -m option should handle this, but it gives me memory errors. I added the -b flag to increase the memory budget, but still no luck. The command I am running:
tiledbvcf export --uri tiledb_datasets/gvcf_dataset -m -b 65536 -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf
The error I get:
Exception: SubarrayPartitioner: Trying to partition a unary range because of memory budget, this will cause the query to run very slow. Increase `sm.memory_budget` and `sm.memory_budget_var` through the configuration settings to avoid this issue. To override and run the query with the same budget, set `sm.skip_unary_partitioning_budget_check` to `true`.
Is there another trick to running the tiledbvcf export command to create a merged vcf? Thank you
I am running tiledbvcf version:
(phgv2-conda) [lcj34@cbsubl01 phg_v2]$ tiledbvcf --version
TileDB-VCF version 0f72331-modified
TileDB version 2.16.3
htslib version 1.16
My machine runs Linux, with these specifics:
NAME="Rocky Linux"
VERSION="9.0 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"
Hi @lynnjo,
Please try adding the following --tiledb-config options to your export command, which will increase sm.memory_budget to 10GiB, sm.memory_budget_var to 20GiB, and skip the memory budget check.
tiledbvcf export \
--uri tiledb_datasets/gvcf_dataset \
-m -b 65536 \
-o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf \
--tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true
The export may be slow, as reported by the original error message, because we have not optimized the performance of exporting a merged VCF yet.
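For reference, the byte values in the command above can be derived from GiB with shell arithmetic rather than typed by hand. This is only a minimal sketch: the variable names (gib, budget, budget_var, config) are illustrative and not part of tiledbvcf, and the 10 GiB / 20 GiB figures are just the ones suggested above, not tuned recommendations.

```shell
# Convert GiB to bytes with shell arithmetic (1 GiB = 1024^3 bytes).
gib=$((1024 * 1024 * 1024))
budget=$((10 * gib))       # sm.memory_budget: 10 GiB in bytes
budget_var=$((20 * gib))   # sm.memory_budget_var: 20 GiB in bytes

# Assemble the comma-separated value to pass to --tiledb-config.
config="sm.memory_budget=${budget},sm.memory_budget_var=${budget_var},sm.skip_unary_partitioning_budget_check=true"
echo "$config"
```

The resulting string can then be passed as --tiledb-config "$config", which makes it easier to try different budgets while tuning.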
Thanks @gspowley - I will try the above.
Do I still keep the "-b 65536" flag while adding the last line you show?
One more question: We note that GATK can export a multi-sample VCF using "gatk GenotypeGVCFs -V gendb://<link to genomedb created via the GenomicsDBImport option>", and that is relatively fast. I know tiledbvcf originated as genomicsDB. Is the reason this works from GATK that GATK does some of the work to merge the files?
Yes, keeping the -b 65536 option will improve the export performance, assuming your system has enough memory. The memory budget parameters may need some tuning based on your dataset and system resources.