
VG crashed when constructing chromosomes longer than 100MB

YiBenqiu opened this issue 1 year ago • 11 comments

1. What were you trying to do?
I tried to run vg construct on each chromosome separately:

vg construct -C -S -R ${chrs[SLURM_ARRAY_TASK_ID]} -r $ref -v ${split_chrs}/${chrs[SLURM_ARRAY_TASK_ID]}_tmp/convert.${chrs[SLURM_ARRAY_TASK_ID]}_sort.vcf.gz -p -a -t 24 > ${GRAPHS}/${chrs[SLURM_ARRAY_TASK_ID]}.vg

2. What did you want to happen?
Each chromosome should produce a file with a .vg suffix.

3. What actually happened?
When I ran vg construct on the shorter chromosomes there were no errors, but when a chromosome exceeds 100 Mb, the following error occurs:

2023-11-08 23:05:26 begin vg construct chr2
Restricting to chr2 from 1 to end
warning:[vg::Constructor] Skipping duplicate variant with hash 58711d4d3971fdc54706a60aa7754eec8aeb5662 at chr2:56309312
terminate called after throwing an instance of 'std::runtime_error'
  what():  io::MessageEmitter::write: message too large
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.52.0 "Bozen"
Stack trace (most recent call last):
#18   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x616bb4, in _start
#17   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x2065016, in __libc_start_main
#16   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x20637b9, in __libc_start_call_main
#15   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xdf9b7b, in vg::subcommand::Subcommand::operator()(int, char**) const
#14   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xcaa1db, in main_construct(int, char**)
#13   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xf84bee, in vg::Constructor::construct_graph(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::function<void (vg::Graph&)> const&)
#12   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xf83727, in vg::Constructor::construct_graph(std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::vector<vcflib::VariantCallFile*, std::allocator<vcflib::VariantCallFile*> > const&, std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::function<void (vg::Graph&)> const&)
#11   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xf82983, in vg::Constructor::construct_graph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, FastaReference&, vg::VcfBuffer&, std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::function<void (vg::Graph&)> const&)
#10   Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xcab880, in std::_Function_handler<void (vg::Graph&), main_construct(int, char**)::{lambda(vg::Graph&)#1}>::_M_invoke(std::_Any_data const&, vg::Graph&)
#9    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0xb37d70, in vg::io::ProtobufEmitter<vg::Graph>::write_copy(vg::Graph const&)
#8    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x2060ebd, in _Unwind_Resume
#7    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x206033b, in _Unwind_RaiseException_Phase2
#6    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x1f9caf9, in __gxx_personality_v0
#5    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x2037008, in __cxa_call_terminate
#4    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x1f9d3ab, in __cxxabiv1::__terminate(void (*)())
#3    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x5e31c3, in __gnu_cxx::__verbose_terminate_handler() [clone .cold]
#2    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x5e590b, in abort
#1    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x207c1f5, in raise
#0    Object "/BIGDATA2/scau_jyang_1/.conda/envs/vg_env/bin/vg", at 0x20a8bfc, in __pthread_kill
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Place stacktrace here.

5. What data and command can the vg dev team use to make the problem happen?

vg construct -C -S -R ${chrs[SLURM_ARRAY_TASK_ID]} -r $ref -v ${split_chrs}/${chrs[SLURM_ARRAY_TASK_ID]}_tmp/convert.${chrs[SLURM_ARRAY_TASK_ID]}_sort.vcf.gz -p -a -t 24 >  ${GRAPHS}/${chrs[SLURM_ARRAY_TASK_ID]}.vg

6. What does running vg version say?

vg version v1.52.0 "Bozen"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by jeizenga@emerald

YiBenqiu avatar Nov 08 '23 15:11 YiBenqiu

When I divided the SVs in each chromosome's sv.vcf.gz file (e.g. chr2_sv.vcf.gz) into 9 parts and ran vg construct on each part separately, it succeeded!

vg construct -C -S -R chr2 -r $ref -v chr2_sv_array_${array}.vcf.gz -p -a -t 24 >  chr2_array_${array}.vg
VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf chr2_sv.vcf.gz
	--recode-INFO-all
	--recode
	--snps chr2_sv_array_8
	--stdout

Using zlib version: 1.2.7
After filtering, kept 182 out of 182 Individuals
Outputting VCF file...
After filtering, kept 3320 out of a possible 29888 Sites
Run Time = 20.00 seconds
2023-11-09 10:14:48 begin vg construct chr2_array_8
Restricting to chr2 from 1 to end
 building graph for chr2        [=======================]100.0%
2023-11-09 10:15:03 done vg construct chr2_array_8
-rw-r--r-- 1 qyb qyb  95511083 Nov  8 22:45 chr10.vg
-rw-r--r-- 1 qyb qyb  89961205 Nov  8 22:45 chr11.vg
-rw-r--r-- 1 qyb qyb  93974633 Nov  8 22:45 chr12.vg
-rw-r--r-- 1 qyb qyb   2868467 Nov  8 22:46 chr14.vg
-rw-r--r-- 1 qyb qyb   4707842 Nov  8 22:46 chr15.vg
-rw-r--r-- 1 qyb qyb  83434027 Nov  8 22:46 chr16.vg
-rw-r--r-- 1 qyb qyb  77666868 Nov  8 22:46 chr17.vg
-rw-r--r-- 1 qyb qyb  75580734 Nov  8 22:46 chr18.vg
-rw-r--r-- 1 qyb qyb 198326081 Nov  9 10:09 chr2_array_0.vg
-rw-r--r-- 1 qyb qyb 110442282 Nov  9 10:09 chr2_array_1.vg
-rw-r--r-- 1 qyb qyb 136858691 Nov  9 10:10 chr2_array_2.vg
-rw-r--r-- 1 qyb qyb 108568916 Nov  9 10:11 chr2_array_3.vg
-rw-r--r-- 1 qyb qyb 118332590 Nov  9 10:12 chr2_array_4.vg
-rw-r--r-- 1 qyb qyb 120566392 Nov  9 10:13 chr2_array_5.vg
-rw-r--r-- 1 qyb qyb 105551581 Nov  9 10:13 chr2_array_6.vg
-rw-r--r-- 1 qyb qyb 104397436 Nov  9 10:14 chr2_array_7.vg
-rw-r--r-- 1 qyb qyb 100756808 Nov  9 10:15 chr2_array_8.vg
...
...
...

With this method of processing, can I proceed normally with the next steps? For example: vg ids (chr1, chr2_array_0...8, chr3...) and vg index?

YiBenqiu avatar Nov 09 '23 02:11 YiBenqiu

Sorry, this procedure will not produce the outputs you want. Each of the chromosome 2 graphs will contain the full reference sequence along with the variants from its VCF file. Unfortunately, I'm not sure I know what the fix for this issue would be in vg construct. Have you already tried using vg autoindex to construct the graphs?

jeizenga avatar Nov 09 '23 18:11 jeizenga

Thank you for your advice. I have merged the SVs from all chromosomes into a single VCF file and am currently running vg autoindex --workflow giraffe to build the pan-genome:

vg autoindex --workflow giraffe -R XG -r $ref -v ${split_chrs}/convert.final_sv.vcf.gz --tmp-dir ${LARGE_DISK_TMP} -t 10 -p out

The program has been running for 8 hours now and is still displaying 'Chunking VCF(s).', possibly because of the large number of SVs (approximately 400,000):

[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).

YiBenqiu avatar Nov 10 '23 11:11 YiBenqiu

That is often a slow part of the pipeline, especially if you have many samples and a single VCF file. One way to speed it up is to use chromosome-level VCF files instead of a single VCF file. That allows for more parallelism in parsing the data.
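
One way to produce chromosome-level VCFs from a single merged file is bcftools. The sketch below only builds and prints the split commands so they can be inspected before running; the bcftools/tabix invocations and the input file name merged.sv.vcf.gz are assumptions for illustration, not something stated in this thread:

```shell
# Build the split command for one chromosome (prints the command text only;
# "merged.sv.vcf.gz" is a placeholder for your own merged, indexed VCF).
split_cmd() {
  echo "bcftools view -r $1 -Oz -o $1.sv.vcf.gz merged.sv.vcf.gz && tabix -p vcf $1.sv.vcf.gz"
}

# Print the command for each chromosome; pipe to sh (or a job scheduler)
# once the commands look right.
for chr in chr1 chr2 chr3; do
  split_cmd "$chr"
done
```

Note that bcftools view -r requires the input VCF to be bgzipped and tabix-indexed.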

jeizenga avatar Nov 10 '23 18:11 jeizenga

Please forgive my lack of familiarity with vg autoindex. Do you mean I should run autoindex on the sv.vcf.gz of each chromosome? For example, with the following command:

vg autoindex --workflow giraffe -R XG -r $ref -v chr2.sv.vcf.gz --tmp-dir ${LARGE_DISK_TMP} -t 10 -p out

Here $ref is the path to the reference genome, and chr2.sv.vcf.gz is the SV file for chromosome 2. Then, after obtaining the .vg file for each chromosome, should I use vg ids and manual indexing?

YiBenqiu avatar Nov 11 '23 06:11 YiBenqiu

Firstly, I strongly recommend against using the -R option. We do not provide stable behavior for that option, and the results can be unpredictable if you don't know how the internals of vg autoindex are set up, often leading to invalid indexes. It's primarily intended as a developer tool.

If you want to include multiple VCF files, you just supply the -v flag multiple times. The command would be like this:

vg autoindex --workflow giraffe -r $ref -v chr1.sv.vcf.gz -v chr2.sv.vcf.gz  -v chr3.sv.vcf.gz ... --tmp-dir ${LARGE_DISK_TMP} -t 10 -p out
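
With many chromosomes, the repeated -v arguments elided by "..." above can be assembled in a loop. A minimal sketch that builds and prints the full command for inspection (the chromosome names and file-name pattern are placeholders, not part of the thread):

```shell
# Compose the vg autoindex command with one -v per chromosome VCF and
# print it; run the printed command yourself once it looks correct.
build_autoindex_cmd() {
  cmd="vg autoindex --workflow giraffe -r \$ref"
  for chr in "$@"; do
    cmd="${cmd} -v ${chr}.sv.vcf.gz"
  done
  echo "${cmd} --tmp-dir \${LARGE_DISK_TMP} -t 10 -p out"
}

build_autoindex_cmd chr1 chr2 chr3
```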

jeizenga avatar Nov 11 '23 17:11 jeizenga

I am really grateful to you. I will run vg autoindex again following your advice.

YiBenqiu avatar Nov 11 '23 18:11 YiBenqiu

Hello, following your advice, I used vg autoindex to build a graph in GBZ format along with its corresponding indexes, and then performed alignment and SV genotyping with the following commands:

# find snarls (variant sites) in the graph
vg snarls out.giraffe.gbz > out.snarls
# map the paired-end short reads
vg giraffe -p -t 24 -Z out.giraffe.gbz -d out.dist -m out.min -f ${sample}_1_clean.fq.gz -f ${sample}_2_clean.fq.gz > ${sample}.gam
# compute read support from the alignments
vg pack -t 24 -x out.giraffe.gbz -g ${sample}.gam -o ${sample}.pack
# genotype the variants using the read support
vg call -t 24 out.giraffe.gbz -r out.snarls -k ${sample}.pack -s ${sample} -a -A -z > ${sample}.vcf

To construct the pan-genome, I used an SV.vcf.gz file containing 175 individuals and around 470,000 SV entries. However, when I run vg call for SV genotyping on a sample, the resulting VCF for that sample contains only around 200,000 SV entries. I reviewed similar previous questions (#3636) and also tried the -A option mentioned there, but it only recovered an additional 10,000 SVs. I'm not sure whether this is normal. Could you tell me the possible reasons for this?

YiBenqiu avatar Nov 13 '23 11:11 YiBenqiu

The most likely reason is that some SVs overlap in the genome. When that happens, vg call will treat the overlapping SVs as a single locus and genotype them as a single variant with multiple alleles.
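
One way to check whether such merged loci account for the missing records is to count the VCF rows whose ALT column lists more than one comma-separated allele. A small grep/awk sketch (assumes an uncompressed VCF; pipe through zcat first for a .vcf.gz):

```shell
# Count data rows whose ALT column (field 5) holds multiple alleles,
# i.e. loci where vg call collapsed several variants into one record.
count_multiallelic() {
  grep -v '^#' "$1" | awk -F'\t' '$5 ~ /,/ { n++ } END { print n + 0 }'
}
```

Usage, with the sample VCF name from the commands above: count_multiallelic ${sample}.vcf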

jeizenga avatar Nov 13 '23 20:11 jeizenga

I'm very sorry to bother you again. We used the Sniffles software to perform SV calling on over 150 long-read sequencing datasets, producing SV1.vcf.gz. We also used the SVmu software to compare 20 genomes against a reference genome, producing SV2.vcf.gz. Finally, we merged these two SV sets with the SURVIVOR software, and the resulting merged.SV.vcf.gz was the input to vg autoindex. Although you mentioned that overlapping SVs could reduce the per-sample SV count, we noticed that none of the SVs from the genome comparison seem to be present in the sample VCF files (vg call -A -a).

Reviewing other submitted questions (#4144), we found that in the autoindex --workflow giraffe process, once the number of haplotypes reaches 192, vg autoindex downsamples them to 64 synthetic haplotypes. Could this be the issue? When using autoindex, does it only consider the first 192 haplotypes (corresponding to 96 individuals)? If so, could I retain only one individual in the merged.SV.vcf.gz file and change that individual's genotype at every SV to 1/1? I would then use this single-individual VCF for vg autoindex. My aim is for subsequent short-read (second-generation) sequencing samples to yield more structural variants with genotype information when processed with vg call, but I am not sure whether this approach is correct.

YiBenqiu avatar Nov 14 '23 04:11 YiBenqiu

Sorry for the delayed response. Do I understand correctly that all of the variants from SVmu are missing? I doubt that the reason is the haplotype sampling, which should select variants roughly proportionally to their frequency in the input VCFs. Have you also checked the merged VCF to see if they're being lost at that step?

The procedure you're suggesting to include only alt alleles will create a graph that has all of the variants included, but it will not create a meaningful haplotype panel. If you plan to use vg giraffe, then the haplotype panel is also used to map reads.

jeizenga avatar Dec 01 '23 20:12 jeizenga