rpvg icon indicating copy to clipboard operation
rpvg copied to clipboard

Unable to run rpvg analyses without error

Open JoshLangman opened this issue 1 year ago • 2 comments

I was attempting to perform analyses using rpvg on the spliced graph, pantranscriptome, and alignment produced with vg however shortly after running rpvg it crashed. The minimal files and pipeline used to reproduce this error are given below, though the same error occurs when using the full sized files.

error:

rpvg: /home/rpvg/src/fragment_length_dist.cpp:23: FragmentLengthDist::FragmentLengthDist(double, double, double, uint32_t): Assertion `isValid()' failed.
Aborted (core dumped)

gff file genomic_chr8.gff

##gff-version 3
NC_050103.1	RefSeq	region	1	182411202	.	+	.	ID=NC_050103.1:1..182411202;Dbxref=taxon:4577;Name=8;chromosome=8;cultivar=B73;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_050103.1	BestRefSeq	gene	263720	266280	.	+	.	ID=gene-LOC100273199;Dbxref=GeneID:100273199;Name=LOC100273199;description=uncharacterized LOC100273199;gbkey=Gene;gene=LOC100273199;gene_biotype=protein_coding;gene_synonym=cl25415_1a,GRMZM6G040576
NC_050103.1	BestRefSeq	mRNA	263720	266280	.	+	.	ID=rna-NM_001147643.1;Parent=gene-LOC100273199;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;Name=NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	263720	264201	.	+	.	ID=exon-NM_001147643.1-1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	264910	265104	.	+	.	ID=exon-NM_001147643.1-2;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	265204	265439	.	+	.	ID=exon-NM_001147643.1-3;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	exon	265540	266280	.	+	.	ID=exon-NM_001147643.1-4;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1	BestRefSeq	CDS	264022	264201	.	+	0	ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1	BestRefSeq	CDS	264910	265104	.	+	0	ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1	BestRefSeq	CDS	265204	265439	.	+	0	ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1

vcf file chr8_3lines_renamed.vcf:

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##contig=<ID=NC_050103.1>
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Mo17	P39	B97
NC_050103.1	363813	8-17988	C	G	.	PASS	DP=0;AC=2;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364616	8-18791	C	T	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0
NC_050103.1	364621	8-18796	C	G	.	PASS	DP=0;AC=4;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364622	8-18797	T	C	.	PASS	DP=0;AC=4;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364624	8-18799	A	C	.	PASS	DP=0;AC=4;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364629	8-18804	C	T	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0
NC_050103.1	364734	8-18909	C	G	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0
NC_050103.1	364782	8-18957	C	T	.	PASS	DP=0;AC=2;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364901	8-19076	G	T	.	PASS	DP=0;AC=2;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	1/1:0,0:0:33:0,0,0
NC_050103.1	364944	8-19119	T	A	.	PASS	DP=0;AC=0;AN=6	GT:AD:DP:GQ:PL	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0	0/0:0,0:0:33:0,0,0

fastq file SRR5911103.4reads.fastq:

@SRR5911103.1 1 length=90
CAAATTTGCATGGNTATCTGTTATTCCTTTTTANGNGTAANGNCTTNNAANANAATGTAATNNNANNNNNAAANNNNANNNNNNNNNNNN
+SRR5911103.1 1 length=90
AAAAAEEEEEEEE#EEEEEEEEEEEEEEEEEEE#E#EEEE#E#EEE##/E#/#AAEEEEEE###E#####6EA####/############
@SRR5911103.2 2 length=90
TCATATCGGTAGGTTGTGGTATTTCATTGCTACAAACATGGGTTATNGTANAATAAGACATGNNANNNNGATACNNNTCNNNNNNNNNNA
+SRR5911103.2 2 length=90
6AAAAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE/E#EEA#AEEEEE/EEEE##E####EEEEE###EA##########/
@SRR5911103.3 3 length=90
ATTGATGCTGTGAGATGCATGTGTGTCTTTTGTTTCACGTTGCATTNCATAGGCAAGTCGAGATGNNNNGTTGGNNNTGTNCNNNANNNT
+SRR5911103.3 3 length=90
AAAAAEEAE6EEAEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEE#EAE/EEEEEEEEEEE6EE####EEEEE###EAE#A###/###E
@SRR5911103.4 4 length=90
TTGGGGTTTGGAGTAGTTCTCATATAGATCTTTATATTCAGCCCCTNTGGGAAGATACATTTCACNNNNAAATCNNNTGANTNNNANNNT
+SRR5911103.4 4 length=90
AAAAAEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEEEEEEEAEEEE#EEEEEEEAEEEEEEEEAE####EEAEE###EEE#E###A###/

yaml file environment.yml for the conda environment:

name: pantranscriptome
channels:
  - conda-forge
  - bioconda
dependencies:
  - vg =1.49
  - cmake >=3.10
  - protobuf =3
  - htslib =1.17
  - jansson =2.14
  - llvm-openmp
  - bcftools =1.17
  - sra-tools =3
  - samtools =1.17

command line pipeline:

$ bgzip chr8_3lines_renamed.vcf
$ tabix chr8_3lines_renamed.vcf.gz
$ curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_902167145.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_902167145.1.zip" -H "Accept: application/zip"
$ unzip GCF_902167145.1.zip
$ mv ncbi_dataset/data/GCF_902167145.1/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna .
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna NC_050103.1 > genomic_chr8.fa
$ samtools faidx genomic_chr8.fa
$ vg construct -p -r genomic_chr8.fa -v chr8_3lines_renamed.vcf.gz > 282_rpvg.vg
$ vg convert -p 282_rpvg.vg > 282_rpvg.pg
$ vg rna -p -s Parent -t 12 -n genomic_chr8.gff -b pantranscriptome.gbwt -i pantranscriptome.txt 282_rpvg.pg > 282_rpvg.spliced.pg
$ vg convert --xg-out 282_rpvg.spliced.pg > 282_rpvg.spliced.xg
$ vg index --threads 4 --dist-name 282_rpvg.dist 282_rpvg.spliced.xg
$ vg prune --progress --threads 4 --unfold-paths --gbwt-name pantranscriptome.gbwt  --append-mapping --mapping mapping 282_rpvg.spliced.pg > 282_rpvg.spliced.pruned.vg
$ vg index --progress --threads 4 --gcsa-out 282_rpvg.gcsa --mapping mapping 282_rpvg.spliced.pruned.vg 
$ vg mpmap --nt-type RNA --threads 4 --graph-name 282_rpvg.spliced.xg --gcsa-name 282_rpvg.gcsa --dist-name 282_rpvg.dist --fastq SRR5911103.4reads.fastq > SRR5911103.gamp
$ vg gbwt --num-threads 4 --r-index pantranscriptome.gbwt.ri pantranscriptome.gbwt
$ wget https://github.com/jonassibbesen/rpvg/releases/download/v1.0/rpvg-v1.0_linux.tar.gz
$ tar zxvf rpvg-v1.0_linux.tar.gz
$ rpvg-v1.0_linux/bin/rpvg --threads 8 --single-end --frag-mean 90 --frag-sd 0 --graph 282_rpvg.spliced.xg --paths pantranscriptome.gbwt -f pantranscriptome.txt --alignments Mo17/SRR5911103.gamp -o rpvg_SRR5911103 --inference-model haplotype-transcripts

Your help in identifying the cause of this error would be greatly appreciated!

JoshLangman avatar Jul 21 '23 13:07 JoshLangman

Hi, thank you for writing. The standard deviation for the fragment length distribution (--frag-sd) needs to be above 0. Changing this should resolve the error. Please let me know if you run into any other issues.

jonassibbesen avatar Jul 23 '23 18:07 jonassibbesen

Hi @jonassibbesen, I'm working with @JoshLangman on this; thank you for the quick response.

In this example we are trying to quantify a single-end 3' RNA-seq library that has 90bp reads (no variance in read length). We have no information regarding the true fragment distribution of the library, unfortunately.

What would you recommend in this case? Perhaps the -l option that removes the effective length normalization is the right path, though I'm not sure if that assumes long reads instead of the short reads we have.

twrightsman avatar Jul 24 '23 19:07 twrightsman