rpvg
rpvg copied to clipboard
Unable to run rpvg analyses without error
I was attempting to perform analyses using rpvg
on the spliced graph, pantranscriptome, and alignment produced with vg
however shortly after running rpvg
it crashed. The minimal files and pipeline used to reproduce this error are given below, though the same error occurs when using the full sized files.
error:
rpvg: /home/rpvg/src/fragment_length_dist.cpp:23: FragmentLengthDist::FragmentLengthDist(double, double, double, uint32_t): Assertion `isValid()' failed.
Aborted (core dumped)
gff file genomic_chr8.gff
##gff-version 3
NC_050103.1 RefSeq region 1 182411202 . + . ID=NC_050103.1:1..182411202;Dbxref=taxon:4577;Name=8;chromosome=8;cultivar=B73;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_050103.1 BestRefSeq gene 263720 266280 . + . ID=gene-LOC100273199;Dbxref=GeneID:100273199;Name=LOC100273199;description=uncharacterized LOC100273199;gbkey=Gene;gene=LOC100273199;gene_biotype=protein_coding;gene_synonym=cl25415_1a,GRMZM6G040576
NC_050103.1 BestRefSeq mRNA 263720 266280 . + . ID=rna-NM_001147643.1;Parent=gene-LOC100273199;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;Name=NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq exon 263720 264201 . + . ID=exon-NM_001147643.1-1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq exon 264910 265104 . + . ID=exon-NM_001147643.1-2;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq exon 265204 265439 . + . ID=exon-NM_001147643.1-3;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq exon 265540 266280 . + . ID=exon-NM_001147643.1-4;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq CDS 264022 264201 . + 0 ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1 BestRefSeq CDS 264910 265104 . + 0 ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1 BestRefSeq CDS 265204 265439 . + 0 ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
vcf file chr8_3lines_renamed.vcf
:
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##contig=<ID=NC_050103.1>
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Mo17 P39 B97
NC_050103.1 363813 8-17988 C G . PASS DP=0;AC=2;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0
NC_050103.1 364616 8-18791 C T . PASS DP=0;AC=0;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0
NC_050103.1 364621 8-18796 C G . PASS DP=0;AC=4;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0
NC_050103.1 364622 8-18797 T C . PASS DP=0;AC=4;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0
NC_050103.1 364624 8-18799 A C . PASS DP=0;AC=4;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0
NC_050103.1 364629 8-18804 C T . PASS DP=0;AC=0;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0
NC_050103.1 364734 8-18909 C G . PASS DP=0;AC=0;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0
NC_050103.1 364782 8-18957 C T . PASS DP=0;AC=2;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0
NC_050103.1 364901 8-19076 G T . PASS DP=0;AC=2;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 1/1:0,0:0:33:0,0,0
NC_050103.1 364944 8-19119 T A . PASS DP=0;AC=0;AN=6 GT:AD:DP:GQ:PL 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0 0/0:0,0:0:33:0,0,0
fastq file SRR5911103.4reads.fastq
:
@SRR5911103.1 1 length=90
CAAATTTGCATGGNTATCTGTTATTCCTTTTTANGNGTAANGNCTTNNAANANAATGTAATNNNANNNNNAAANNNNANNNNNNNNNNNN
+SRR5911103.1 1 length=90
AAAAAEEEEEEEE#EEEEEEEEEEEEEEEEEEE#E#EEEE#E#EEE##/E#/#AAEEEEEE###E#####6EA####/############
@SRR5911103.2 2 length=90
TCATATCGGTAGGTTGTGGTATTTCATTGCTACAAACATGGGTTATNGTANAATAAGACATGNNANNNNGATACNNNTCNNNNNNNNNNA
+SRR5911103.2 2 length=90
6AAAAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE/E#EEA#AEEEEE/EEEE##E####EEEEE###EA##########/
@SRR5911103.3 3 length=90
ATTGATGCTGTGAGATGCATGTGTGTCTTTTGTTTCACGTTGCATTNCATAGGCAAGTCGAGATGNNNNGTTGGNNNTGTNCNNNANNNT
+SRR5911103.3 3 length=90
AAAAAEEAE6EEAEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEE#EAE/EEEEEEEEEEE6EE####EEEEE###EAE#A###/###E
@SRR5911103.4 4 length=90
TTGGGGTTTGGAGTAGTTCTCATATAGATCTTTATATTCAGCCCCTNTGGGAAGATACATTTCACNNNNAAATCNNNTGANTNNNANNNT
+SRR5911103.4 4 length=90
AAAAAEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEEEEEEEAEEEE#EEEEEEEAEEEEEEEEAE####EEAEE###EEE#E###A###/
yaml file environment.yml
for the conda environment:
name: pantranscriptome
channels:
- conda-forge
- bioconda
dependencies:
- vg =1.49
- cmake >=3.10
- protobuf =3
- htslib =1.17
- jansson =2.14
- llvm-openmp
- bcftools =1.17
- sra-tools =3
- samtools =1.17
command line pipeline:
$ bgzip chr8_3lines_renamed.vcf
$ tabix chr8_3lines_renamed.vcf.gz
$ curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_902167145.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_902167145.1.zip" -H "Accept: application/zip"
$ unzip GCF_902167145.1.zip
$ mv ncbi_dataset/data/GCF_902167145.1/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna .
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna NC_050103.1 > genomic_chr8.fa
$ samtools faidx genomic_chr8.fa
$ vg construct -p -r genomic_chr8.fa -v chr8_3lines_renamed.vcf.gz > 282_rpvg.vg
$ vg convert -p 282_rpvg.vg > 282_rpvg.pg
$ vg rna -p -s Parent -t 12 -n genomic_chr8.gff -b pantranscriptome.gbwt -i pantranscriptome.txt 282_rpvg.pg > 282_rpvg.spliced.pg
$ vg convert --xg-out 282_rpvg.spliced.pg > 282_rpvg.spliced.xg
$ vg index --threads 4 --dist-name 282_rpvg.dist 282_rpvg.spliced.xg
$ vg prune --progress --threads 4 --unfold-paths --gbwt-name pantranscriptome.gbwt --append-mapping --mapping mapping 282_rpvg.spliced.pg > 282_rpvg.spliced.pruned.vg
$ vg index --progress --threads 4 --gcsa-out 282_rpvg.gcsa --mapping mapping 282_rpvg.spliced.pruned.vg
$ vg mpmap --nt-type RNA --threads 4 --graph-name 282_rpvg.spliced.xg --gcsa-name 282_rpvg.gcsa --dist-name 282_rpvg.dist --fastq SRR5911103.4reads.fastq > SRR5911103.gamp
$ vg gbwt --num-threads 4 --r-index pantranscriptome.gbwt.ri pantranscriptome.gbwt
$ wget https://github.com/jonassibbesen/rpvg/releases/download/v1.0/rpvg-v1.0_linux.tar.gz
$ tar zxvf rpvg-v1.0_linux.tar.gz
$ rpvg-v1.0_linux/bin/rpvg --threads 8 --single-end --frag-mean 90 --frag-sd 0 --graph 282_rpvg.spliced.xg --paths pantranscriptome.gbwt -f pantranscriptome.txt --alignments Mo17/SRR5911103.gamp -o rpvg_SRR5911103 --inference-model haplotype-transcripts
Your help in identifying the cause of this error would be greatly appreciated!
Hi, thank you for writing. The standard deviation for the fragment length distribution (--frag-sd
) needs to be above 0. Changing this should resolve the error. Please let me know if you run into any other issues.
Hi @jonassibbesen, I'm working with @JoshLangman on this; thank you for the quick response.
In this example we are trying to quantify a single-end 3' RNA-seq library that has 90bp reads (no variance in read length). We have no information regarding the true fragment distribution of the library, unfortunately.
What would you recommend in this case? Perhaps the -l
option that removes the effective length normalization is the right path, though I'm not sure if that assumes long reads instead of the short reads we have.