kallisto icon indicating copy to clipboard operation
kallisto copied to clipboard

Invalid segment Error when building reference of the rat genome

Open kim-fehl opened this issue 2 years ago • 3 comments

I'm using kb ref command from kallisto-bustools package and RefSeq genome mRatBN7.2 of Rattus norvegicus. Here is the command:

kb ref -i mRATBN7.cdna.idx -g mRATBN7.t2g -f1 mRATBN7.cdna.fna ~/genomes/GCF_015227675.2_mRatBN7.2_genomic.fna ~/genomes/GCF_015227675.2_mRatBN7.2_genomic.gtf

The command fails with the following error message:

ngs_tools.gtf.Segment.SegmentError: Invalid segment [183055283:183055283)

To make it work, I have to eliminate lines with error-causing coordinates from the gtf file like this:

kb ref -i mRATBN7.cdna.idx -g mRATBN7.t2g -f1 mRATBN7.cdna.fna ~/genomes/GCF_015227675.2_mRatBN7.2_genomic.fna <(grep -vE "183055283|128787877|145599560|105962786|24017317|23500941" ~/genomes/GCF_015227675.2_mRatBN7.2_genomic.gtf)

Probably these coordinates can help you to investigate the issue. Thank you!

kim-fehl avatar Feb 05 '22 19:02 kim-fehl

The issue is 183055283:183055283 is an invalid coordinate (it has length 0). Such a coordinate should not exist in a properly formatted GTF file. It seems to me like it's a GTF formatting issue rather than a kb ref issue. When you grep those coordinates, what do those problematic lines in the GTF file look like?

Yenaled avatar Feb 05 '22 20:02 Yenaled

Here are the results of the grep with two flanking lines: grep -A1 -B1 -E "183055283|128787877|145599560|105962786|24017317|23500941" ~/genomes/GCF_015227675.2_mRatBN7.2_genomic.gtf

NC_051337.1     BestRefSeq      exon    183053722       183053888       .       +       .       gene_id "Arnt"; transcript_id "NM_012780.2"; db_xref "GeneID:25242"; exception "annotated by transcript or proteomic data"; gene "Arnt"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_012780.2"; note "The RefSeq transcript has 1 substitution and aligns at 90% coverage compared to this genomic sequence"; partial "true"; product "aryl hydrocarbon receptor nuclear translocator, transcript variant 1"; transcript_biotype "mRNA"; exon_number "21"; 
NC_051337.1     BestRefSeq      exon    183054637       183055283       .       +       .       gene_id "Arnt"; transcript_id "NM_012780.2"; db_xref "GeneID:25242"; exception "annotated by transcript or proteomic data"; gene "Arnt"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_012780.2"; note "The RefSeq transcript has 1 substitution and aligns at 90% coverage compared to this genomic sequence"; partial "true"; product "aryl hydrocarbon receptor nuclear translocator, transcript variant 1"; transcript_biotype "mRNA"; exon_number "22"; 
NC_051337.1     BestRefSeq      exon    183055284       183056584       .       +       .       gene_id "Arnt"; transcript_id "NM_012780.2"; db_xref "GeneID:25242"; exception "annotated by transcript or proteomic data"; gene "Arnt"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_012780.2"; note "The RefSeq transcript has 1 substitution and aligns at 90% coverage compared to this genomic sequence"; partial "true"; product "aryl hydrocarbon receptor nuclear translocator, transcript variant 1"; transcript_biotype "mRNA"; exon_number "23"; 
--
NC_051337.1     BestRefSeq      exon    183053722       183053888       .       +       .       gene_id "Arnt"; transcript_id "NM_001389231.1"; db_xref "GeneID:25242"; exception "annotated by transcript or proteomic data"; gene "Arnt"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001389231.1"; note "The RefSeq transcript has 1 substitution and aligns at 90% coverage compared to this genomic sequence"; partial "true"; product "aryl hydrocarbon receptor nuclear translocator, transcript variant 2"; transcript_biotype "mRNA"; exon_number "20"; 
NC_051337.1     BestRefSeq      exon    183054637       183055283       .       +       .       gene_id "Arnt"; transcript_id "NM_001389231.1"; db_xref "GeneID:25242"; exception "annotated by transcript or proteomic data"; gene "Arnt"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001389231.1"; note "The RefSeq transcript has 1 substitution and aligns at 90% coverage compared to this genomic sequence"; partial "true"; product "aryl hydrocarbon receptor nuclear translocator, transcript variant 2"; transcript_biotype "mRNA"; exon_number "21"; 
NC_051337.1     BestRefSeq      exon    183055284       183056584       .       +       .       gene_id "Arnt"; transcript_id "NM_001389231.1"; db_xref "GeneID:25242"; exception "annotated by transcript or proteomic data"; gene "Arnt"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001389231.1"; note "The RefSeq transcript has 1 substitution and aligns at 90% coverage compared to this genomic sequence"; partial "true"; product "aryl hydrocarbon receptor nuclear translocator, transcript variant 2"; transcript_biotype "mRNA"; exon_number "22"; 
--
NC_051339.1     BestRefSeq      exon    128787878       128789296       .       -       .       gene_id "LOC500265"; transcript_id "NR_144449.1"; db_xref "GeneID:500265"; exception "annotated by transcript or proteomic data"; gene "LOC500265"; inference "similar to RNA sequence (same species):RefSeq:NR_144449.1"; note "The RefSeq transcript aligns at 88% coverage compared to this genomic sequence"; partial "true"; product "nucleoporin 50-like"; pseudo "true"; transcript_biotype "transcript"; exon_number "1"; 
NC_051339.1     BestRefSeq      exon    128787567       128787877       .       -       .       gene_id "LOC500265"; transcript_id "NR_144449.1"; db_xref "GeneID:500265"; exception "annotated by transcript or proteomic data"; gene "LOC500265"; inference "similar to RNA sequence (same species):RefSeq:NR_144449.1"; note "The RefSeq transcript aligns at 88% coverage compared to this genomic sequence"; partial "true"; product "nucleoporin 50-like"; pseudo "true"; transcript_biotype "transcript"; exon_number "2"; 
NC_051339.1     Gnomon  gene    129086996       129146558       .       -       .       gene_id "LOC120102500"; transcript_id ""; db_xref "GeneID:120102500"; gbkey "Gene"; gene "LOC120102500"; gene_biotype "lncRNA"; 
--
NC_051339.1     BestRefSeq      exon    145599561       145602098       .       -       .       gene_id "Oxtr"; transcript_id "NM_012871.3"; db_xref "GeneID:25342"; exception "annotated by transcript or proteomic data"; gene "Oxtr"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_012871.3"; note "The RefSeq transcript aligns at 85% coverage compared to this genomic sequence"; partial "true"; product "oxytocin receptor"; transcript_biotype "mRNA"; exon_number "2"; 
NC_051339.1     BestRefSeq      exon    145598549       145599560       .       -       .       gene_id "Oxtr"; transcript_id "NM_012871.3"; db_xref "GeneID:25342"; exception "annotated by transcript or proteomic data"; gene "Oxtr"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_012871.3"; note "The RefSeq transcript aligns at 85% coverage compared to this genomic sequence"; partial "true"; product "oxytocin receptor"; transcript_biotype "mRNA"; exon_number "3"; 
NC_051339.1     BestRefSeq      CDS     145613641       145614559       .       -       0       gene_id "Oxtr"; transcript_id "NM_012871.3"; db_xref "GeneID:25342"; gbkey "CDS"; gene "Oxtr"; product "oxytocin receptor"; protein_id "NP_037003.2"; exon_number "1"; 
--
NC_051340.1     BestRefSeq      exon    105962787       105963672       .       -       .       gene_id "Elavl2"; transcript_id "NM_001302217.1"; db_xref "GeneID:286973"; exception "annotated by transcript or proteomic data"; gene "Elavl2"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001302217.1"; note "The RefSeq transcript aligns at 88% coverage compared to this genomic sequence"; partial "true"; product "ELAV like RNA binding protein 2"; transcript_biotype "mRNA"; exon_number "8"; 
NC_051340.1     BestRefSeq      exon    105960872       105962786       .       -       .       gene_id "Elavl2"; transcript_id "NM_001302217.1"; db_xref "GeneID:286973"; exception "annotated by transcript or proteomic data"; gene "Elavl2"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001302217.1"; note "The RefSeq transcript aligns at 88% coverage compared to this genomic sequence"; partial "true"; product "ELAV like RNA binding protein 2"; transcript_biotype "mRNA"; exon_number "9"; 
NC_051340.1     BestRefSeq      CDS     106024499       106024570       .       -       0       gene_id "Elavl2"; transcript_id "NM_001302217.1"; db_xref "GeneID:286973"; gbkey "CDS"; gene "Elavl2"; product "ELAV-like protein 2"; protein_id "NP_001289146.1"; exon_number "2"; 
--
NC_051349.1     BestRefSeq      exon    24007508        24007663        .       +       .       gene_id "Epha5"; transcript_id "NM_001169137.2"; db_xref "GeneID:79208"; exception "annotated by transcript or proteomic data"; gene "Epha5"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001169137.2"; note "The RefSeq transcript aligns at 95% coverage compared to this genomic sequence"; partial "true"; product "EPH receptor A5, transcript variant 1"; transcript_biotype "mRNA"; exon_number "17"; 
NC_051349.1     BestRefSeq      exon    24015801        24017317        .       +       .       gene_id "Epha5"; transcript_id "NM_001169137.2"; db_xref "GeneID:79208"; exception "annotated by transcript or proteomic data"; gene "Epha5"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001169137.2"; note "The RefSeq transcript aligns at 95% coverage compared to this genomic sequence"; partial "true"; product "EPH receptor A5, transcript variant 1"; transcript_biotype "mRNA"; exon_number "18"; 
NC_051349.1     BestRefSeq      exon    24017318        24020135        .       +       .       gene_id "Epha5"; transcript_id "NM_001169137.2"; db_xref "GeneID:79208"; exception "annotated by transcript or proteomic data"; gene "Epha5"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001169137.2"; note "The RefSeq transcript aligns at 95% coverage compared to this genomic sequence"; partial "true"; product "EPH receptor A5, transcript variant 1"; transcript_biotype "mRNA"; exon_number "19"; 
--
NC_051352.1     BestRefSeq      exon    23500942        23501702        .       -       .       gene_id "Smim13"; transcript_id "NM_001135576.2"; db_xref "GeneID:690806"; exception "annotated by transcript or proteomic data"; gene "Smim13"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001135576.2"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel and aligns at 95% coverage compared to this genomic sequence"; partial "true"; product "small integral membrane protein 13"; transcript_biotype "mRNA"; exon_number "2"; 
NC_051352.1     BestRefSeq      exon    23497662        23500941        .       -       .       gene_id "Smim13"; transcript_id "NM_001135576.2"; db_xref "GeneID:690806"; exception "annotated by transcript or proteomic data"; gene "Smim13"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001135576.2"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel and aligns at 95% coverage compared to this genomic sequence"; partial "true"; product "small integral membrane protein 13"; transcript_biotype "mRNA"; exon_number "3"; 
NC_051352.1     BestRefSeq      CDS     23515512        23515587        .       -       0       gene_id "Smim13"; transcript_id "NM_001135576.2"; db_xref "GeneID:690806"; gbkey "CDS"; gene "Smim13"; product "small integral membrane protein 13"; protein_id "NP_001129048.1"; exon_number "1";

It seems that in all cases these are two neighbour exons without an intron in between. Maybe the script should merge them in such cases?

kim-fehl avatar Feb 05 '22 20:02 kim-fehl

Hi, @kim-fehl, Would you mind trying again with the development version of ngs-tools? You can install it by running

pip install git+https://github.com/Lioscro/ngs-tools@devel

and then running kb ref again.

Lioscro avatar Feb 06 '22 20:02 Lioscro