sga icon indicating copy to clipboard operation
sga copied to clipboard

scaffold2fasta: Assertion `expectedOverlap < (int)contigString.length()' failed

Open sjackman opened this issue 9 years ago • 18 comments

Any suggestions?

sga scaffold2fasta -m 500 --write-unplaced --use-overlap -o hsapiens-scaffolds.fa -a hsapiens-graph.asqg.gz hsapiens.scaf
Graph best: 0
Graph unique: 1
Find overlaps: 4
Deleted edges for 0 super repetitive vertices
Warning: removed 2900 duplicate edges
sga: ScaffoldRecord.cpp:317: bool ScaffoldRecord::introduceGap(int, const string&, const ScaffoldLink&, std::__cxx11::string&) const: Assertion `expectedOverlap < (int)contigString.length()' failed.
Vertices: 6703921 Edges: 28309816 Islands: 136547 Tips: 625936 Monobranch: 3553099 Dibranch: 1249436 Simple: 5058208

sjackman avatar Jul 30 '16 00:07 sjackman

I haven't seen that one before - did you use the -mean setting for DistanceEst?

jts avatar Jul 30 '16 02:07 jts

I did at first, and I had suspected that's what had gone wrong. But, no, I repeated the estimates with --mle and got the same error.

sjackman avatar Jul 30 '16 03:07 sjackman

I'm at a bit of a loss - could you send the .scaf file and .asqg.gz (or contig fasta, if the error appears with that input too). I won't be able to look until next week though.

jts avatar Jul 30 '16 18:07 jts

Here you go:

  • ftp://ftp.bcgsc.ca/public/sjackman/hsapiens-graph.asqg.gz
  • ftp://ftp.bcgsc.ca/public/sjackman/hsapiens.scaf
sga scaffold2fasta -m 500 --write-unplaced --use-overlap -o hsapiens-scaffolds.fa -a hsapiens-graph.asqg.gz hsapiens.scaf

sjackman avatar Jul 31 '16 16:07 sjackman

Any chance you've had some time to look into this issue, Jared?

sjackman avatar Oct 27 '16 20:10 sjackman

Sorry, no. My time is short these days I'm afraid. Is this blocking you from something?

jts avatar Oct 28 '16 13:10 jts

No worries. I am sympathetic to time constraints. We wanted to include SGA in an assembly comparison of GIAB HG004. SGA will be included in the comparison of contigs. Including scaffold results hinges on this issue.

sjackman avatar Oct 28 '16 15:10 sjackman

What's the format of the .scaf file? Perhaps I could use MergeContigs from ABySS as a hack work around to create the scaffolds FASTA file.

contig-6181580  contig-4954829,228,151.2,0,0,D  contig-197051,2178,122.2,0,1,D

sjackman avatar Nov 01 '16 21:11 sjackman

The format is:

std::ostream& operator<<(std::ostream& out, const ScaffoldLink& link)
{
    out << link.endpointID << "," << link.distance << "," << link.stdDev << ","
        << link.edgeData.getDir() << "," << link.edgeData.getComp() << "," << link.getTypeCode();
    return out;
}

https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldLink.cpp#L118

sjackman avatar Nov 01 '16 21:11 sjackman

Could I convert this to an ABySS path file like so, or is it not quite so simple? How do I convert the Dir and Comp components to a + / - orientation?

1    contig-6181580+ 100N contig-4954829+ 100N contig-197051-

sjackman avatar Nov 01 '16 21:11 sjackman

Is the .scaf file the final list of which contigs go in which scaffolds with one scaffold per line, or is it a list of distance estimates that still need to be merged into larger scaffolds?

sjackman avatar Nov 02 '16 00:11 sjackman

The former - it is the final list of contigs that go into scaffolds.

jts avatar Nov 02 '16 00:11 jts

Can you give me a brief description of how to convert a SGA scaf file to an ABySS path file? If you are able to convert the following .scaf record manually, I can figure it out and write the converter.

contig-1297628  contig-1106169,-24,25.3,0,1,D   contig-4496238,-47,17.3,1,1,D   contig-2839914,-92,242.9,0,1,D  contig-1247260,-62,97.3,1,1,D   contig-2437944,3840,177,0,0,D   contig-368915,112,186.4,0,1,D   contig-4976804,-306,22.8,1,0,D  contig-3455841,371,216.8,1,1,D  contig-3969805,201,94.5,0,0,D   contig-742829,-92,135.6,0,1,D   contig-3111798,11,28.9,1,1,D    contig-1750901,-79,27.5,0,0,D

Uses cases like this one are why I so badly want to a GFA standard! It would be great if scaffold2fasta and MergeContigs accepted the same file formats and were interchangeable. Please do comment on the proposed GFA 2 spec if you find the time. The format of the path record is one of the file items to nail down. https://github.com/GFA-spec/GFA-spec/pull/48

sjackman avatar Nov 02 '16 16:11 sjackman

Hi Shaun,

Agreed about GFA2.

I describe the file format here:

https://groups.google.com/forum/#!msg/sga-users/0pqCJwpPe8A/RRvCVPKU25QJ;context-place=forum/sga-users

The orientation bit describes whether the scaffold should be built left-to-right (the first contig in the record is the first contig in the scaffold) or right-to-left (the first contig is the last record). See here:

https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldRecord.cpp#L45

I suggest you don't spend much time on this - SGA is no longer under active development and there are better short read assemblers out there.

Sorry I can't help more!

Jared

jts avatar Nov 02 '16 17:11 jts

No worries, Jared. Quick comments are helpful. We're including SGA in this comparison to ABySS 2.0 due to its low memory usage. We're also comparing to DISCOVARdenovo, and it yields great contiguity, but takes a lot of memory and time.

From the code it appears that only the direction bit of the first link matters, and the rest are ignored. Is that correct? https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldRecord.cpp#L60

sjackman avatar Nov 02 '16 17:11 sjackman

Yes I believe that is correct but verify :)

On Wed, Nov 2, 2016 at 1:48 PM, Shaun Jackman [email protected] wrote:

No worries, Jared. Quick comments are helpful. We're including SGA in this comparison due to its low memory usage.

From the code it appears that only the direction bit of the first link matters, and the rest are ignored. Is that correct? https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldRecord.cpp#L60

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jts/sga/issues/125#issuecomment-257943999, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXxn0Uo125sAzXYLlyxrrUIKqEG5onfks5q6MzWgaJpZM4JYrqM .

jts avatar Nov 02 '16 17:11 jts

Here's a hacktastic script to convert a SGA .scaf file to an ABySS .path file. Thanks for your help, Jared.

#!/bin/sh
set -eu
exec gawk -e '
        !/\t/ { next }
        {
            # Determine whether the contigs are in reverse order.
            reverse = /^[^\t]*\t[^\t]*,1,[01],D/
            # Orient the contigs.
            $1 = $1 "+"
            rc = 0
            for (i = 2; i <= NF; ++i) {
                if ($i ~ /,1,D/)
                    rc = !rc
                sub(/,.*/, rc ? "-" : "+", $i)
            }
            # Print the contig IDs and orientations.
            printf "%u", id++
            if (reverse) {
                for (i = NF; i > 0; --i)
                    printf " %s", $i
                printf "\n"
            } else {
                print " " $0
            }
        }' "$@" \
    | gsed -e 's/ /\t/;s/ / 199N /g'

sjackman avatar Nov 02 '16 22:11 sjackman

Great, thanks for the script.

jts avatar Nov 03 '16 22:11 jts