ray icon indicating copy to clipboard operation
ray copied to clipboard

Cannot use contigs from another assembler

Open yaximik opened this issue 11 years ago • 10 comments

Contigs were created by minia with max contig length 16091 nt. When these contigs (fasta) were used as input to Ray 2.1.0 along with other PE and SE reads (fastq), Ray exited with segmentation error. When minia contigs were not included, the same data set was assembled successfully with max contig/scaffold length of 46428 nt. I am not sure it is a bug or not, but think it would be helpful if contigs from other assemblers could be included. Here is last output to stdout:

........ Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...) [Loader::load] File: /home/yaximik/AssRefMap/SC/minia/SCMiSeqAllFGMGPGIGclean_k27.contigs.fasta (please wait...) [G5NNJN1:07040] *** Process received signal *** [G5NNJN1:07040] Signal: Segmentation fault (11) [G5NNJN1:07040] Signal code: (128)

[G5NNJN1:07040] Failing at address: (nil)

mpiexec noticed that process rank 0 with PID 7040 on node G5NNJN1 exited on signal 11 (Segmentation fault).

yaximik avatar Mar 12 '13 15:03 yaximik

The problem is due to a lack of support for multi-line fasta files.

=> code/plugin_SequencesLoader/FastaLoader.cpp

sebhtml avatar Mar 14 '13 13:03 sebhtml

Does this show where problem is (the code segment) or points to a replacement file with the fix? If the problem is in this code segment, I presume it can be circumvented by eliminating newlines from fasta entries (if I can find a tool or create a script).

yaximik avatar Mar 14 '13 13:03 yaximik

The problem has to be fixed in the code. I just pointed (to myself) where the change need to be done.

I will look into that once I have closed these issues: #153 #146 #136

So you should reopen the issue.

sebhtml avatar Mar 14 '13 14:03 sebhtml

Evaluation: 5 human-hours

sebhtml avatar Apr 17 '13 15:04 sebhtml

The user used v2.1.0. Last stable is v2.2.0.

sebhtml avatar Aug 27 '13 14:08 sebhtml

Merge this patch in master:

wget https://raw.github.com/sebhtml/patches/master/ray/Ray-v2.2.0-fix-for-amos-and-long-reads.patch

sebhtml avatar Aug 27 '13 14:08 sebhtml

To fix this, multi-line fasta support is required.

sebhtml avatar Aug 28 '13 13:08 sebhtml

https://github.com/sebhtml/ray/commit/20d20c1cef4136281b46e14115631a2326920550

sebhtml avatar Aug 28 '13 13:08 sebhtml

Just in case it helps, here's my Java code for loading multi-line fasta/fastq files:

try {
  System.err.print("Loading from file...");
  boolean isFastQ = false;
  BufferedReader f = new BufferedReader(new FileReader(fileName));
  String nextLine = f.readLine();
  if(nextLine.startsWith("@")){
    isFastQ = true;
  }
  while(nextLine != null){
    String sequenceName = nextLine.substring(1);
    sequenceIDs.put(nextSequenceID++, sequenceName);
    StringBuilder dnaSeq = new StringBuilder();
    nextLine = f.readLine();
    while((nextLine != null) && (!nextLine.startsWith(isFastQ ? "+" : ">"))){
      dnaSeq.append(nextLine);
      nextLine = f.readLine();
    }
    if(isFastQ){
      StringBuilder qual = new StringBuilder();
      do {
        nextLine = f.readLine();
        qual.append(nextLine);
      } while(qual.length() < dnaSeq.length());
      nextLine = f.readLine();
    }
    addSequence(dnaSeq);
  }
  f.close();
  System.err.println(" done!");
} catch (FileNotFoundException e) {
  e.printStackTrace();
  throw(new RuntimeException("File not found"));
} catch (IOException e) {
  e.printStackTrace();
  throw(new RuntimeException("IO exception reading or closing file"));
}

In human-readable terms:

  • read the > or @ symbol for the sequence ID ('>' FASTA, '@' FASTQ)
  • read the next lines until you see a > or + symbol, recording how many bases you've seen
  • read the next lines until you've read the same number of characters as the sequence bases. This deals with '>', '+', and '@' symbols in the quality string
  • wash, rinse, repeat

gringer avatar Sep 16 '13 16:09 gringer

Thanks !

sebhtml avatar Sep 16 '13 16:09 sebhtml