ray
ray copied to clipboard
Cannot use contigs from another assembler
Contigs were created by minia with max contig length 16091 nt. When these contigs (fasta) were used as input to Ray 2.1.0 along with other PE and SE reads (fastq), Ray exited with segmentation error. When minia contigs were not included, the same data set was assembled successfully with max contig/scaffold length of 46428 nt. I am not sure it is a bug or not, but think it would be helpful if contigs from other assemblers could be included. Here is last output to stdout:
........ Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...) [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...) [Loader::load] File: /home/yaximik/AssRefMap/SC/minia/SCMiSeqAllFGMGPGIGclean_k27.contigs.fasta (please wait...) [G5NNJN1:07040] *** Process received signal *** [G5NNJN1:07040] Signal: Segmentation fault (11) [G5NNJN1:07040] Signal code: (128)
[G5NNJN1:07040] Failing at address: (nil)
mpiexec noticed that process rank 0 with PID 7040 on node G5NNJN1 exited on signal 11 (Segmentation fault).
The problem is due to a lack of support for multi-line fasta files.
=> code/plugin_SequencesLoader/FastaLoader.cpp
Does this show where problem is (the code segment) or points to a replacement file with the fix? If the problem is in this code segment, I presume it can be circumvented by eliminating newlines from fasta entries (if I can find a tool or create a script).
The problem has to be fixed in the code. I just pointed (to myself) where the change need to be done.
I will look into that once I have closed these issues: #153 #146 #136
So you should reopen the issue.
Evaluation: 5 human-hours
The user used v2.1.0. Last stable is v2.2.0.
Merge this patch in master:
wget https://raw.github.com/sebhtml/patches/master/ray/Ray-v2.2.0-fix-for-amos-and-long-reads.patch
To fix this, multi-line fasta support is required.
https://github.com/sebhtml/ray/commit/20d20c1cef4136281b46e14115631a2326920550
Just in case it helps, here's my Java code for loading multi-line fasta/fastq files:
try {
System.err.print("Loading from file...");
boolean isFastQ = false;
BufferedReader f = new BufferedReader(new FileReader(fileName));
String nextLine = f.readLine();
if(nextLine.startsWith("@")){
isFastQ = true;
}
while(nextLine != null){
String sequenceName = nextLine.substring(1);
sequenceIDs.put(nextSequenceID++, sequenceName);
StringBuilder dnaSeq = new StringBuilder();
nextLine = f.readLine();
while((nextLine != null) && (!nextLine.startsWith(isFastQ ? "+" : ">"))){
dnaSeq.append(nextLine);
nextLine = f.readLine();
}
if(isFastQ){
StringBuilder qual = new StringBuilder();
do {
nextLine = f.readLine();
qual.append(nextLine);
} while(qual.length() < dnaSeq.length());
nextLine = f.readLine();
}
addSequence(dnaSeq);
}
f.close();
System.err.println(" done!");
} catch (FileNotFoundException e) {
e.printStackTrace();
throw(new RuntimeException("File not found"));
} catch (IOException e) {
e.printStackTrace();
throw(new RuntimeException("IO exception reading or closing file"));
}
In human-readable terms:
- read the > or @ symbol for the sequence ID ('>' FASTA, '@' FASTQ)
- read the next lines until you see a > or + symbol, recording how many bases you've seen
- read the next lines until you've read the same number of characters as the sequence bases. This deals with '>', '+', and '@' symbols in the quality string
- wash, rinse, repeat
Thanks !