metAMOS
ORFs sequences not found
Hello, I've been running some tests with 454 data. I'm running metAMOS with MetaVelvet and FragGeneScan, stopping at the FindORFS step. I initially wanted to use MetaGeneMark, but it's not available for now. In the FindORFS outputs there is no proba.orf.faa/fna. I guess I could write a script to create it from proba.ctg.fna and proba.orfs, but I don't understand the logic of these files. For instance, what do the nodes represent? And why are they split into several sequences?
NODE_112_length_527_cov_1.609108_17_274_-
GCGGAAGTGTTGCAAATTTGGGTGGATGGCAGTTGTACCGGCAATCAAAACCAGCCGGGCAAGATTGCGGGTCCTTTGCCGAAAGCCCGGTCAGCCGCTTATTTTCCCAGTCTTCAAATAGGCCGCAGTTGTGATTTACCGGGGGCAACACAAACCAACATTCGAGCTGAATTGTTTGCGGTCTTGTTAGCCTTTGCGGAATTAGAGCGAATGGGTATTCAAGCCGGCCATCTCGAATTTTTACCGATTGTT
NODE_112_length_527_cov_1.609108_280_557_-
CGATTGCCTTGCAAAAGTTACAAATCGAGACTCTCAAGTCACAAAATGTCCCAGCCGCCGCATAATGTCTCCGATCCGGAAGAAGTCGAAGTCGCTCAAGATTTGATTACAAAACTGGTCATCCGCGCCGACGTCGAGTTATCCGGCTTTCAACAACGCGTGGAATCCCGGCTCCAAGAACAAATCGAAGAATTGCGGACGGAAAACCGGATCTTGGCGGTTGGCGTTGGAATTGCCCTTTTACTGGGTCTTGTCCGGTGGGCTTCATGTCTT
Some nodes appear only once, but the sequence doesn't have the length written in the header:
NODE_2_length_63_cov_4.253968_1_93_+
ATCATGGCAAATATGGGATTGATTTTACCTAAAAATAATTTCCTCTCTGAGGTCAGAAAGATCACAAAGGACAATGATATCCCTCTAATT
(this one is longer than 90bp)
I hope I'm being clear..
The proba.orf.faa/fna files only become available in Postprocess/out once the pipeline completes. In FindORFS/out they are named just proba.faa/fna. (The files proba.faa/fna and proba.ctg.fna/faa should be the same unless you turn on ORF calling on sequences, which is off by default.)
You do not need to do any further parsing of the files. The names are generated by FragGeneScan and are constructed as
Ok, but then how is it possible to find ORFs longer than the contig itself? And why, in Assemble/out, do I find files with similar headers (proba.fna/metavelvet31.fna):
NODE_14_length_151_cov_1.000000_1_181_+ even though at this step of the pipeline the ORFs haven't been searched for yet? In this directory, contigs.fa/meta-velvetg.contigs.fa seem to hold the contig sequences, but the lengths in the headers don't match the sequences themselves. I'm concerned because I want to write a script that selects contigs longer than 300, but it seems I can't trust the length written in the header, so I will have to count the bases of each sequence.
Don't hesitate to tell me if I'm asking inappropriate questions and should look for documentation elsewhere (the MetaVelvet manual or similar)!
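Counting the bases of each sequence, as described above, does not depend on anything in the header. A minimal sketch in Python, assuming a standard FASTA file whose header lines start with `>`; the names `read_fasta` and `filter_by_length` are illustrative and not part of metAMOS, and the in-memory `example` records are made up for the demonstration:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=300):
    """Keep records whose counted sequence is longer than min_length bases."""
    return [(h, s) for h, s in records if len(s) > min_length]


# Made-up records; a real run would pass an open file handle instead.
example = [">a", "ACGT", "ACGT", ">b", "AC"]
kept = filter_by_length(read_fasta(example), min_length=4)
for header, seq in kept:
    print(f">{header} ({len(seq)} bases)")
```

For a real file, `read_fasta(open(path))` can replace the list, with `path` pointing at whichever contig file is being filtered.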
The fna file in the Assemble/out directory is also the same as the FindORFS/proba.fna. The same file is hard-linked from several places within metAMOS's internal directories. The assembly with no ORF calling is in Assemble/out/proba.asm.contig
As for the length of the gene call versus the contig, it is possible that one of the programs (FragGeneScan or MetaVelvet) is not reporting the size correctly. Have you checked the size of the entry in the fna file to see if it matches the Velvet length or the FragGeneScan length?
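One way to run that check is to parse the numbers out of each header and compare them. The sketch below is in Python and rests on two assumptions that are not confirmed in this thread: that the headers follow the pattern seen in the examples (a Velvet-style `NODE_<id>_length_<n>_cov_<c>` name, optionally followed by `_<start>_<end>_<strand>`), and that k = 31, guessed from the file name metavelvet31.fna. Velvet documents its header length as a count of k-mers, so the nucleotide length would be that value plus k - 1. The function names are illustrative.

```python
import re

# Pattern inferred from the headers quoted in this thread (an assumption).
HEADER_RE = re.compile(
    r"^>?NODE_(?P<node>\d+)_length_(?P<kmer_len>\d+)_cov_(?P<cov>\d+(?:\.\d+)?)"
    r"(?:_(?P<start>\d+)_(?P<end>\d+)_(?P<strand>[+-]))?$"
)


def parse_header(header):
    """Split a NODE_... header into its numeric fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    fields = match.groupdict()
    return {
        "node": int(fields["node"]),
        "kmer_length": int(fields["kmer_len"]),
        "coverage": float(fields["cov"]),
        "start": int(fields["start"]) if fields["start"] else None,
        "end": int(fields["end"]) if fields["end"] else None,
        "strand": fields["strand"],
    }


def expected_nucleotide_length(kmer_length, k=31):
    """Convert a length counted in k-mers to nucleotides (k=31 is assumed)."""
    return kmer_length + k - 1


for name in (
    "NODE_112_length_527_cov_1.609108_280_557_-",
    "NODE_2_length_63_cov_4.253968_1_93_+",
    "NODE_14_length_151_cov_1.000000_1_181_+",
):
    info = parse_header(name)
    print(name, expected_nucleotide_length(info["kmer_length"]), info["end"])
```

For the three headers quoted in this thread, the converted length equals the last coordinate in the name (557, 93 and 181). That is consistent with the header length being counted in k-mers rather than with a misreported size, though it does not prove it; comparing against the counted bases of each entry remains the direct test.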