SO-Ontologies
SO-compliant annotation of mobile element integration sites
Hi all,

I have a question and/or suggestion regarding the annotation of mobile element integration sites. We are developing de novo transposable element (TE) annotation software that outputs its results in GFF3 format. In particular, we determine and output terminal repeats (direct or inverted), target site duplications, ORFs, and homology-based features such as pHMM matches in the region between the terminal repeats. On the basis of these annotations, we use our feature graph processing infrastructure to further enhance and improve the results.

To ensure full compliance with the SO in terms of relationship compatibility, I am now wondering how to integrate all these data into a feature graph. The main problem is where to put the target_site_duplications (TSD, SO:0000434) flanking the TE insertion. Obviously they are not part of the integrated element itself, but they are, in my opinion, still connected to the particular integration site and should be linked to it in some way. For now we output the TSDs and the element annotation as children of a repeat_region feature, e.g. in the case of LTR retrotransposons:

repeat_region (SO:0000657)
  -- target_site_duplication (SO:0000434)
  -- LTR_retrotransposon (SO:0000186)
    -- long_terminal_repeat (SO:0000286)
    -- ...

which is not really SO compliant yet. I see that there is a derives_from relationship between the target_site_duplication and the transposable_element, but in GFF3 only part_of relationships are the basis for parent-child assignments, so the TSDs would not be part of the connected component representing one integrated element. Is there an alternative?
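As an illustration of the connected-component concern raised above, here is a minimal Python sketch that groups GFF3 features by their Parent attributes. The seqid, source name, coordinates, and IDs are invented for this example, and nesting long_terminal_repeat under LTR_retrotransposon is an assumption about the layout. It is not the pipeline described in this post, only a way to see which features end up in the same component.

```python
from collections import defaultdict

# Hypothetical GFF3 records; seqid, source, coordinates, and IDs are invented.
GFF3 = """\
##gff-version 3
chr1\tsketch\trepeat_region\t1000\t6010\t.\t+\t.\tID=rr1
chr1\tsketch\ttarget_site_duplication\t1000\t1004\t.\t+\t.\tParent=rr1
chr1\tsketch\tLTR_retrotransposon\t1005\t6005\t.\t+\t.\tID=ltr1;Parent=rr1
chr1\tsketch\tlong_terminal_repeat\t1005\t1300\t.\t+\t.\tParent=ltr1
chr1\tsketch\tlong_terminal_repeat\t5710\t6005\t.\t+\t.\tParent=ltr1
chr1\tsketch\ttarget_site_duplication\t6006\t6010\t.\t+\t.\tParent=rr1
chr1\tsketch\ttarget_site_duplication\t9000\t9004\t.\t+\t.\tID=orphan_tsd
"""


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding in this sketch)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def connected_components(gff3_text):
    """Group feature types into components linked by Parent attributes."""
    root = {}  # union-find forest keyed by feature ID

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    types = {}
    for n, line in enumerate(gff3_text.splitlines()):
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        # Features without an ID get a synthetic name based on the line index.
        node = attrs.get("ID", f"_line{n}")
        types[node] = cols[2]
        find(node)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                root[find(node)] = find(parent)  # merge child and parent sets

    groups = defaultdict(list)
    for node, so_type in types.items():
        groups[find(node)].append(so_type)
    return sorted(sorted(group) for group in groups.values())


for component in connected_components(GFF3):
    print(component)
```

In this sketch the two TSDs attached to the repeat_region land in the same component as the LTR retrotransposon, while the TSD written as a top-level feature forms a component of its own and would have to be re-associated in a later step.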
I could not find any SO type which represents an insertion site in a structural way, capturing both the inserted element and its effect on the integration site via a part_of relationship, e.g.:

transposable_element_integration_site (new type)
  -- target_site_duplication (SO:0000434)
  -- LTR_retrotransposon (SO:0000186)
    -- long_terminal_repeat (SO:0000286)
    -- ...

or, respectively,

transposable_element_integration_site (new type)
  -- target_site_duplication (SO:0000434)
  -- terminal_inverted_repeat_element
    -- terminal_inverted_repeat
    -- ...

Another question is how to handle matches, e.g. protein_match (SO:0000349) or ORFs (SO:0000236), correctly. As far as I can see, the only way to have information about internal functional or coding regions attached to a transposon annotation is via the transposable_element_gene (SO:0000111) type. However, it is not always possible to reconstruct genes from such matches, particularly in degenerated old insertions. Nevertheless, we would like to store the matches with the predicted elements to allow later postprocessing (e.g. filtering) on the basis of these matches. Is there a preferred way to handle this?

For both cases (TSDs and matches) the obvious way would be to keep them as top-level features, which would force us to combine these individual features again in our iterative pipeline. This pipeline delivers and processes one connected component at a time, e.g. from an input GFF3 file, and we would naturally prefer each component to be one complete integrated element with all associated information.

I am very much looking forward to your input, thanks in advance!

Best regards,
Sascha

This was followed on the mailing list by:

Hello Sascha,

I'm happy to hear that other folks are wrestling with how to represent transposable elements in SO compatible GFF3. I agree that there needs to be some tweaking of SO to more correctly capture the structural biology of transposable elements.
In many eukaryotic genomes, the majority of the sequence features are transposable elements, and not being able to communicate these features in SO compliant GFF3 is a frustration. It seems to me that it should be possible to recognize that a transposable element is a genome that is itself a component of a parent (host) genome. In turn, the transposable element itself can serve as a host for a separate transposable element insertion (i.e. an LTR retrotransposon inserted into another LTR retrotransposon). Something like this would allow transposable elements to have genes/ORFs or alignments that could be annotated as children of an appropriately defined transposable element genome.

Since target site duplications are derived from the host genome that the mobile element inserts into, it makes sense that these remain derives_from features that are part_of the host genome that was inserted into. In some situations the host genome would be the parent host (e.g. rice or maize), while in others the host in the derives_from relationship would be another transposable element that was inserted into.
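One possible reading of this nested-host idea can be sketched in Python. The Feature class, the host_chain helper, and all feature names below are hypothetical and not part of SO or GFF3; the sketch only shows how a TSD could stay part_of whichever genome was inserted into, while an ORF hangs off the element that carries it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical feature node; not an SO or GFF3 data structure."""
    so_type: str
    name: str
    part_of: Optional["Feature"] = None
    derives_from: Optional["Feature"] = None


def host_chain(feature: Feature) -> List[str]:
    """Follow part_of links upward and return the names of all enclosing hosts."""
    chain = []
    node = feature.part_of
    while node is not None:
        chain.append(node.name)
        node = node.part_of
    return chain


# Top-level host genome (invented example).
maize = Feature("genome", "maize")

# An LTR retrotransposon inserted into the host, and a second one nested in it.
ltr_a = Feature("LTR_retrotransposon", "LTR_A", part_of=maize)
ltr_b = Feature("LTR_retrotransposon", "LTR_B", part_of=ltr_a)

# Assumption: each TSD is part_of and derives_from the genome that was inserted into.
tsd_a = Feature("target_site_duplication", "TSD_A", part_of=maize, derives_from=maize)
tsd_b = Feature("target_site_duplication", "TSD_B", part_of=ltr_a, derives_from=ltr_a)

# An ORF annotated as a child of the element that carries it.
orf_b = Feature("ORF", "ORF_B1", part_of=ltr_b)

print(host_chain(tsd_a))
print(host_chain(tsd_b))
print(host_chain(orf_b))
```

Under these assumptions the TSD of the nested element resolves to LTR_A as its immediate host, with maize one level further up, which mirrors the two situations described in the reply.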