Traceability of the source analysis or project

Open colthom opened this issue 5 years ago • 0 comments

Hello,

In recent years, I have deployed Tripal-based websites (and therefore Chado) for several agronomic research projects. Both Chado and Tripal have proved to be satisfactory tools. However, for our needs, Chado has a weakness in one area: traceability.

After reviewing the issues submitted here, I found several that could be related:

  • Annotating a feature with multiple types of cvterms #75
  • Changes to the Project and Biomaterial table #41
  • direct way to link analysis and biomaterial? #77
  • organism_analysis linker table #59
  • suggested mapping for NCBI databases #76
  • project_phenotype table? #62

Traceability of the source analysis/project

With our data, we encounter this problem for all kinds of feature annotations (cross references, annotations with controlled terms, properties, or even relationships with other features). In each case, we would like to keep track of which analysis assigned which annotation. I feel that this is necessary information. It also makes it possible to update annotations when a better analysis is run: the system can detect which annotations are obsolete.

However, a direct link between an 'analysis' and, for example, a 'feature<->cvterm' association is currently not available. The link goes through the 'feature' table and hence loses the 1-1 relationship: analysis -- analysisfeature -- feature -- feature_cvterm
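The ambiguity can be reproduced with a minimal sketch. It assumes heavily simplified stand-ins for the Chado tables (the real tables have more columns and constraints), uses SQLite instead of PostgreSQL, and calls the analysis-to-feature linker `analysisfeature`. The rows are made up for illustration.

```python
import sqlite3

# Simplified stand-ins for the Chado tables (assumption: real tables have more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE analysisfeature (analysis_id INTEGER, feature_id INTEGER);
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER,
    cvterm TEXT
);

-- Made-up data: one feature touched by two analyses, carrying one annotation.
INSERT INTO analysis VALUES (1, 'analysis A'), (2, 'analysis B');
INSERT INTO feature VALUES (10, 'mRNA_0001');
INSERT INTO analysisfeature VALUES (1, 10), (2, 10);
INSERT INTO feature_cvterm VALUES (100, 10, 'GO:0005524');
""")

# Going through the feature is the only path from annotation to analysis.
rows = conn.execute("""
SELECT a.name, fc.cvterm
FROM feature_cvterm fc
JOIN analysisfeature af ON af.feature_id = fc.feature_id
JOIN analysis a ON a.analysis_id = af.analysis_id
ORDER BY a.analysis_id
""").fetchall()

for name, cvterm in rows:
    print(name, cvterm)
```

The single annotation comes back once per analysis linked to the feature, so the query cannot tell which analysis actually assigned it.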

There are workarounds for this problem, as in the Tripal Analysis Interpro/Kegg/GO modules. These modules use the 'analysisfeatureprop' table to store annotation results in a parsable format. Additionally, some use the 'analysisfeatureprop.type_id' field (which is a cvterm_id) to link to an annotation (GO term or IPR domain). This approach is effective (at least for cvterm annotations), but it doesn't seem to comply with best practices.

As I see it, there are 3 possible cases for the link between an annotation and its source analysis:

  • Each feature<->[annotation] association can only have ONE source analysis/project. In this case, an additional 'analysis_id' (or 'project_id') field in the feature_annotation linker table could be enough.
  • Each feature<->[annotation] association could be assigned by multiple analyses. In this case, a new linker table like 'feature_[annotation]_analysis' is required.
  • Each feature<->[annotation] association could be assigned by multiple analyses, but with a primary one. In this case, both a new "analysis_id" field and a new linker table could be considered. (I haven't encountered this case with our feature annotations, but it could occur on a larger scale with other types of data, and it provides flexibility.)
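To make the difference between the first case and the other two concrete, here is a minimal sketch of the single-column option. It uses SQLite, a simplified `feature_cvterm` table, and a hypothetical `analysis_id` column that is not part of the current Chado schema. The rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Case 1: one source analysis per association, stored as a column.
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    cvterm_id INTEGER NOT NULL,
    analysis_id INTEGER REFERENCES analysis (analysis_id)  -- hypothetical added column
);

INSERT INTO analysis VALUES (1, 'older run'), (2, 'newer run');
INSERT INTO feature_cvterm VALUES (100, 10, 500, 1);
""")

# A newer analysis that confirms the same association can only replace the
# previous source; the column has no room for both.
conn.execute("UPDATE feature_cvterm SET analysis_id = 2 WHERE feature_cvterm_id = 100")
sources = conn.execute(
    "SELECT analysis_id FROM feature_cvterm WHERE feature_cvterm_id = 100"
).fetchall()
print(sources)
```

With a single column, the record that the older run also produced this annotation is lost on update. That is the situation the linker table in the second and third cases is meant to cover.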

We experimented with adding "[table]_analysis" linker tables to our system. They all had the following format:

| Column | Type | Modifiers | Constraints |
|---|---|---|---|
| [table]_analysis_id | serial | NOT NULL | primary key |
| [table]_id | int | NOT NULL | foreign key ([table]) |
| analysis_id | int | NOT NULL | foreign key (analysis) |
| association_type_id | int | Default=1 | foreign key (cvterm) |

Unique key: [table]_id + analysis_id
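As an illustration, the format above could be instantiated for `feature_cvterm` roughly as follows. This is a sketch under assumptions: it runs on SQLite rather than PostgreSQL (so `serial` becomes `INTEGER PRIMARY KEY`), the surrounding tables are reduced to a few columns, and the name `feature_cvterm_analysis` and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    cvterm_id INTEGER NOT NULL
);

-- Hypothetical linker table following the "[table]_analysis" format.
CREATE TABLE feature_cvterm_analysis (
    feature_cvterm_analysis_id INTEGER PRIMARY KEY,
    feature_cvterm_id INTEGER NOT NULL REFERENCES feature_cvterm (feature_cvterm_id),
    analysis_id INTEGER NOT NULL REFERENCES analysis (analysis_id),
    association_type_id INTEGER DEFAULT 1 REFERENCES cvterm (cvterm_id),
    UNIQUE (feature_cvterm_id, analysis_id)
);

-- Made-up rows: cvterm 1 plays the role of the default association type.
INSERT INTO cvterm VALUES (1, 'default association type'), (500, 'example term');
INSERT INTO analysis VALUES (1, 'run A'), (2, 'run B');
INSERT INTO feature_cvterm VALUES (100, 10, 500);
INSERT INTO feature_cvterm_analysis (feature_cvterm_id, analysis_id)
VALUES (100, 1), (100, 2);
""")

# The unique key rejects recording the same (association, analysis) pair twice.
try:
    conn.execute(
        "INSERT INTO feature_cvterm_analysis (feature_cvterm_id, analysis_id) VALUES (100, 1)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_sources = conn.execute(
    "SELECT COUNT(*) FROM feature_cvterm_analysis WHERE feature_cvterm_id = 100"
).fetchone()[0]
print(duplicate_rejected, n_sources)
```

The same association can be attributed to several analyses, while the unique key keeps each attribution to one row.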

Where 'association_type_id' could refer to:

  • a simple piece of additional information on the association, like the class of annotation it falls into ("Domain", "Gene Family", "Pathway", ...)
  • or, in a more semantic-web-friendly way, the type of association/relationship, like an RDF "predicate" (for example here, "contains", "is_member_of", "involved_in", "in homology relationship with", ...) (cf. "relationship type" in the TAIR database and issue #75)
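The second reading can be sketched in a few lines: once the association type is resolved to a term name, each sourced association reads like a subject/predicate/object statement with its provenance attached. The class, the function and the rows below are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class SourcedAssociation(NamedTuple):
    feature: str
    predicate: str   # name of the cvterm referenced by association_type_id
    annotation: str
    analysis: str    # name of the source analysis


# Made-up rows, as they might come out of a join on a "[table]_analysis" table.
rows = [
    SourcedAssociation("mRNA_0001", "contains", "domain X", "analysis A"),
    SourcedAssociation("mRNA_0001", "involved_in", "pathway Y", "analysis B"),
]


def as_statement(row: SourcedAssociation) -> str:
    """Render one association as a predicate-style statement with its source."""
    return f"{row.feature} {row.predicate} {row.annotation} [source: {row.analysis}]"


statements = [as_statement(r) for r in rows]
for s in statements:
    print(s)
```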

Below is an example of a possible result: on a Tripal2 mRNA page, a new column provides information on the origin of each annotation.

[Image: Example_annotation_analysis]

I have presented here the traceability problem we encountered with feature annotations, but the question is much broader: it applies to every object and piece of information that can originate from an analysis or a project.

Metadata, Minimum Information Standards and FAIR principles compliance

On an even larger scale, I don't know whether there has been any shared thinking on how to store metadata, or provenance information in general, in Chado.

Nowadays, biological data are expected to be accompanied by complete metadata, following the FAIR principles. Minimum information guidelines are provided by the relevant communities for each kind of experiment (https://en.wikipedia.org/wiki/Minimum_Information_Standards). More and more publishers require this metadata.

It could be useful to provide a way to store this information in Chado, and to encourage users to use it, even at a very basic level. Keeping track of the experiment or process that produced each piece of data seems to be one of the first steps.

Metadata can take various shapes. One of the main projects aiming to provide a standardized model for metadata is the ISA common initiative. Their ISA data model is a complete metadata framework based on a 3-level concept: 'Investigation' (the project context), 'Study' (a unit of research) and 'Assay' (an analytical measurement). A Study has associated Assays. All Assays of a Study are run on the same biological material, or a subpart of it (https://isa-tools.org/format/specification.html). The ISA data model is used in Nature's 'Scientific Data' project (http://scientificdata.isa-explorer.org/) and, as inspiration, in the SEEK platform and its public instance, FAIRDOMHub (https://fairdomhub.org/).
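The three-level structure can be sketched as plain data classes. This is a deliberately reduced illustration: the ISA specification defines many more fields, and the class attributes, the `add_assay` helper and the example values are assumptions made here, not part of ISA or Chado.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Assay:
    """An analytical measurement."""
    name: str
    material: str  # biological material, or a subpart of the Study's material


@dataclass
class Study:
    """A unit of research, with its associated Assays."""
    title: str
    material: str
    assays: List[Assay] = field(default_factory=list)

    def add_assay(self, name: str, material: Optional[str] = None) -> Assay:
        # By default an Assay runs on the Study's own material.
        assay = Assay(name, material or self.material)
        self.assays.append(assay)
        return assay


@dataclass
class Investigation:
    """The project context, grouping Studies."""
    title: str
    studies: List[Study] = field(default_factory=list)


# Made-up example values.
study = Study("example study", material="leaf samples")
study.add_assay("assay 1")
study.add_assay("assay 2", material="leaf samples, subsample B")
investigation = Investigation("example project", studies=[study])
assay_count = sum(len(s.assays) for s in investigation.studies)
print(assay_count)
```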

There may be something to explore between the "analysis" and "project" components of Chado and the "Assay" and "Study" concepts of ISA.

I'm sure you've already thought about this kind of issue. Are there clear guidelines on the best use of the "analysis" and "project" concepts in Chado? For example, could an "analysis" take a more generic role than originally intended within the Companalysis module, and become the universal object representing an experiment that produces data? And could a "project" be a way to group associated analyses? (And maybe be linked directly to biological material, as in the ISA model.)

Or could part of the MAGE module, which already implements the "Assay" and "Study" concepts, be used more broadly, as suggested in issue #41? But wouldn't the "Analysis" and "Assay" concepts then be somewhat redundant?

Of course, this subject has many more ramifications, such as the management of contact information, or the description of the process and experiment in a more detailed, workflow-like way.

Sorry for the long post; maybe I should have split it up. But I would be very happy to read your thoughts on these subjects!

colthom, Jun 16 '19 19:06