Let's talk expression data!

Open sammyjava opened this issue 4 years ago • 12 comments

As per the discussion in the community call today, I am hereby creating an "issue" to stimulate creation of a core expression data model so we can build tools that are commonly usable. (Note that Strain is implemented since we load expression data from multiple strains of a given legume species.)

To get things started, here's the data model that I use in the LIS mines:

<class name="ExpressionSource" is-interface="true">
        <attribute name="unit" type="java.lang.String"/>
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <collection name="samples" referenced-type="ExpressionSample" reverse-reference="source"/>
</class>

<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>

<class name="ExpressionValue" is-interface="true">
        <attribute name="value" type="java.lang.Double"/>
        <reference name="gene" referenced-type="Gene"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
</class>

sammyjava, Mar 12 '20 19:03

And here's a sample of data that is held by that data model:

-[ RECORD 1 ]-----+---------------------------------------------------------
id                | 38202165
unit              | TPM
primaryidentifier | Gene expression atlas of pigeonpea Asha(ICPL87119)
datasetid         | 38202163
class             | org.intermine.model.bio.ExpressionSource
-[ RECORD 1 ]-----+---------------------------------------------------------
num               | 1
description       | Mature seed at Reproductive stage (SRR5199304)
id                | 37000003
biosample         | SAMN06264156
name              | Mature seed at reprod (SRR5199304)
primaryidentifier | SRR5199304
organismid        | 5235944
strainid          | 5235945
datasetid         | 38202163
sourceid          | 38202165
class             | org.intermine.model.bio.ExpressionSample
-[ RECORD 2 ]-----+---------------------------------------------------------
num               | 2
description       | Immature seed at Reproductive stage (SRR5199305)
id                | 37000005
biosample         | SAMN06264155
name              | Immature seed at reprod (SRR5199305)
primaryidentifier | SRR5199305
organismid        | 5235944
strainid          | 5235945
datasetid         | 38202163
sourceid          | 38202165
class             | org.intermine.model.bio.ExpressionSample
-[ RECORD 3 ]-----+---------------------------------------------------------
num               | 3
description       | Mature pod at Reproductive stage (SRR5199306)
id                | 37000007
biosample         | SAMN06264154
name              | Mature pod at reprod (SRR5199306)
primaryidentifier | SRR5199306
organismid        | 5235944
strainid          | 5235945
datasetid         | 38202163
sourceid          | 38202165
class             | org.intermine.model.bio.ExpressionSample
-[ RECORD 1 ]---+----------------------------------------
intermine_value | 0
id              | 37000002
geneid          | 5235941
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue
-[ RECORD 2 ]---+----------------------------------------
intermine_value | 1.21
id              | 37000062
geneid          | 5235946
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue
-[ RECORD 3 ]---+----------------------------------------
intermine_value | 4.29
id              | 37000092
geneid          | 5235948
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue
-[ RECORD 4 ]---+----------------------------------------
intermine_value | 11.43
id              | 37000122
geneid          | 5235950
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue

sammyjava, Mar 12 '20 19:03

Thanks @sammyjava! @rachellyne and @sergiocontrino let's discuss during the next Monday meeting

danielabutano, Mar 13 '20 10:03

thank you @sammyjava, i think you are right that it would be nice to have a common basic model for this sooner rather than later. regarding the one you are using i have a few initial questions:

  • should the referenced type in ExpressionValue be a more generic bioentity?
  • what is the role of 'num' in ExpressionSample? is not the primaryidentifier enough?
  • maybe you could comment on using ExpressionSource, in particular this vs extending dataset and unit attribute here rather than in ExpressionValue (i think i can see the reasoning, would be nice to have your experience on that).
  • is this working for time-course experiments? thanks!

sergiocontrino, Mar 13 '20 11:03

Just saw this, Sergio!

* should the referenced type in ExpressionValue be a more generic bioentity?

Yes. No reason for it to specifically be Gene. I'm not sure about BioEntity, though, I think SequenceFeature would be more accurate. Proteins don't express but transposons can.

* what is the role of 'num' in ExpressionSample? is not the primaryidentifier enough?

That's for ordering the samples for user convenience, such as on a heat map axis. It's nice to have all the leaf-related tissues together and then the seed-related ones, etc. It can be left null.

* maybe you could comment on using ExpressionSource, in particular this vs extending dataset and unit attribute here rather than in  ExpressionValue (i think i can see the reasoning, would be nice to have your experience on that).

For extensibility. I extend ExpressionSource enormously in my mines, with all sorts of extra attributes (SRA identifier, library prep details, all sorts of things that come up in RNA-seq experiments), and I don't think I want to extend DataSet with all that stuff. I do agree, though, that unit should reside with ExpressionValue, since it is the unit of that value (e.g. "TPM").

* is this working for time-course experiments?

I haven't really thought about how to deal with time-course data in InterMine. Of course you can add an attribute "time" to ExpressionValue and have values across time. I don't have any time-course data at the moment, although my first job in bioinformatics was dealing with Arabidopsis time-course experiments, for which I wrote a fairly big webapp.
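
A minimal sketch of that idea, based on the ExpressionValue class from the first post. The "time" attribute, its Double type, and the assumption that its unit is fixed by convention within a mine are all illustrative, not part of any agreed model:

```xml
<!-- Sketch only: "time" is a hypothetical attribute for time-course data -->
<class name="ExpressionValue" is-interface="true">
        <attribute name="value" type="java.lang.Double"/>
        <!-- hypothetical: elapsed time at which the value was measured; left null for non-time-course data -->
        <attribute name="time" type="java.lang.Double"/>
        <reference name="gene" referenced-type="Gene"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
</class>
```

Whether time belongs on the value or on the sample is an open design question; if each time point is its own sample, it would arguably sit on ExpressionSample instead.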

sammyjava, Jun 12 '20 16:06

Updates per Sergio's suggestions.

<class name="ExpressionSource" is-interface="true">
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <collection name="samples" referenced-type="ExpressionSample" reverse-reference="source"/>
</class>

<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>

<class name="ExpressionValue" is-interface="true">
        <attribute name="unit" type="java.lang.String"/>
        <attribute name="value" type="java.lang.Double"/>
        <reference name="feature" referenced-type="SequenceFeature"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
</class>

sammyjava, Jun 12 '20 16:06

FYI, this is my current model. I've got NCBI-related fields in there (sra, bioProject, bioSample, geoSeries) as well as some others which should probably not be in the core model, but I thought I'd show you what I'm using. I also changed primaryIdentifier to identifier to make it clear that these classes don't extend Annotatable, so publication and dataSet are explicitly listed as references. Also, note that organism and strain are both referenced so that strain is not required.

<class name="ExpressionSource" is-interface="true">
        <attribute name="sra" type="java.lang.String"/>
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="geoSeries" type="java.lang.String"/>
        <attribute name="origin" type="java.lang.String"/>
        <attribute name="shortName" type="java.lang.String"/>
        <reference name="publication" referenced-type="Publication"/>
        <reference name="bioProject" referenced-type="BioProject"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <collection name="samples" referenced-type="ExpressionSample" reverse-reference="source"/>
</class>

<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>

<class name="ExpressionValue" extends="java.lang.Object" is-interface="false">
        <attribute name="value" type="java.lang.Double"/>
        <attribute name="unit" type="java.lang.String"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
        <reference name="feature" referenced-type="SequenceFeature"/>
</class>

sammyjava, Jun 25 '20 16:06

I can't comment on implementation details, but perhaps one point worth considering:

"Proteins don't express but transposons can."

What about protein-level expression data, e.g. from quantitative mass spec? You can have isoform-specific expression data that you couldn't capture properly on the gene level. Not sure if you want this to be "in scope" for the current proposal.

hendrikweisser, Oct 09 '20 09:10

Interesting. But I think it makes sense to limit the "things that express" to sequence features, which proteins are not. Just enforcing the "central dogma" really. I think you can map the proteins to transcript isoforms, and they often have the exact same name (gene.1, gene.2); they certainly do in all the LIS mines by specification. It certainly makes sense to store expression relative to transcripts, not genes.

sammyjava, Oct 09 '20 14:10

Apologies that it has taken us so long to get back to this. I really like the idea of a core expression model (or possibly a couple of core expression models to cover different expression techniques, see below). I think this needs some discussion, and we should probably also take into account the visualizations we already have (and those we would like to add). Our problem here in Cambridge is that we have multiple expression models that cover different data and techniques: RNA-seq, microarray and in-situ hybridisation. It would be good to revisit these models; some were created many years ago, before RNA-seq was really even a thing, so historically we have a bit of a mishmash. Our RNA-seq data would fit nicely into the model proposed above. The microarray and in-situ models are more complex. For instance, for the microarrays we have two samples, multiple expression scores (various Affymetrix measurements), info on probes, etc. I'll put a couple of our models below.

rachellyne, Oct 20 '20 10:10

RNA-seq:

  <class name="RNASeqResult" is-interface="true">
    <attribute name="expressionScore" type="java.lang.Double"/>
    <attribute name="tissue" type="java.lang.String"/>
    <attribute name="expressionType" type="java.lang.String"/>
    <reference name="gene" referenced-type="Gene" reverse-reference="rnaSeqResults"/>
    <collection name="dataSets" referenced-type="DataSet" />
  </class>

rachellyne, Oct 20 '20 10:10

Affymetrix arrays:

<class name="FlyAtlasResult" extends="MicroArrayResult" is-interface="true">
  <attribute name="affyCall" type="java.lang.String"/>
  <attribute name="presentCall" type="java.lang.Integer"/>
  <attribute name="enrichment" type="java.lang.Double"/>
  <attribute name="mRNASignal" type="java.lang.Double"/>
  <attribute name="mRNASignalSEM" type="java.lang.Double"/>
  <reference name="tissue" referenced-type="Tissue" reverse-reference="expressionResults"/>
</class>
<class name="Tissue" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <collection name="expressionResults" referenced-type="FlyAtlasResult" reverse-reference="tissue"/>
</class>
<class name="MicroArrayResult" is-interface="true">
  <attribute name="scale" type="java.lang.String"/>
  <attribute name="type" type="java.lang.String"/>
  <attribute name="isControl" type="java.lang.Boolean"/>
  <attribute name="value" type="java.lang.Float"/>
  <reference name="experiment" referenced-type="MicroArrayExperiment" reverse-reference="results"/>
  <reference name="material" referenced-type="ProbeSet" reverse-reference="results"/>
  <collection name="assays" referenced-type="MicroArrayAssay" reverse-reference="results"/>
  <collection name="reporters" referenced-type="Reporter" reverse-reference="results"/>
  <collection name="genes" referenced-type="Gene" reverse-reference="microArrayResults"/>
  <collection name="samples" referenced-type="Sample"/>
  <collection name="dataSets" referenced-type="DataSet"/>
</class>
<class name="MicroArrayExperiment" is-interface="true">
  <attribute name="identifier" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
  <attribute name="name" type="java.lang.String"/>
  <collection name="assays" referenced-type="MicroArrayAssay" reverse-reference="experiment"/>
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="experiment"/>
</class>
<class name="MicroArrayAssay" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
  <attribute name="sample1" type="java.lang.String"/>
  <attribute name="sample2" type="java.lang.String"/>
  <attribute name="displayOrder" type="java.lang.Integer"/>
  <reference name="experiment" referenced-type="MicroArrayExperiment" reverse-reference="assays"/>
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="assays"/>
  <collection name="samples" referenced-type="Sample" reverse-reference="assays"/>
</class>
<class name="Sample" extends="BioEntity" is-interface="true">
  <attribute name="materialType" type="java.lang.String"/>
  <attribute name="name" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
  <attribute name="primaryCharacteristic" type="java.lang.String"/>
  <attribute name="primaryCharacteristicType" type="java.lang.String"/>
  <collection name="assays" referenced-type="MicroArrayAssay" reverse-reference="samples"/>
  <collection name="characteristics" referenced-type="SampleCharacteristic"/>
  <collection name="treatments" referenced-type="Treatment"/>
</class>
<class name="SampleCharacteristic" is-interface="true">
  <attribute name="type" type="java.lang.String"/>
  <attribute name="value" type="java.lang.String"/>
  <reference name="ontologyTerm" referenced-type="OntologyTerm"/>
</class>
<class name="Treatment" is-interface="true">
  <attribute name="action" type="java.lang.String"/>
  <collection name="protocols" referenced-type="Protocol"/>
  <collection name="parameters" referenced-type="TreatmentParameter" reverse-reference="treatment"/>
</class>
<class name="TreatmentParameter" is-interface="true">
  <attribute name="type" type="java.lang.String"/>
  <attribute name="value" type="java.lang.String"/>
  <attribute name="units" type="java.lang.String"/>
  <reference name="treatment" referenced-type="Treatment" reverse-reference="parameters"/>
</class>
<class name="Protocol" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
</class>
<class name="ProbeSet" extends="BioEntity" is-interface="true">
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="material"/>
</class>
<class name="Reporter" is-interface="true">
  <attribute name="isControl" type="java.lang.Boolean"/>
  <attribute name="failType" type="java.lang.String"/>
  <attribute name="controlType" type="java.lang.String"/>
  <reference name="material" referenced-type="BioEntity"/>
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="reporters"/>
</class>
<class name="Gene" is-interface="true">
  <collection name="microArrayResults" referenced-type="MicroArrayResult" reverse-reference="genes"/>
</class>

rachellyne, Oct 20 '20 10:10

Thanks, Rachel. Yes, I had RNA-seq in mind with the proposal, since that's what we're storing in the LIS mines. One comment is that we should be sure to write the core model to handle expression experiments that deal with samples of a single tissue but with various "treatments" (which could be mutations). ExpressionSample should include the tissue attribute, but should also contain what's special about the sample if it's not the tissue. A concrete example is an Arabidopsis experiment I worked on where we had controls plus GR-REV, GR-STM, GR-AS2, and GR-KAN mutant lines. All samples were seedling leaves, and all were treated with dexamethasone for varying times before freezing. So there were mutants and treatments but only seedling leaves.
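
As a rough sketch of what that could look like on the ExpressionSample class posted earlier: the attributes tissue, genotype, treatment and treatmentTime are hypothetical names chosen for illustration, and plain strings are assumed rather than references to ontology terms or a separate Treatment class.

```xml
<!-- Sketch only: tissue, genotype, treatment and treatmentTime are hypothetical additions -->
<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <!-- hypothetical: e.g. "seedling leaf" -->
        <attribute name="tissue" type="java.lang.String"/>
        <!-- hypothetical: e.g. "control" or "GR-REV" -->
        <attribute name="genotype" type="java.lang.String"/>
        <!-- hypothetical: e.g. "dexamethasone" -->
        <attribute name="treatment" type="java.lang.String"/>
        <!-- hypothetical: treatment time before freezing; unit assumed fixed by convention -->
        <attribute name="treatmentTime" type="java.lang.Double"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>
```

All of the new attributes could be left null, so a tissue-atlas experiment would fill in only tissue while the Arabidopsis example would vary genotype and treatmentTime across samples with the same tissue.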

sammyjava, Oct 20 '20 15:10