
[discussion] Setting the weights for memote's tests.

Open ChristianLieven opened this issue 7 years ago • 10 comments

We want to come up with a reasonable weighting of the categories and of the individual tests within them. We are already quite sure that we can draw a broad distinction between soft and hard tests.

Soft = 'Syntax' and 'Annotation'. If a model scores badly here, its predictive capabilities could still be fine; it would only be rather difficult to share the model or for outsiders to use it in a different setup.

Hard = 'Basic', 'Consistency' and 'Biomass' [and 'Experimental']. A bad score in these categories often means a model may not be biologically meaningful or operational. It may not be possible to rely on its predictions.

Let's discuss the details of a possible weighting scheme in here.

ChristianLieven avatar Oct 04 '17 13:10 ChristianLieven

I think we just need to get started with some penalty function so we can have something concrete to discuss.

phantomas1234 avatar Oct 13 '17 07:10 phantomas1234

For tests that output a number of reactions, metabolites, etc., we could use the corresponding fraction as a preliminary score.

For instance: The fraction of blocked reactions (in rich medium) should be low. The fraction of reactions without GPR should be low. The fraction of metabolites without mass and charge should be low. The fraction of metabolites with annotation should be high. etc.
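
To make this concrete, here is a rough sketch (not memote's actual implementation) of how such fractions could be computed with cobrapy; the function name and the exact choice of metrics are just for illustration.

```python
import cobra
from cobra.flux_analysis import find_blocked_reactions


def fraction_metrics(model: cobra.Model) -> dict:
    """Compute a few example fractions that could serve as preliminary scores."""
    n_rxns = len(model.reactions)
    n_mets = len(model.metabolites)

    # Fraction of blocked reactions; opening all exchanges approximates a rich medium.
    blocked = find_blocked_reactions(model, open_exchanges=True)
    frac_blocked = len(blocked) / n_rxns

    # Fraction of reactions without a GPR rule (should be low).
    no_gpr = sum(1 for r in model.reactions if not r.gene_reaction_rule)
    frac_no_gpr = no_gpr / n_rxns

    # Fraction of metabolites lacking a formula or a charge (should be low).
    no_mass_charge = sum(
        1 for m in model.metabolites if not m.formula or m.charge is None
    )
    frac_no_mass_charge = no_mass_charge / n_mets

    # Fraction of metabolites carrying at least one annotation (should be high).
    annotated = sum(1 for m in model.metabolites if m.annotation)
    frac_annotated = annotated / n_mets

    return {
        "blocked_reactions": frac_blocked,
        "reactions_without_gpr": frac_no_gpr,
        "metabolites_without_mass_charge": frac_no_mass_charge,
        "metabolites_with_annotation": frac_annotated,
    }
```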

ChristianLieven avatar Oct 13 '17 14:10 ChristianLieven

I think we should have an ideal, perfect model as a reference toy to begin with, probably with a full score of 100. Then we play with changes to reduce the score based on the results of the memote tests. The E. coli core model would probably be a good one; what do you think?

intawat avatar Oct 15 '17 13:10 intawat

I would say that soft tests do not really fail, but rather give a warning. Obviously you would want people to adhere to a global standard (such as BiGG's standard). Actually, some models I used from BiGG weren't annotated correctly to BiGG's own standards. I guess these types of tests either pass or fail (1 or 0); only two outcomes are possible.

I agree that hard tests should have some sort of gradient. The example @ChristianLieven gave should be useful (dividing the outcome by the total number of reactions available), providing a score between 0 and 1 (inverted depending on the question). But how would you specify the thresholds after which a test passes? Or should you even define those? Maybe adding all the outcome values and calculating the percentage of accuracy using the total number of points that could be obtained within the current model.

penuts7644 avatar Oct 16 '17 21:10 penuts7644

But how would you specify the thresholds after which a test passes? Or should you even define those?

Yeah! I was also thinking of this too much in terms of unit testing. Really, what I think people want is just a continuous score, so I wouldn't define any thresholds at all. In fact, I think that for now:

[...] adding all the outcome values and calculating the percentage of accuracy using the total number of points that could be obtained within the current model.

is the way to go indeed!
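
A minimal sketch of that aggregation, assuming each test reports a pair of (points obtained, points obtainable) for the model at hand; the function name and the numbers are made up for illustration.

```python
def percentage_score(results):
    """Aggregate per-test outcomes into a single percentage.

    ``results`` is an iterable of (obtained, obtainable) pairs, e.g.
    (number of reactions with a GPR, total number of reactions).
    """
    obtained = sum(a for a, _ in results)
    obtainable = sum(b for _, b in results)
    return 100.0 * obtained / obtainable if obtainable else 0.0


# Example: three hypothetical test outcomes for a small model.
print(percentage_score([(90, 100), (450, 500), (30, 100)]))  # ~81.4
```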

ChristianLieven avatar Oct 17 '17 11:10 ChristianLieven

I might also like a "known failure" category (cf. the examples in pytest and https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.testing.decorators.knownfailureif.html). They serve a different purpose than discovering accidental breakage: they let you document mis(sing) features, which is useful in summaries, for remembering to update docs when fixed, etc.
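
For reference, pytest already supports this via the xfail marker; a minimal sketch (the test and the `model` fixture are hypothetical, not part of memote's suite):

```python
import pytest


@pytest.mark.xfail(
    reason="Quinone pool is not yet gap-filled in this draft.", strict=False
)
def test_atp_yield_on_glucose(model):
    # Documented as a known failure: the report shows it as 'xfail' instead of a
    # plain failure, and as 'xpass' once the underlying gap is fixed.
    assert model.slim_optimize() > 0.0
```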

jonovik avatar Oct 22 '17 18:10 jonovik

I think this is a relevant issue, because the final assessment of a model (and its comparison with others) will be measured and summarized based on it. Currently, the long list of independent tests passed or failed does not clarify the quality of the model.

Apart from a global weighted score, I would also provide a weighted score per category (basic, biomass, consistency, annotation, syntax). This way, a user could evaluate a model depending on their interest and their objective in using the model, and could compare models in different aspects (maybe one model is better in one category and another model in a different category).

Besides, the independent biomass tests should have a global or average score collecting the results for all the biomass functions. I mean, one score for 'test_biomass_consistency', another score for 'test_biomass_precursors_default_production', etc. The same goes for all the tests repeated for different sources, such as 'test_detect_energy_generating_cycles'.
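
One possible shape for such per-category scores, purely as a sketch with a made-up result structure, arbitrary weights, and invented test/parametrisation IDs:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical results: (category, test name[parametrisation], score in [0, 1]).
results = [
    ("biomass", "test_biomass_consistency[BIOMASS_core]", 1.0),
    ("biomass", "test_biomass_consistency[BIOMASS_WT]", 0.8),
    ("biomass", "test_biomass_precursors_default_production[BIOMASS_core]", 0.9),
    ("consistency", "test_stoichiometric_consistency", 1.0),
    ("annotation", "test_metabolite_annotation_presence", 0.4),
]


def category_scores(results):
    """Average all tests (and their parametrisations) within each category."""
    per_category = defaultdict(list)
    for category, _, score in results:
        per_category[category].append(score)
    return {category: mean(scores) for category, scores in per_category.items()}


def overall_score(scores, weights):
    """Optional global score as a weighted mean of the category scores."""
    total = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / total


scores = category_scores(results)
print(scores)  # per-category view
print(overall_score(scores, {"consistency": 3, "biomass": 3, "annotation": 1}))
```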

beatrizgj avatar Oct 24 '17 09:10 beatrizgj

I like the concept of scoring, but fear that scores that are too general may lead to meaningless comparison between reconstructions, or misguided curation efforts (e.g. curating to achieve a high score, rather than curating to make the reconstruction as predictive/representative as possible for the intended purpose).

To illustrate the point, let's consider some of the "easy to score" tests that @ChristianLieven brought up (sorry to pick on them, I realize these were just quick examples):

The fraction of blocked reactions (in rich medium) should be low.

While unblocking these reactions might improve performance, and removing them may reduce the size of the reconstruction, penalizing their inclusion may harm the use of the reconstruction as a knowledgebase. For example, if a particular understudied organism has a pathway that is blocked because it contains a novel, unidentified reaction that connects it to the rest of the network, the reconstruction's score might be penalized for including this blocked pathway, even though identification/characterization of the blocked portions may represent important biological knowledge. If the goal is to improve such a reconstruction in an iterative fashion, I don't think that having such a metric contribute to the overall score for an organism will encourage that.

The fraction of reactions without GPR should be low.

Similar to the logic above, this might discourage authors from including reactions for which there is substantial experimental evidence, yet no known gene.

I think the best compromise is the suggestion that @beatrizgj made, i.e. that there should be category-specific scores. I favor leaving an overall score out entirely, although I realize that makes it more difficult to communicate the overall quality of the reconstruction. Maybe presenting only the results of the 'Hard' tests as the overall score might work better (e.g. if there's a mass balance issue, I can't think of any way that penalizing the overall score would hurt the science).

I also particularly like the idea of known failures, @jonovik. Framing some of the test results like that could guide/prioritize future curation efforts in a way that I think is more constructive than reporting a continuous score.

gregmedlock avatar Oct 26 '17 14:10 gregmedlock

I get your points, @gregmedlock. Favouring misguided curation efforts is a very real possibility when providing a score. Summarising the thoughts so far:

  • (A) There are tested aspects whose score is not easily quantifiable, because they depend on the preferred use or the underlying biology of a metabolic reconstruction ('soft tests').
  • (B) Opposed to that, there is a set of tests that can be quantified quite well, as they purely depend on modelling paradigms (let's call them 'hard tests').
  • (C) Similarly opposed to that, there are perhaps the tests that run against provided experimental data, which are context-dependent 'hard tests'.

I would also like to point out that with memote's two fundamental workflows we're looking at separate problems:

  • For the Snapshot report, i.e. the Benchmark for editors/reviewers, I think a single score on an immutable set of tests is essential to enable fast decision-making.
  • For the 'Complete Development Cycle' workflow, i.e. continuous reconstruction, I agree that it would be helpful to allow users to 'customize' their test suite, allow them to indicate 'known failures', disable certain tests (those that are not applicable to the draft yet), or inject and run custom tests, specific to their organism and their project goal, while at the same time ensuring that the core set is untouchable.

So, the way I see it, really we are faced with several problems here:

  1. We first ought to define (B) (see #254 for that) and then construct a reasonable base score for it.
  2. I would like us to at least consider possibilities for extending this base score with some simplified metrics that somehow summarise the 'soft tests'. If this isn't possible, we can still make all the metrics available in the report as general statistics.

ChristianLieven avatar Oct 26 '17 15:10 ChristianLieven

After some discussion with @gregmedlock and Jason today, we've collected some thoughts on the presence of an "overall" score:

Overall Score:

  • We want to avoid ‘senselessly’ forcing users to adhere to one specific standard. Flexibility (and creativity) may be required to capture novel phenomena.
  • Cultural differences/competitiveness in the field may elicit different responses to a score: it could be interpreted either as motivating or as intimidating, and it may invite undifferentiated “blanket statements”.
  • A score-centric report may further aggravate an undifferentiated interpretation of the results in question by putting a specific score into a user’s head from the start.

In response to that, I've opened issue #526.

ChristianLieven avatar Nov 13 '18 15:11 ChristianLieven