
Metric calculation is bogus

Open nanounanue opened this issue 7 years ago • 5 comments

The precision calculation currently takes predictions from several as-of dates and computes precision across all of them together, which produces bogus results. We need to look at computing it for each as-of date separately and then aggregating, or something more reasonable.
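
To make the difference concrete, here is a minimal sketch using pandas and scikit-learn (the DataFrame columns and values are illustrative only, not triage's actual schema):

import pandas as pd
from sklearn.metrics import precision_score

# Toy predictions spanning three as-of dates (illustrative values only).
preds = pd.DataFrame({
    "as_of_date": ["2017-07-01"] * 4 + ["2017-08-01"] * 4 + ["2017-09-01"] * 4,
    "label":      [1, 0, 1, 0,  1, 1, 0, 0,  0, 0, 1, 0],
    "prediction": [1, 1, 0, 0,  1, 0, 1, 0,  1, 0, 1, 1],
})

# Current behavior: pool every as-of date into a single precision calculation.
pooled = precision_score(preds["label"], preds["prediction"])

# Alternative: compute precision per as-of date, then aggregate (e.g. the mean).
per_date = {
    as_of_date: precision_score(group["label"], group["prediction"])
    for as_of_date, group in preds.groupby("as_of_date")
}
aggregated = sum(per_date.values()) / len(per_date)

print("pooled:", round(pooled, 3))
print("per as-of date:", {d: round(p, 3) for d, p in per_date.items()})
print("aggregated:", round(aggregated, 3))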

nanounanue avatar Sep 19 '17 01:09 nanounanue

Not actionable as written. Closing, can reopen with more details if needed.

thcrock avatar Jan 19 '18 21:01 thcrock

Given the following temporal configuration:

temporal_config:
    feature_start_time: '2010-01-04'
    feature_end_time: '2019-01-01'
    label_start_time: '2015-02-01'
    label_end_time: '2019-01-01'

    model_update_frequency: '1y'
    training_label_timespans: ['1month']
    training_as_of_date_frequencies: '1month'

    test_durations: '1y'
    test_label_timespans: ['1month']
    test_as_of_date_frequencies: '1month'

Resulting in the following temporal splits:

[Image: inspections_baseline.png (diagram of the resulting train/test splits)]

As you can see, we will produce 12 different sets of predictions in the test period using the trained model.

Should we get 12 different metric calculations? An array? Just the total one?
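
For concreteness, here is a small sketch of the options; the test split start date is assumed for illustration only, the real dates come from the timechop diagram above:

from datetime import date
from dateutil.relativedelta import relativedelta

# Assumed for illustration: the test split starts 2018-01-01. With
# test_durations '1y' and test_as_of_date_frequencies '1month', that
# gives twelve monthly as-of dates: 2018-01-01 through 2018-12-01.
test_split_start = date(2018, 1, 1)
as_of_dates = [test_split_start + relativedelta(months=i) for i in range(12)]

# Three possible shapes for the evaluation output:
#   1. an array of twelve values, one metric per as-of date;
#   2. a single value aggregated from that array (e.g. its mean);
#   3. a single value pooled over all predictions at once (current behavior).
print(as_of_dates)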

nanounanue avatar Feb 06 '19 18:02 nanounanue

My feeling on this is that there should be a different set of parameters in your temporal config, test_frequency and test_interval or somesuch, that determines how many and which test matrices your model is evaluated on. The test_duration and test_example_frequency would then determine how many and which dates go into a single evaluation (whether combining all of the dates in the way currently done makes sense is, I think, debatable). When we initially wrote the test_duration and test_example_frequency keys, we were thinking of cases where test predictions are also event-based, so each date may be sparsely labeled and combining multiple dates is necessary.
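
Very roughly, something like this (the key names and values are hypothetical, just to illustrate the split, not existing triage config options):

# Hypothetical sketch only: these keys do not exist in triage's config today.
proposed_temporal_config = {
    # How many test matrices the model is evaluated on, and how far apart
    # they are (one evaluation per matrix).
    "test_frequency": "1month",
    "test_interval": "1y",
    # How many and which as-of dates feed a single evaluation within one
    # test matrix (the role test_duration / test_example_frequency play now).
    "test_duration": "1month",
    "test_example_frequency": "1month",
}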

I feel like there are already issues to this effect somewhere.

ecsalomon avatar Feb 07 '19 04:02 ecsalomon

Ah, yes, I said the same thing in #378. Doesn't make me right, just consistent. :)

ecsalomon avatar Feb 07 '19 04:02 ecsalomon

Another thought on this: We are doing evaluations the same way (making one evaluation over all dates) in both test and train. For EWS problems, presumably, this method is equally bogus in both train and test. Should there be a flag to control this behavior?

ecsalomon avatar Feb 07 '19 16:02 ecsalomon