Collection processing and dataframes
A pretty common use case for mir_eval is to iteratively call `mir_eval.[task].evaluate(...)` on a sequence of pairs of reference and estimated annotations, then collect the results into a dataframe for subsequent analysis.
This pattern is so common, in fact, that it may be worth providing some scaffolding to streamline and standardize it.
What I'm thinking is something like the following pattern:
```python
df = mir_eval.collections.evaluate(generator, task='beat', **kwargs)
```
where `generator` is a generator that yields dictionaries containing the fields necessary as input to the given task's evaluator (e.g., `ref_intervals` and `est_intervals` or whatever), optionally an `id` field (otherwise a counter index is constructed while iterating), and `kwargs` are additional keyword arguments for the evaluator.
The resulting `df` would have as columns all of the fields returned by the task's evaluator, and an index keyed on the provided (or generated) `id`.
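A minimal sketch of what such a helper could look like. Everything here is an assumption rather than an existing mir_eval API: the function name `evaluate_collection`, the `id`-popping convention, and the toy evaluator are all hypothetical; a real version would dispatch to the task's `evaluate` function and return a `pandas.DataFrame` directly.

```python
def evaluate_collection(evaluator, generator, **kwargs):
    """Apply `evaluator` to each dict yielded by `generator`.

    Each yielded dict supplies the evaluator's keyword arguments, plus an
    optional 'id' field; if 'id' is missing, a running counter is used
    instead. Returns a list of row dicts, one per item, suitable for
    pandas.DataFrame(rows).set_index('id').
    """
    rows = []
    for counter, item in enumerate(generator):
        item = dict(item)  # copy so we don't mutate the caller's dict
        item_id = item.pop("id", counter)
        # e.g., evaluator could be mir_eval.beat.evaluate
        scores = evaluator(**item, **kwargs)
        rows.append({"id": item_id, **scores})
    return rows
```

The point of returning one row per yielded item is that the dataframe construction stays a one-liner at the call site, while the generator remains free to pull data from wherever it likes (files, mirdata loaders, model outputs).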
Caveats
- Doing this would probably necessitate adding pandas as a dependency. I think this is fine in 2025.
- We may need to build out some scaffolding utilities (to be housed under a `collections` module) to make it easier to build these generators. I don't have a great sense of how this would look yet, but it would probably become clear after prototyping the core functionality and using it a bit.
Hey Brian,
are you referring to this kind of pattern?
```python
ref = dict()  # keys are the track names, values are reference annotations
est = dict()  # keys are the track names, values are estimated annotations
rows = []
for cur_track_name in ref:
    cur_eval = dict()
    cur_eval["track_name"] = cur_track_name
    cur_eval["f1"] = mir_eval.do_some_eval(ref[cur_track_name], est[cur_track_name])
    rows.append(cur_eval)
df = pd.DataFrame(rows)
```
Sure, this is done very frequently. The question is whether we can come up with a design that simplifies this; I currently cannot think of one because the initial boilerplate code is itself so simple.
I personally like to have the loop to build my own dataframe; this is similar to a torch training pipeline versus keras's fit() function, which does these things implicitly. Sometimes I put in more derived metadata, which I then use for the statistics.
However, when you have the data loaders fixed, as in mirdata, you can benefit highly from this. But this should then maybe be made available in these downstream packages instead of mir_eval.
Maybe you can provide a small sketch?
> The question is whether we can come up with a design that simplifies this; I currently cannot think of one because the initial boilerplate code is itself so simple.
💯 I agree that it's not obvious that such a design would be simpler than DIYing it.
Part of my thinking here though is that we may at some point want to consider building up some higher-level collection reporting functionality on top of the stimulus-level evaluation that we currently have. If we do go down that route, then standardizing the data structures for input will become necessary, and that makes a stronger case for some standard collection-level helpers as described above.
> However, when you have the data loaders fixed, as in mirdata, you can benefit highly from this. But this should then maybe be made available in these downstream packages instead of `mir_eval`.
Yes, with the caveat that tools like mirdata really aren't designed to support annotations outside the references provided with the dataset (i.e., model outputs).
> Maybe you can provide a small sketch?
I'll keep thinking on it...
I think the key point is (again, sigh) data standardization here. I agree that this would make many things easier, and it is in fact standard practice in any data migration project I have attended. We have one chance to make the entrance barrier as low as possible in the interface: you should be able to transform your current annotations into a collection. Maybe the collection puts some constraints/assertions on the annotations, e.g., unique timestamps per annotation or a fixed dictionary for chord labels. This needs to be super simple, as simple as the loop above; after all, when people use mir_eval, they already "obey" some rules on the structure of the annotations.
Maybe this collection could then be exported to some JSON-style format... :-)
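To make the constraints idea concrete, here is one possible sketch. All names (`make_collection`, the `times`/`labels` record layout) are hypothetical, just to illustrate the kind of cheap validation a collection wrapper could add, and how a validated collection falls out as JSON-serializable for free:

```python
import json

def make_collection(annotations):
    """Validate a dict of {track_id: {'times': [...], 'labels': [...]}}
    and return it as a normalized collection (plain dicts and lists).

    The constraints checked here are examples only: one label per
    timestamp, and no duplicate timestamps within a track.
    """
    collection = {}
    for track_id, ann in annotations.items():
        times, labels = list(ann["times"]), list(ann["labels"])
        assert len(times) == len(labels), f"{track_id}: length mismatch"
        assert len(set(times)) == len(times), f"{track_id}: duplicate timestamps"
        collection[track_id] = {"times": times, "labels": labels}
    return collection

def export_collection(collection, path):
    """A validated collection is trivially exportable to a JSON-style format."""
    with open(path, "w") as fh:
        json.dump(collection, fh, indent=2)
```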
I get where this idea is coming from, but absent any "collection-level" information/analysis we want to provide, I wonder if we can just provide some example code in the documentation as a pointer, since it's pretty short as-is (and a few lines of custom code also lend a lot of flexibility).
Are there any tasks other than #423 (especially ones we already cover) that require nontrivial collection-level aggregation?
> Are there any tasks other than #423 (especially ones we already cover) that require nontrivial collection-level aggregation?
If we wanted to move ahead on #346, that probably would.
I've also been thinking about proposing a generic labeled interval classification module (eg for instrumentation). sed_eval covers that task, but almost entirely from the collection-level perspective (ie macro averaging), and there isn't at present a great solution for instance-/track-level evaluation (micro-averaging). If we wanted to build the latter, it's inherently a collection problem.