Aggregated agent metric in DataCollection, graph in ChartModule

Open EwoutH opened this issue 2 years ago • 15 comments

Aggregated agent variable

Currently it's not possible to quickly get an aggregated metric of an agent variable. This PR adds a method to the DataCollector class called get_agent_metric, which returns a single value that describes the agent variable based on a statistic.

By default, it takes the mean of all agents' values for that variable. It always reports the variable at the current time step. The function supports all functions in the statistics module, as well as the built-in min(), max(), sum() and len() functions.

To support this:

The statistics module is imported.
An agent_attr_index dictionary is added, which lists the position of each reporter in the _agent_records dictionary.
self.agent_name_index is added, which can be used to look up the reporter for each input variable name.
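The dispatch on a metric name described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the helper name `aggregate` and the `_BUILTIN_REDUCERS` table are assumptions made for illustration, and it works on a plain list of values rather than on the DataCollector's internal records.

```python
import statistics

# Built-in reducers named in the description; any other metric name is
# looked up in the standard-library statistics module.
_BUILTIN_REDUCERS = {"min": min, "max": max, "sum": sum, "len": len}


def aggregate(values, metric="mean"):
    """Reduce a list of per-agent values to a single number (hypothetical helper)."""
    if metric in _BUILTIN_REDUCERS:
        return _BUILTIN_REDUCERS[metric](values)
    reducer = getattr(statistics, metric, None)
    if not callable(reducer):
        raise ValueError(f"Unknown metric: {metric!r}")
    return reducer(values)


neighbours = [0, 0, 1, 3]
print(aggregate(neighbours))            # default: mean
print(aggregate(neighbours, "median"))
print(aggregate(neighbours, "max"))
```

Looking the name up at call time keeps the list of supported statistics in sync with whatever the installed Python version provides, at the cost of a less explicit API.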

Example

A model called model1 is created, whose agents have an agent reporter in the DataCollector called "Neighbours":

        self.datacollector = DataCollector(
            model_reporters={"Agents": lambda m: m.schedule.get_agent_count()},
            agent_reporters={"Neighbours": "neighbours"},
        )

The new get_agent_metric() function can now be used to get an aggregate-level statistic of the number of neighbours of the agents:

model1.datacollector.get_agent_metric("Neighbours")
0.8984375
model1.datacollector.get_agent_metric("Neighbours", "min")
0
model1.datacollector.get_agent_metric("Neighbours", "median")
0.0
model1.datacollector.get_agent_metric("Neighbours", "max")
3

Plotting agent variables

The ChartModule is also updated to support displaying agent variables. If it can't find a variable in the model variables, it checks whether it is present in the agent variables, and if so, adds it to the chart.

Example

In a Game of Life model I built, the DataCollector looks like this:

        self.datacollector = DataCollector(
            model_reporters={"Agents": lambda m: m.schedule.get_agent_count()},
            agent_reporters={"Neighbours": "neighbours"},
        )

The server contains two charts, one with the Agents, which is a model variable, and one with Neighbours, an agent variable.

chart1 = ChartModule([{"Label": "Agents", "Color": "Black"}], data_collector_name="datacollector")
chart2 = ChartModule([{"Label": "Neighbours", "Color": "Black"}], data_collector_name="datacollector")

server = ModularServer(LifeModel, [grid, chart1, chart2], "Game of Life", {"p": 0.12, "width": 40, "height": 40})

On the main branch, only the first chart is displayed correctly. With this PR, both are.
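The fallback described above (model variables first, then agent variables) could be sketched as follows. This is an illustration only: the function name `latest_chart_value` and the layout of `agent_vars` are assumptions, not the actual ChartModule or DataCollector internals, and the mean is used as the reducer because it is the default in this PR's description.

```python
import statistics


def latest_chart_value(label, model_vars, agent_vars):
    """Return the newest value to plot for `label` (hypothetical helper).

    model_vars: {label: [value per step]}
    agent_vars: {label: [[value per agent] per step]}  (assumed layout)
    """
    if label in model_vars:
        return model_vars[label][-1]
    if label in agent_vars:
        # Agent variables hold many values per step, so reduce them first.
        return statistics.mean(agent_vars[label][-1])
    raise KeyError(f"{label!r} is neither a model nor an agent variable")


model_vars = {"Agents": [10, 12]}
agent_vars = {"Neighbours": [[0, 1], [2, 4]]}
print(latest_chart_value("Agents", model_vars, agent_vars))
print(latest_chart_value("Neighbours", model_vars, agent_vars))
```

Checking the model variables first means a name that exists in both dictionaries silently resolves to the model variable, which is the name-collision concern raised later in this thread.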

@tpike3, @rht and others, I would love your feedback on this PR! Please consider performance, the naming of variables and functions and API stability. Also please let me know if (and where) tests and documentation should be added.

Jan 23 '22 21:01 EwoutH

Codecov Report

Merging #1145 (51df4fe) into main (4a79705) will decrease coverage by 1.17%. The diff coverage is 25.00%.

@@            Coverage Diff             @@
##             main    #1145      +/-   ##
==========================================
- Coverage   89.30%   88.13%   -1.18%     
==========================================
  Files          19       19              
  Lines        1253     1289      +36     
  Branches      256      259       +3     
==========================================
+ Hits         1119     1136      +17     
- Misses         98      116      +18     
- Partials       36       37       +1

Impacted Files                                     Coverage Δ
mesa/visualization/modules/ChartVisualization.py   70.96% <20.00%> (-20.70%) ↓
mesa/datacollection.py                             88.11% <28.57%> (-9.59%) ↓
mesa/space.py                                      94.91% <0.00%> (-1.00%) ↓
mesa/batchrunner.py                                92.28% <0.00%> (+0.72%) ↑


Jan 23 '22 21:01 codecov[bot]

@tpike3 Do you have time to review this PR?

@rht Any more comments?

Jan 30 '22 08:01 EwoutH

self.agent_name_index is redundant with self.agent_reporters.
I find the extra agent_attr_index construct to be unnecessary. You can define an aggregate measure of agent variables within the framework of the existing model-level data collection.

Jan 30 '22 08:01 rht

self.agent_name_index is redundant with self.agent_reporters.

Good catch, can't believe I missed that. I found it already weird that there wasn't such a dictionary, but there was. I fixed it in 0d1aeef.

I find the extra agent_attr_index construct to be unnecessary. You can define aggregate measure of agent variables within the framework of the existing model-level data collection.

The dictionary is created to keep track of which metric is collected where in the list of _agent_records. Unfortunately, that information isn't easily viewable for the user. Of course you can dig deep into the code to find out what each number in [1, 2, 4.5, 6.4, 3.8] stands for, but that should be easier, or handled by the back end, as this approach does.

In any case, I don't think it has a big performance impact, and it does make the code a bit more resilient if agent_reporters are defined in an unusual way.
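To illustrate the kind of name-to-position lookup being discussed, here is a minimal sketch. It assumes each agent record is a tuple of the form (step, agent_id, value_1, value_2, ...) with values in the same order as the reporter dictionary; that layout and the helper name `build_attr_index` are assumptions for illustration and may not match the code in this PR.

```python
def build_attr_index(agent_reporters, offset=2):
    """Map each reporter name to its position in a record tuple.

    Assumes every record looks like (step, agent_id, value_1, value_2, ...),
    so the first reported value sits at position `offset`.
    """
    return {name: offset + i for i, name in enumerate(agent_reporters)}


agent_reporters = {"Neighbours": "neighbours", "Wealth": "wealth"}
index = build_attr_index(agent_reporters)

record = (1, 7, 3, 4.5)  # step 1, agent 7, 3 neighbours, wealth 4.5
neighbours = record[index["Neighbours"]]
print(index, neighbours)
```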

But if you suggest another implementation, I'm open to incorporating it!

Jan 31 '22 17:01 EwoutH

I wasn't referring to _agent_records. I was referring to storing the values in model_vars. The aggregated agent metric is a model-wide measure, a summary of individual agent properties.

Feb 01 '22 00:02 rht

@Ewout, in general I think this is a good idea, but it is hard to see how to implement it.

First, an unhelpful philosophical rabbit hole: this is pretty profound. To put it in my own terms: what is the right setup to optimize user ease as the general population becomes more technically literate? This is a constant issue for me right now, and it speaks to @EwoutH's point: what setup allows users to easily and intuitively see key parts of the model?

Second, some thoughts that will hopefully be helpful:
@EwoutH, I didn't get a chance to play with it and really understand it. But, to @rht's point, can you describe the difference between model_vars and agent_attr_index? Couldn't you just put the metrics in model_vars, or even in the dataframe (which has some interesting ease-of-use and cost dynamics)?

To a specific question, the testing would go in the data_collector

Hope this helps.

Feb 03 '22 10:02 tpike3

Looked a bit more into it. They are indeed identical, but currently model_vars is used for model reporter variables, and agent_attr_index for agent reporter variables. I think it's better to keep them separate, just for the case in which there is a model variable and agent variable with the same name that are both collected.

I renamed it to agent_vars however, to make their similarity more clear.

So with my current skill set, I think this is the best implementation I can do. Then the question is: is this good enough in terms of performance and maintainability? If so, I can add tests and update the docs further.

If not, @rht would you be open to re-implementing this functionality from a clean sheet?

Feb 07 '22 11:02 EwoutH

Any aggregate metric, by definition, is a model-level variable. The examples you showed in https://github.com/projectmesa/mesa/pull/1145#issue-1111995305 can be put in model_vars. Avoiding data collection key name collision is a separate problem. If anything, calling it agent_vars is misleading, because once again, those are model-level vars.

You should stick to the existing API whenever possible. Adding more machinery will cause the library to be more complex and harder to learn.

I would do something like this:

def get_neighbors_min(model):
    neighbors = model.datacollector.get_last_agent_report("Neighbors")
    return min(neighbors)

# Later on, in the DataCollector initialization
model_reporters={"neighbors_min": get_neighbors_min, ...}

This way, the user is the one responsible for naming the model-level var, and there is no key collision at all.
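A self-contained sketch of this pattern is shown below. Note that `get_last_agent_report` is the name used in the comment above, not a confirmed DataCollector method; here it is implemented on a stand-in class (`RecordingCollector`, hypothetical) so the example runs without Mesa.

```python
class RecordingCollector:
    """Stand-in collector for illustration; not Mesa's DataCollector."""

    def __init__(self):
        # {reporter name: [[value per agent] for each step]}
        self.agent_history = {}

    def record(self, name, values):
        self.agent_history.setdefault(name, []).append(list(values))

    def get_last_agent_report(self, name):
        """Hypothetical helper: values collected in the most recent step."""
        return self.agent_history[name][-1]


class ToyModel:
    def __init__(self):
        self.datacollector = RecordingCollector()


def get_neighbors_min(model):
    neighbors = model.datacollector.get_last_agent_report("Neighbors")
    return min(neighbors)


model = ToyModel()
model.datacollector.record("Neighbors", [2, 0, 3])  # step 1
model.datacollector.record("Neighbors", [1, 4, 2])  # step 2
print(get_neighbors_min(model))
```

In this sketch the full per-agent history stays in `agent_history`; the model-level reporter only reads the latest step.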

Feb 07 '22 14:02 rht

Thanks for your comment, I now understand your issue.

The current architecture is as follows:

1. DataCollector collects a variable from all agents each timestep, keeping all values.
2. get_agent_metric aggregates the values from all agents into a single value.
3. ChartVisualization plots this single value each timestep.

What you are suggesting is merging steps 1 and 2, if I understand correctly. While this has the advantage that it can simplify code and reduce the amount of information stored, it does throw away a lot of data that could be analysed afterwards.

Feb 10 '22 09:02 EwoutH

No data are thrown away. See my example. I took the agent-level vars from an existing, separately-defined agent reporter.

Feb 10 '22 13:02 rht

@rht @tpike3 @jackiekazil Maybe we could give this PR/idea another spin. I think the main questions are:

Do we want aggregated agent metrics (for plotting) in Mesa?
If so, what would a clean implementation look like that (preferably) doesn't break backwards compatibility?

Oct 07 '22 22:10 EwoutH

Following our discussion in the dev meeting earlier today, you might need to consider the case where different types of agents may have different attributes. As discussed, the implemented interface could be used as a default where all types of agents are assumed to share a common attribute.
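One way to handle agent types with different attributes is to skip agents that lack the requested attribute before aggregating. The sketch below is only an illustration of that idea; `collect_attr` is a hypothetical helper and the agents are plain `SimpleNamespace` stand-ins, not Mesa agents.

```python
from types import SimpleNamespace


def collect_attr(agents, attr_name):
    """Gather attr_name from the agents that define it, skipping the rest."""
    return [getattr(a, attr_name) for a in agents if hasattr(a, attr_name)]


agents = [
    SimpleNamespace(neighbours=2),   # e.g. a cell agent
    SimpleNamespace(neighbours=0),
    SimpleNamespace(energy=5.0),     # a different agent type without `neighbours`
]
values = collect_attr(agents, "neighbours")
print(values, min(values))
```

Whether silently skipping agents is the right default is a design choice; raising an error for missing attributes would be the stricter alternative.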

Oct 29 '22 16:10 wang-boyu

Another major concern that I had is how to differentiate agent_reporters from model_reporters? How do we tell the users when to use agent_reporters vs. when to use the other?

Without this PR what I'll do would probably look very much like what was mentioned in https://github.com/projectmesa/mesa/pull/1145#issuecomment-1031524167.

As an alternative yet similar example:

def get_min_neighbors(model):
    return min([getattr(agent, "neighbors") for agent in model.schedule.agents])  # or model.grid.agents or other similar places

and use this in model_reporters.

The question is, is this generic enough to be provided as an API to the users? For instance we can have:

def get_agent_metric(model, attr_name, metric="mean"):
    values = [getattr(agent, attr_name) for agent in model.schedule.agents]  
    if metric in ["min", "max", "sum", "len"]:
        # similar to what was implemented in this PR
        result = ...
    else:
        result = ...
    return result

so that the users can do something like:

from functools import partial

model_reporters={
    "neighbors_min": partial(get_agent_metric, attr_name="neighbors", metric="min"),
    ...
}
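For reference, one possible way to fill in the elided branches is sketched below. How the `else` branch should behave is an assumption here (it looks the name up in the standard-library statistics module, following the PR description); the stand-in model built from `SimpleNamespace` only mimics the `model.schedule.agents` attribute used above.

```python
import statistics
from functools import partial
from types import SimpleNamespace

_BUILTINS = {"min": min, "max": max, "sum": sum, "len": len}


def get_agent_metric(model, attr_name, metric="mean"):
    values = [getattr(agent, attr_name) for agent in model.schedule.agents]
    if metric in _BUILTINS:
        reducer = _BUILTINS[metric]
    else:
        # Assumption: any other name refers to a statistics function.
        reducer = getattr(statistics, metric)
    return reducer(values)


# Stand-in model exposing only `schedule.agents`.
agents = [SimpleNamespace(neighbors=n) for n in (0, 1, 3)]
model = SimpleNamespace(schedule=SimpleNamespace(agents=agents))

model_reporters = {
    "neighbors_min": partial(get_agent_metric, attr_name="neighbors", metric="min"),
    "neighbors_mean": partial(get_agent_metric, attr_name="neighbors"),
}
results = {name: reporter(model) for name, reporter in model_reporters.items()}
print(results)
```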

Personally I don't think this is really needed, since the users can fairly easily define their own functions.

How about the agent_reporters like in this PR, i.e.

self.data_collector = DataCollector(
    model_reporters={"Agents": lambda m: m.schedule.get_agent_count()},
    agent_reporters={"Neighbours": "neighbours"},
)

vs.

self.data_collector = DataCollector(
    model_reporters={
        "Agents": lambda m: m.schedule.get_agent_count(),
        "Neighbours": get_min_neighbors,
    }
)

Again I don't really see the need to introduce agent_reporters here.

Oct 29 '22 16:10 wang-boyu

On second thought, it might be useful when the users need to define lots of similar functions, such as:

self.data_collector = DataCollector(
    model_reporters={
        "Agents": lambda m: m.schedule.get_agent_count(),
        "Min Neighbours": get_min_neighbors,
        "Mean Neighbours": get_mean_neighbors,
        "Max Neighbours": get_max_neighbors,
    }
)

In this case it could be easier for the users to have a common interface such as agent_reporters or the get_agent_metric function mentioned previously, so that they don't have to rewrite lots of short functions. Sorry that I missed this point which was mentioned in the PR.
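As a sketch of how such boilerplate could be avoided without any new API, the reporters can be generated from a small table of reducers. The factory `make_reporter` and the `REDUCERS` table are hypothetical names, and the model is again a `SimpleNamespace` stand-in that only provides `schedule.agents`.

```python
import statistics
from types import SimpleNamespace

REDUCERS = {"Min": min, "Mean": statistics.mean, "Max": max}


def make_reporter(attr_name, reducer):
    """Build a model-level reporter that reduces one agent attribute."""
    def reporter(model):
        return reducer([getattr(a, attr_name) for a in model.schedule.agents])
    return reporter


model_reporters = {
    f"{label} Neighbours": make_reporter("neighbours", reducer)
    for label, reducer in REDUCERS.items()
}

# Stand-in model for demonstration.
agents = [SimpleNamespace(neighbours=n) for n in (0, 2, 4)]
model = SimpleNamespace(schedule=SimpleNamespace(agents=agents))
report = {name: fn(model) for name, fn in model_reporters.items()}
print(report)
```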

Oct 29 '22 16:10 wang-boyu

On second thought, it might be useful when the users need to define lots of similar functions, such as:

self.data_collector = DataCollector(
    model_reporters={
        "Agents": lambda m: m.schedule.get_agent_count(),
        "Min Neighbours": get_min_neighbors,
        "Mean Neighbours": get_mean_neighbors,
        "Max Neighbours": get_max_neighbors,
    }
)

This is exactly what I see students (and myself, sometimes) do all the time. The main use case I see for this feature is that you want to collect all the agent data for proper statistical analysis later, but you also want some quick values for eyeball validation and visualisation.

If I want to do that with the current datacollector possibilities, I have to define both agent and model reporters, or write custom code to transform the agent data to the thing I want.

Also, I think that there should be a really easy way to plot a general statistic, like the mean of an agent variable, with real-time visualisation. Heck, NetLogo has done this for 20+ years.

Maybe some of the Solara stuff leapfrogs this, but these use cases should be included in my opinion:

Get quick aggregate values
Plot aggregate metrics in real time

(both while still collecting full agent data for proper analysis)

Oct 24 '23 11:10 EwoutH
