
feat: Implement experimental DataCollector API

Open rht opened this issue 1 year ago • 24 comments

This is an attempt to implement the API as discussed in https://github.com/projectmesa/mesa/discussions/1944. I figure it is easier to comment on a PR than on a linear GH thread.

I have constrained the implementation to be as simple as possible; as such, a feature like retrieving multiple attributes

"pos":collect(model.agents, ["x", "y"],), # retrieve multiple attributes

is not implemented, because the API would then get unnecessarily bigger, need more testing, and have a bigger surface area for bugs and gotchas. Edit1: at least not until the initial small API has become well tested.

In this implementation, instead of {name1: collect(collection, func1), name2: collect(collection, func2)}, it is {collection: {name1: func1, name2: func2}}.
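
For example, a spec in this shape could look roughly like the following (illustrative only; the exact constructor signature isn't settled, and calculate_gini is a user-defined helper):

datacollector = DataCollector(
    {
        model: {"gini": lambda m: calculate_gini(m.agents.get("wealth"))},  # model-level
        "agents": {"wealth": "wealth"},  # agent-level attribute
    }
)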

Note: this comment has been edited; see Edit1 above.

rht avatar Jan 28 '24 08:01 rht

Performance benchmarks:

Model | Size | Init time [95% CI] | Run time [95% CI]
--- | --- | --- | ---
Schelling | small | 🔵 -0.1% [-0.4%, +0.2%] | 🔵 +0.1% [-0.1%, +0.2%]
Schelling | large | 🔵 -0.3% [-1.0%, +0.5%] | 🔵 -0.8% [-1.7%, +0.1%]
WolfSheep | small | 🔵 +0.4% [+0.0%, +0.8%] | 🔵 +0.3% [+0.2%, +0.4%]
WolfSheep | large | 🔵 +0.3% [-0.9%, +1.4%] | 🔵 +0.6% [-0.3%, +1.4%]
BoidFlockers | small | 🔵 +0.1% [-0.5%, +0.7%] | 🔵 +1.0% [+0.4%, +1.5%]
BoidFlockers | large | 🔵 -0.9% [-1.2%, -0.5%] | 🔵 +0.5% [-0.0%, +1.0%]

github-actions[bot] avatar Jan 28 '24 08:01 github-actions[bot]

The implementation may not use the observer pattern, but at least it allows parallel evolution of the API design, so that we can merge this once there is a consensus, and implement #1145 on top of the new API.

rht avatar Jan 28 '24 11:01 rht

Thanks for picking this up. Leaving the API aside for now, I notice that in your code, you try to solve everything within a single datacollector class. This is different from my thinking. Let me try to articulate it here. Once I have some time, I'll also try to give a draft implementation.

  1. I want a DataCollector class. This is effectively a container of Collectors and the primary point of interaction for the user.
  2. I want a Collector class. This class would be responsible for gathering the data from a single object. This class would also be responsible for returning a data frame/series when requested.
  3. It might be necessary to have multiple Collector classes because, for example, how you interact with the model object is different from how you interact with an AgentSet.
  4. I was considering using a factory method, collect, for constructing the appropriate Collector instance based on the provided arguments.
  5. It might be possible to have collectors that operate on other collectors, but I am unsure about this.

I doubt solving the data collection problems within a single class is possible. It is bound to violate the single responsibility principle and produce code that is difficult to read.
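
For concreteness, a very rough sketch of the structure I have in mind (names and signatures are placeholders, not a proposal):

class Collector:
    """Gathers data from a single object (the model, an AgentSet, ...)."""

    def __init__(self, name, obj, attrs=None, func=None):
        self.name = name
        self.obj = obj
        self.attrs = attrs
        self.func = func
        self._data = []

    def collect(self):
        # gather the requested attributes and/or apply func to obj, append to self._data
        ...

    def to_frame(self):
        # return everything collected so far as a DataFrame/Series
        ...


class DataCollector:
    """Container of Collectors; the primary point of interaction for the user."""

    def __init__(self, *collectors):
        self._collectors = {c.name: c for c in collectors}

    def collect(self):
        for collector in self._collectors.values():
            collector.collect()

    def __getitem__(self, name):
        return self._collectors[name]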

quaquel avatar Jan 28 '24 19:01 quaquel

This PR is mainly about discussing the API. The backend implementation can be refactored later.

  1. I was considering using a factory method, collect, for constructing the appropriate Collector instance based on the provided arguments.

I have commented on this in the PR description:

In this implementation, instead of {name1: collect(collection, func1), name2: collect(collection, func2)}, it is {collection: {name1: func1, name2: func2}}.

Because the latter is more concise.

rht avatar Jan 28 '24 20:01 rht

I personally find nested dicts virtually unreadable.

In my original proposal, I saw each Collector as having a name. The user can retrieve each collector by this name from the CollectorContainer/DataCollector. Next, a Collector is nothing but the retrieval of one or more things from an object. This object can be the model, an agentset, or whatever. Hence, I wanted to keep the object and what is being collected together.

I agree that you can potentially end up in a situation where you want to collect multiple things from the same object. But that is relatively easy to handle within a Collector class. At least as long as you only want to retrieve attributes.

Worrying about conciseness is relevant, but not at the expense of clarity and at least some consideration of the underlying implementation.

quaquel avatar Jan 28 '24 20:01 quaquel

Worrying about conciseness is relevant, but not at the expense of clarity and at least some consideration of the underlying implementation.

Regarding clarity, let's compare them side by side. Without collect (the mental model is that you specify a collector by two keys: the collection/group and the name):

collectors = {
    model: {
        "n_quiescent": lambda model: len(
            model.agents.select(
                agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
            )
        ),
        "gini": lambda model: calculate_gini(model.agents.get("wealth")),
    },
    get_citizen: {"condition": "condition"},
    "agents": {"wealth": "wealth"},
}

With collect

collectors = {
    "n_quiescent": collect(
        model,
        lambda model: len(
            model.agents.select(
                agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
            )
        ),
    ),
    "gini": collect(model, lambda model: calculate_gini(model.agents.get("wealth"))),
    "condition": collect(get_citizen(), "condition"),
     "wealth": collect(model.agents, "wealth"),
}

In what way is the latter clearer? In both cases, the storage is still conceptually a DF with 2 indexes: the group/collection and the name.
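
Purely to illustrate the intended storage shape (a sketch; whether group and name end up as row or column levels is an implementation detail, and agent-level entries hold one value per agent):

import pandas as pd

# one frame per run, entries addressed by (group/collection, name), one row per step
columns = pd.MultiIndex.from_tuples(
    [("model", "n_quiescent"), ("model", "gini"), ("agents", "wealth")],
    names=["group", "name"],
)
df = pd.DataFrame(columns=columns)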

rht avatar Jan 28 '24 20:01 rht

For me, the second is more straightforward to read because it is flat.

quaquel avatar Jan 28 '24 20:01 quaquel

In case you are concerned about the flat/nested structure, how about

collectors = {
    ("n_quiescent", model): lambda model: len(
        model.agents.select(
            agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
        )
    ),
    ("gini", model): lambda model: calculate_gini(model.agents.get("wealth")),
    ("condition", get_citizen): "condition",
    ("wealth", lambda: model.agents): "wealth",
}

?

rht avatar Jan 28 '24 20:01 rht

Proposal 4: separating group selectors from collectors

groups = {
    "quiescents": lambda: model.agents.select(
        agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
    ),
    "citizens": lambda: model.get_agents_of_type(Citizen),
}
collectors = {
    ("n_quiescent", "quiescents"): len,
    ("gini", model): lambda model: calculate_gini(model.agents.get("wealth")),
    # Edit: a better way to do the former:
    ("gini", "agents"): lambda agents: calculate_gini(agents.get("wealth")),
    ("condition", "citizens"): "condition",
    ("wealth", "agents"): "wealth",
}

rht avatar Jan 28 '24 21:01 rht

I have to think on that. For example, I am unsure how to read the last lambda statement.

Note, however, where we agree. Data collection involves

  1. an object
  2. something to collect from this object
  3. and/or an operation to apply to the object / what was collected in 2
  4. a name by which whatever is collected will be known.

So it seems we need at least a Collector class:

from typing import Any, Callable, List

class Collector:
    def __init__(self, name: str, obj: Any, attrs: str | List[str], func: Callable | None = None):
        ...

If we want brevity, we might make name optional and default it to the attribute name when no name is provided.
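
For example (illustrative only, assuming name is given a default):

# explicit name, collecting a single attribute
Collector(name="wealth_per_agent", obj=model.agents, attrs="wealth")

# name omitted: defaults to the attribute name, i.e. "wealth"
Collector(obj=model.agents, attrs="wealth")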

To be clear, this Collector class is a building block in the overall architecture. Not the actual API that the user would need to interact with.

added after seeing proposal 4:

Yes, I think you are on to something here. Basically, your groups are contextual objects (i.e., they change over time) that you want the data collector to operate on. So yes, we might need this.

quaquel avatar Jan 28 '24 21:01 quaquel

For example, I am unsure how to read the last lambda statement.

The reason I added a lambda instead of plain model.agents is that the latter returns an immutable AgentSet, and model.agents might have more/fewer agents since the previous data collection.
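
In other words (sketch):

snapshot = model.agents         # evaluated once, when the spec is defined: a fixed AgentSet
current = lambda: model.agents  # re-evaluated at every collect call, so it reflects agents added/removed since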

So it seems we need at least a Collector class:

My proposal 4 further splits the Collector class into a GroupSelector and a collector function. The reason is that I want to reuse the GroupSelector across various collectors without having to define a named function, which would be less concise.

rht avatar Jan 28 '24 21:01 rht

Meta: we now have 4 different pairs of terms:

  • space: Cell & "Collection"
  • time: Agent & "Set"
  • observation: Collector & "Group"
  • parallel universe: Runner & "Batch" (or configuration?)

I was wondering whether I should have called it collector and collection, instead of collector and group.

rht avatar Jan 28 '24 21:01 rht

Fair enough, I guess group is indeed a set or a collection.

another question related to your remark

There is a problem with this line: since this function is defined within a model's __init__,

At the moment, data collection is defined within the model init. For me, this has always been a bit strange. A model runs. Data collection is conceptually external to this. This is just a weird idea for discussion's sake but what about

model = SomeModel()

# Setup data collection
data_collector = DataCollector("whatever the API will look like")

for _ in range(100):
    model.step()
    data_collector.collect()

At least in this way, you have a clean separation of concerns.

quaquel avatar Jan 28 '24 21:01 quaquel

@quaquel I tried to implement https://github.com/projectmesa/mesa/pull/2013#issuecomment-1913726193 (i.e. separate the data collector spec and object from the model __init__) in the experimental Schelling example, but encountered a roadblock: the current JupyterViz accepts only the model class -- it assumes that the data collector spec is soldered into the model spec. However, the data collector spec is rather small:

        self.datacollector = mesa.DataCollector(
            {"happy": "happy"},  # Model-level count of happy agents
        )

That said, I find it reasonable to separate the data collector spec for larger models.

rht avatar Jan 28 '24 21:01 rht

Fair enough, I guess group is indeed a set or a collection.

another question related to your remark

There is a problem with this line: since this function is defined within a model's __init__,

At the moment, data collection is defined within the model init. For me, this has always been a bit strange. A model runs. Data collection is conceptually external to this. This is just a weird idea for discussion's sake but what about

model = SomeModel()

# Setup data collection
data_collector = DataCollector("whatever the API will look like")

for _ in range(100):
    model.step()
    data_collector.collect()

At least in this way, you have a clean separation of concerns.

I also had this idea somewhere that, similar to our batch_run function, we add a model_run function, where you can attach a datacollector and/or stop condition to the model. I agree that the data collector should be separate from the model definition. I disagree that our visualisation module should depend on a data collector; for me, data collection runs and visualisation runs are conceptually different and usually don't depend on the same variables (e.g. mesa-interactive has no such dependency).
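
Something like the following sketch (model_run is hypothetical here, not an existing Mesa function; the signature only conveys the idea):

def model_run(model, steps, data_collector=None, stop_condition=None):
    """Step the model, collect data each step, and stop early if the condition is met."""
    for _ in range(steps):
        model.step()
        if data_collector is not None:
            data_collector.collect()
        if stop_condition is not None and stop_condition(model):
            break
    return data_collector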

Corvince avatar Jan 28 '24 21:01 Corvince

I disagree that our visualisation module should depend on a data collector; for me, data collection runs and visualisation runs are conceptually different and usually don't depend on the same variables (e.g. mesa-interactive has no such dependency).

In some situations, the viz module has to depend on the data collection output. Schelling's happy agent count is taken from the data collector output, not from a direct model attribute.

This means that the structure for small and large models has to diverge. For a small model, it's OK to solder in the data collection and have JupyterViz detect whether the model contains a datacollector attribute. Otherwise, it looks for an optional argument to fetch the DF from the data collector object(s).
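
Roughly, that detection could look like this (a sketch only; data_collector stands for the hypothetical optional argument, and the actual JupyterViz internals may differ):

if hasattr(model, "datacollector"):
    # small model: data collection soldered into the model
    df = model.datacollector.data_collection.to_df()
elif data_collector is not None:
    # larger model: data collector object passed in explicitly
    df = data_collector.data_collection.to_df()
else:
    df = None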

rht avatar Jan 28 '24 22:01 rht

Any objections to proposal 4?

rht avatar Jan 28 '24 23:01 rht

This means that the structure for small and large models has to diverge.

I disagree with this. In my view, we need to develop a design that scales from small to large models rather than encourage (dirty) shortcuts in small models. The ongoing discussion is really useful for this.

So, conceptually, data collection and visualization are separate. The current practice of hijacking the data collector thus needs to change. It also gives rise to a problem I recently ran into: I had a Jupyter visualization that ran slower and slower because the data frame being displayed in one of the graphs became bigger and bigger. Ideally, you only want to add new data to a visual element rather than replace historic data.

I, however, agree with @rht that sometimes there are model statistics that we both want to store for later analysis and display in a GUI. So, one possibility would be to have some kind of Statistic class. This class is only responsible for tracking some state variables within the model (and possibly doing operations on them). The datacollector mechanism could query these statistics objects on each collect call. That is, the persistent storage of statistics over time is the responsibility of the datacollector. A GUI, likewise, could query these statistics objects for display purposes. If a graph shows the dynamics over time, it is the responsibility of the GUI element to handle that. Not the responsibility of the Statistics object.
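
Very roughly, and only to illustrate the division of responsibilities (Statistic is a placeholder name):

class Statistic:
    """Tracks one state variable (or a value derived from it); holds only the current value."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.value = None

    def update(self, model):
        self.value = self.func(model)
        return self.value


# The datacollector queries Statistic objects on each collect call and owns the
# persistent storage over time; a GUI queries the same objects for display and
# handles any over-time plotting itself.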

quaquel avatar Jan 29 '24 07:01 quaquel

Definitely not my best work, but I want to throw this PR in, for inspiration:

  • #1145

EwoutH avatar Jan 29 '24 07:01 EwoutH

I disagree with this. In my view, we need to develop a design that scales from small to large models rather than encourage (dirty) shortcuts in small models. The ongoing discussion is really useful for this.

This is akin to saying Python's print function shouldn't be part of builtins, that the user has to go through a ceremony like C's #include <stdio.h> or Java's imports because it is I/O, not a core language building block. In reality, Python does have sys.stdout.write in addition to print, with the latter being the beginner-friendly interface.

But in the end, it just boils down to how the datacollector's collected data should be accessed: either from a model attribute or from an argument passed to JupyterViz. It is mainly a convention and doesn't affect the backend implementation that much. I can at least remove the hardcoding of model.datacollector.data_collection.to_df() later on.

It also gives rise to a problem I recently ran into: I had a Jupyter visualization that ran slower and slower because the data frame being displayed in one of the graphs became bigger and bigger. Ideally, you only want to add new data to a visual element rather than replace historic data.

The canned statistics functions in #1145 can help process the simulation state in a way that is single-use only (i.e., not stored, unless by a state replayer).

rht avatar Jan 29 '24 12:01 rht

This is akin to saying Python's print function shouldn't be part of builtins,

No, print is built on top of sys.stdout.write for convenience. I am advocating for doing exactly the same: building a proper structure first and then providing convenience functions that cover commonly encountered use cases that are built on top of the proper structure.

quaquel avatar Jan 29 '24 12:01 quaquel

No, print is built on top of sys.stdout.write for convenience.

At least we are on the same page with the convenience of accessing a data collector via a model, for simple examples.

I am advocating for doing exactly the same: building a proper structure first and then providing convenience functions that cover commonly encountered use cases that are built on top of the proper structure.

This is not the full story. While I don't know how the concept of DataFrame came to be in R, at least I can see that the API of the DataFrame in pandas grew organically to suit usage needs, and that eventually faster backends (pyarrow instead of NumPy) were implemented. API and backend architecture may evolve semi-independently (as I said in https://github.com/projectmesa/mesa/pull/2013#issuecomment-1913567262). What matters is that the API should be designed in a way that doesn't restrict the backend possibilities.

If mistakes are made in the API, there is always Mesa 4.0 for further iteration (after all, semver is designed to deal with breaking API changes, not with architecture).

rht avatar Jan 29 '24 18:01 rht

At least we are on the same page with the convenience of accessing a data collector via a model, for simple examples.

I would not do that, even for simple models.

I want a convenient high-level API to specify data collection for simple cases (e.g., attributes from sets of agents and the model, end-of-life data from agents, simple callables operating on agents), which is built on a powerful set of classes and functions that users can use and extend for their own more complex models.

I also want to have a clean separation between data collection for later analysis and anything to do with user interfaces.

quaquel avatar Jan 29 '24 19:01 quaquel

I implemented proposal 4, https://github.com/projectmesa/mesa/pull/2013#issuecomment-1913721427, because it is strictly better than the original proposal in terms of deduplicating the way users specify the groups. The example in the docstring has been updated accordingly.

rht avatar Feb 03 '24 08:02 rht

Since we have concentrated the discussion in #1944, should we close this PR?

quaquel avatar Feb 24 '24 13:02 quaquel
