AgentSet: Allow selecting a fraction of agents in the AgentSet

Open EwoutH opened this issue 1 year ago • 6 comments

This PR adds a p parameter to the select method in the AgentSet class, allowing users to specify a fraction of agents to be selected from the set.

Motive

The existing select method only allowed selection based on a fixed number (n) of agents or a filter function. The addition of the p parameter enhances flexibility by enabling selection based on a percentage of the total agents, addressing scenarios where relative selection is more appropriate than absolute selection.

Implementation

The select method was updated to include an optional p parameter (defaulting to 1.0). If p is specified and less than 1.0, the method calculates the corresponding number of agents (n) to select as a fraction of the total. The code was modified to ensure compatibility with the existing functionality, including adjustments to the conditions that determine when the method should return early.
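
A minimal sketch of the conversion described above, using a plain list instead of a real AgentSet. The helper name `select_fraction` is hypothetical and not part of Mesa, and the use of `round` is an assumption about how the fraction is turned into a count.

```python
def select_fraction(agents, p=1.0):
    """Return the first round(len(agents) * p) items.

    Hypothetical stand-in for AgentSet.select(p=...), for illustration only.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    items = list(agents)
    if p == 1.0:
        # Nothing to cut off, so return everything (the early-return case).
        return items
    n = round(len(items) * p)  # assumed rounding rule
    return items[:n]


agents = list(range(10))
print(select_fraction(agents, p=0.2))  # [0, 1]
```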

Usage Examples

# Select 20% of agents from the AgentSet
selected_agents = agents.select(p=0.2)

Together with #2254, you can now set a value for a fraction of your AgentSet:

# Select 40% of the agents from the AgentSet
model.agents.select(p=0.4).set('has_license', True)

This feature is particularly beneficial in models where the agent set size may vary, and proportional selection is required.

Additional Notes

Basically this is a shortcut for:

n_agents = round(len(some_agentset) * p)
some_agentset.select(n=n_agents)

So it breaks chaining if you have to do this. Directly being able to select a fraction allows you to continue the chain.

EwoutH avatar Aug 28 '24 10:08 EwoutH

Performance benchmarks:

github-actions[bot] avatar Aug 28 '24 10:08 github-actions[bot]

what is the motivation for adding this to the agentset?

quaquel avatar Aug 28 '24 10:08 quaquel

That seems useful, thanks!

The only worry I have is how this behaves if a user specifies both n and p. That probably should raise an error?

Or maybe there is a good name that could incorporate both p and n? So if it is between 0 and 1 use a fraction and if it is a whole number above 1 use that number?

Corvince avatar Aug 28 '24 10:08 Corvince

what is the motivation for adding this to the agentset?

Sorry, was still working on other features (and my actual model), wrote it up.

That seems useful, thanks!

The only worry I have is how this behaves if a user specifies both n and p. That probably should raise an error?

Yeah I was thinking about that. Maybe just don't do that (and we mention it in the docstring)?

If you just want to select a fraction of n, you can do n=round(n*p), so having both doesn't make sense.

Or maybe there is a good name that could incorporate both p and n? So if it is between 0 and 1 use a fraction and if it is a whole number above 1 use that number?

Very interesting idea, but maybe in this case explicit is better than implicit. Except if you can come up with a killer name.

EwoutH avatar Aug 28 '24 10:08 EwoutH

I like the clarity of p. So my suggestion would be to raise a value error if both n and p are passed
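
A small sketch of that check, assuming both arguments default to `None`. The function name `resolve_count` is made up for illustration and is not Mesa's API.

```python
def resolve_count(total, n=None, p=None):
    """Turn either an absolute n or a fraction p into a count.

    Hypothetical helper: raises if the caller passes both n and p.
    """
    if n is not None and p is not None:
        raise ValueError("Pass either n or p, not both")
    if p is not None:
        return round(total * p)  # assumed rounding rule
    if n is not None:
        return min(n, total)
    return total  # neither given: no limit


print(resolve_count(10, n=3))    # 3
print(resolve_count(10, p=0.4))  # 4
```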

quaquel avatar Aug 28 '24 10:08 quaquel

see the few minor comments and once unit tests are added, this is good to go.

quaquel avatar Aug 28 '24 11:08 quaquel

Okay, I:

  • Changed p to fraction
  • Used the ValueError
  • Updated the other docstring, including notes
  • Added tests
  • Updated the examples

However, I noticed that there's an important difference between n and fraction. n is always fixed; it's just an upper limit. With fraction, it does matter when you apply it: before or after the rest of the selection.

Currently fraction is interpreted as a fraction of the input AgentSet. When writing the usage examples, that felt really counterintuitive. It would be more logical if you could apply it afterwards, such that a fraction of the selected AgentSet is returned.

Why? Because if you take these two use cases:

  • Select the agents with "wealth" less than 5 but at most 20% of total agents
  • Select the agents with "wealth" less than 5, and then 20% of those agents

The latter is used way more than the former. And it will be way more logical if you select by type.

So I would suggest applying fraction afterwards, on the selected AgentSet after all other operations are done. Then you could still do both:

# Select the agents with "wealth" less than 5, and at most 20% of total agents
agents.select(fraction=0.2).select(lambda agent: agent.wealth < 5)

# Select the agents with "wealth" less than 5, and then 20% of those agents
agents.select(lambda agent: agent.wealth < 5, fraction=0.2)
# or, equivalently:
agents.select(lambda agent: agent.wealth < 5).select(fraction=0.2)

But now the one that's used more and is more intuitive will work by default.
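
To make the "before or after" difference concrete, here is a sketch with plain Python lists rather than an AgentSet. The `Agent` dataclass and the `take_fraction` helper are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    wealth: int


# Ten agents, ordered from wealth 9 down to wealth 0.
agents = [Agent(wealth=w) for w in range(9, -1, -1)]


def take_fraction(items, fraction):
    """Keep the first round(len(items) * fraction) items (hypothetical)."""
    return items[: round(len(items) * fraction)]


def is_poor(agent):
    return agent.wealth < 5


# Fraction first: the first 20% of all agents, then the filter.
before = [a for a in take_fraction(agents, 0.2) if is_poor(a)]
# Filter first: 20% of the agents that pass the filter.
after = take_fraction([a for a in agents if is_poor(a)], 0.2)

print(len(before), len(after))  # 0 1
```

With this ordering, the first 20% contains only wealthy agents, so applying the fraction first returns nothing, while applying it afterwards returns one agent.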


Entirely different options could be:

  • Don't allow fraction and/or n with other functions, but enforce chaining
  • Introduce a new method, like sample, that gives a sample of n or a sample of fraction.

EwoutH avatar Aug 28 '24 18:08 EwoutH

what is the motivation for adding this to the agentset?

@EwoutH I'm also wondering about this. Not saying that this shouldn't be in the library, but a concrete example could give some illustration. Is this used in your project?

rht avatar Aug 28 '24 20:08 rht

This was the thing I wanted to do:

# Randomly select 40% of the agents from the AgentSet and give them a license
model.agents.shuffle().select(fraction=0.4).set('has_license', True)

I needed to do this:

n_license = round(len(model.agents) * license_chance)
model.agents.shuffle().select(n=n_license).do(lambda agent: setattr(agent, 'has_license', True))

With #2254 it got simplified to:

n_license = round(len(model.agents) * license_chance)
model.agents.shuffle().select(n=n_license).set('has_license', True)

It's not a huge use case, but it's nice. Especially that you don't need to break the chain.

Combine it with a function and it gets really powerful though. Assume I want to distribute some cars around (I know a certain percentage of all people has a car), but only to agents with licenses.

agents.select(lambda a: a.has_license, fraction=car_chance).set('has_car', True)

Without the fraction, this would have been:

n_car = round(len(model.agents) * car_chance)
model.agents.select(lambda a: a.has_license, n=n_car).set('has_car', True)

So yeah, it's not a huge use case. Maybe it adds some complexity.


There's a unique application for fraction as an upper limit (cap), as currently implemented, and a unique application for applying it afterwards. I need to think about this a bit longer.

EwoutH avatar Aug 28 '24 20:08 EwoutH

Right, n=0 has a special status. With a small fraction or small agentset, n can become 0, returning all agents.
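
A sketch of how this goes wrong, using a toy function that follows the "n=0 means no limit" convention. `select_first` is a stand-in written for this example, not the real select method.

```python
def select_first(items, n=0):
    """Toy version of the convention discussed here: n=0 means 'no limit'."""
    if n == 0:
        return list(items)
    return list(items)[:n]


agents = ["a", "b", "c"]
n = round(len(agents) * 0.1)  # 10% of 3 agents rounds to 0

print(n, select_first(agents, n=n))  # 0 ['a', 'b', 'c']
```

Asking for 10% of three agents silently returns all three, because the computed count collides with the sentinel value.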

EwoutH avatar Aug 28 '24 21:08 EwoutH

Right, n=0 has a special status. With a small fraction or small agentset, n can become 0, returning all agents.

Good catch!

I see two possibilities now. Either just change the special meaning from 0 to -1. I don't know if there was a good use case for 0, but it's rather strange for 0 to indicate all agents.

The more holistic approach would be to split select into a filter function and a sample function. This would also simplify the logic and solve the "before or after" question (which was present but unconsidered before fraction was introduced)

Corvince avatar Aug 29 '24 04:08 Corvince

The brain is so interesting: after a night's sleep you look at it again and think "oh", and it all clicks together.

Now I just have to write it up and rewrite the code, tests and examples.

Can’t wait for 2026/2027, when a bot does that automatically from a voice message.


Long story short: there’s a special use case when filtering, where you want at most a certain number or fraction. The fraction in particular should be applied right there in the function, because after the function is done, you don’t know how large the original AgentSet was.

For all other cases (before, after) a sample method would be perfect (and can be implemented pretty fast I think). sample could also draw a random sample, where select selects the first n/fraction.
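
A rough sketch of what such a split could look like, written against plain lists. The names `filter_agents` and `sample_agents` and their signatures are assumptions for illustration, not an existing Mesa API.

```python
import random


def filter_agents(agents, predicate):
    """Keep only the agents for which predicate(agent) is true."""
    return [agent for agent in agents if predicate(agent)]


def sample_agents(agents, n=None, fraction=None, rng=None):
    """Draw a random sample of n agents, or of a fraction of the agents."""
    if (n is None) == (fraction is None):
        raise ValueError("Pass exactly one of n or fraction")
    pool = list(agents)
    if fraction is not None:
        n = round(len(pool) * fraction)  # assumed rounding rule
    rng = rng or random.Random()
    return rng.sample(pool, min(n, len(pool)))


# Integers stand in for agents; the value plays the role of "wealth".
wealth = [1, 7, 3, 9, 2, 4, 8, 0, 6, 5]
poor = filter_agents(wealth, lambda w: w < 5)
picked = sample_agents(poor, fraction=0.4, rng=random.Random(42))

print(len(poor), len(picked))  # 5 2
```

Because the filter and the sample are separate calls, the caller decides explicitly whether the fraction applies before or after filtering.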

Or maybe there is a good name that could incorporate both p and n? So if it is between 0 and 1 use a fraction and if it is a whole number above 1 use that number?

Obviously the way to go. I was thinking max, limit, ceiling or at_most.

EwoutH avatar Aug 29 '24 05:08 EwoutH

Looking forward to what @EwoutH comes up with. But I like @Corvince's suggestion of having filter and sample. From a pure performance standpoint, minimizing the looping required for the use cases described here would be really beneficial. Chaining multiple select calls, or shuffle and select, for something like "give me a random sample of 40% of the agents that have a particular attribute" is not ideal. It requires multiple loops where one should be sufficient. With large numbers of agents in particular, this becomes very inefficient.

Of course, a user could do most of this stuff with a well-designed custom function passed to e.g. select, so performance is not the only design concern and there are already clean ways of handling those.

quaquel avatar Aug 29 '24 06:08 quaquel

Agreed on the performance aspect. One way to solve this but keep the chainable approach would be to use generator functions to return iterators instead of the complete AgentSet. But maybe as you said this is all mainly catered towards nice semantics and there are other ways already available for performance critical operations.

Corvince avatar Aug 29 '24 08:08 Corvince

Agreed on the performance aspect. One way to solve this but keep the chainable approach would be to use generator functions to return iterators instead of the complete AgentSet. But maybe as you said this is all mainly catered towards nice semantics and there are other ways already available for performance critical operations.

That's an interesting idea worth exploring at some point (but not this PR). Basically, what if we have a generator interface to an AgentSet? And can we make a chainable API work with generators?
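
As a rough illustration of the idea, the sketch below chains plain generator functions so that a filter and a limit run in a single lazy pass. The helpers `where`, `first` and `logged` are hypothetical and unrelated to the actual AgentSet implementation.

```python
from itertools import islice


def where(agents, predicate):
    """Lazily yield the agents that satisfy predicate."""
    for agent in agents:
        if predicate(agent):
            yield agent


def first(agents, n):
    """Lazily yield at most n agents."""
    return islice(agents, n)


seen = []


def logged(values):
    """Record every value that is actually pulled from the source."""
    for value in values:
        seen.append(value)
        yield value


# Integers stand in for agents; the source has a million of them.
pipeline = first(where(logged(range(1_000_000)), lambda w: w % 2 == 0), 3)

print(list(pipeline))  # [0, 2, 4]
print(len(seen))       # 5
```

Only five items are pulled from the source before the pipeline stops. A fraction of the total would still need the total size up front, so this sketch covers only an absolute limit.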

quaquel avatar Aug 29 '24 09:08 quaquel

It seems we keep coming back to this (https://github.com/projectmesa/mesa/pull/2220#issuecomment-2297117745), so it’s certainly worth exploring at some point.

EwoutH avatar Aug 29 '24 09:08 EwoutH

Agreed on the performance aspect. One way to solve this but keep the chainable approach would be to use generator functions to return iterators instead of the complete AgentSet. But maybe as you said this is all mainly catered towards nice semantics and there are other ways already available for performance critical operations.

That's an interesting idea worth exploring at some point (but not this PR). Basically, what if we have a generator interface to an AgentSet? And can we make a chainable API work with generators?

I think having an __iter__ method is kind of enough, so

(agent for agent in agentset)

should already give you an iterator over the agentset. Definitely worth exploring that more, but certainly way out of scope for this PR

//Edit Ah, sorry, didn't think this through. Definitely needs more thought on the possibility to make this chainable. This of course only iterates over the agents themselves.

Corvince avatar Aug 29 '24 10:08 Corvince

I updated this PR to replace n with max.

max (int | float, optional): The maximum number of agents to select. Defaults to infinity.

  • If an integer of 1 or larger, the first n matching agents are selected.
  • If a float between 0 and 1, at most that fraction of the original agents are selected.

Some details:

  • max=1 will give one agent, max=1.0 gives all agents.
  • A fallback for n was added, which does max = n and throws a warning.
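
The int-versus-float rule above can be sketched as follows. `resolve_limit` is a hypothetical helper, `limit` stands in for the `max` keyword, and `round` is used because that is what the PR does at this point in the thread.

```python
import math


def resolve_limit(limit, total):
    """Translate an int-or-float limit into a number of agents.

    Hypothetical helper; `limit` stands in for the `max` keyword above.
    """
    if isinstance(limit, float):
        if limit == math.inf:
            return total  # default: no limit
        if not 0.0 <= limit <= 1.0:
            raise ValueError("a float limit must be between 0 and 1")
        return round(total * limit)  # fraction of the original agents
    if limit < 1:
        raise ValueError("an integer limit must be 1 or larger")
    return min(limit, total)


print(resolve_limit(1, 10))         # 1
print(resolve_limit(1.0, 10))       # 10
print(resolve_limit(0.3, 10))       # 3
print(resolve_limit(math.inf, 10))  # 10
```

The first two calls show why the type matters: the integer 1 selects a single agent, while the float 1.0 selects all of them.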

Tests are updated. Please double check the internal agent_generator function.

If we decide this is the way to go, I will update the PR description.


I plan on adding a separate sample() function that implements max in the same way, including with a shuffle=True option. Fun fact: sample(n, shuffle=True) will be equivalent to NetLogo's up-to-n-of. @quaquel I know you hate NetLogo with all your heart, but sometimes you can learn a lot from them ;).

But that would be a separate PR.

EwoutH avatar Aug 29 '24 15:08 EwoutH

I am unsure about using a single keyword for both the number and the percentage, but I won't object to it either. I would change the name, however. max shadows the name of a built-in.

It would be nice to see a quick overview of what the API is now becoming just for clarity.

sample(n, shuffle=True) will be equivalent to NetLogo's up-to-n-of. @quaquel I know you hate NetLogo with all your heart, but sometimes you can learn a lot from them ;).

I hate the language, but, yes, we can pick up useful ideas and give them a better name. sample is much better than that weird construct with hyphens in the name 😉.

quaquel avatar Aug 30 '24 05:08 quaquel

I was thinking max, limit, ceiling or at_most.

Any suggestions (either these or another)?

EwoutH avatar Aug 30 '24 06:08 EwoutH

I like at_most the best. It conveys that "n" can be arbitrarily large, but that the number of returned agents need not match it. It also sort of implies that you first apply a filter and then take a sample. And it also makes the rounding clear for fractions. So 1/3 of 5 (1.67) will be 1 agent, otherwise it would be more than 1/3.

Corvince avatar Aug 30 '24 07:08 Corvince

So 1/3 of 5 (1.67) will be 1 agent, otherwise it would be more than 1/3.

Currently it does round, do you think it shouldn't?

EwoutH avatar Aug 30 '24 07:08 EwoutH

If it's an upper limit, I think it should always round down/floor.

Corvince avatar Aug 30 '24 07:08 Corvince

Difficult one. Because if you describe it as "selecting a fraction" I would expect it to select the closest match.

I think in many practical scenarios the closest selection to the fraction you wanted is most logical.

EwoutH avatar Aug 30 '24 07:08 EwoutH

If we go with at_most, it should round down in the case of fractions. Otherwise, the name and behavior don't match.

quaquel avatar Aug 30 '24 07:08 quaquel

Valid argument for "selecting a fraction", but for selecting "at most" 33% I would not expect it to select 40%.
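
The numbers behind this example, as a quick check in plain Python (nothing here is Mesa-specific):

```python
import math

total = 5
fraction = 1 / 3

rounded = round(total * fraction)       # nearest whole number of agents
floored = math.floor(total * fraction)  # never exceeds the fraction

print(rounded, rounded / total)  # 2 0.4
print(floored, floored / total)  # 1 0.2
```

Rounding to the nearest integer selects 2 of 5 agents (40%), which exceeds the requested third; flooring selects 1 of 5 (20%), which stays below it.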

Corvince avatar Aug 30 '24 07:08 Corvince

If we go with at_most, it should round down in the case of fractions. Otherwise, the name and behavior don't match.

Exactly. That's why I think it's a good name (if we floor), because people will always have different expectations for "selecting a fraction" with respect to rounding.

Corvince avatar Aug 30 '24 07:08 Corvince

I renamed max to at_most, made sure it rounded down, and updated the tests.

EwoutH avatar Aug 30 '24 08:08 EwoutH

PR description is updated, including the usage examples

EwoutH avatar Aug 30 '24 08:08 EwoutH

@projectmesa/maintainers ready to go? (would like to merge myself)

EwoutH avatar Aug 30 '24 10:08 EwoutH