Data Collection Framework
Requirements
- Pluggable (similar to chars, artifacts, weapons)
  - gcsim currently takes an approach where we define a tmpl struct that includes default implementations of each routine needed
- WASMable
  - I believe this boils down to memory efficiency; we can't push too much data back to the WASM controller from the stats controller
  - There may also be type restrictions here, but I think if it can encode to JSON, we should be okay
- Memory efficient
Initial Draft/Thoughts:
- Register a new Stat type in core, core.addStat
- Stat types should always seek to store as little information as possible. Ideally each stat can be compressed with some sort of accumulator
- If a stat needs to track multiple values, the stat type should define a routine to run at the end of each iteration and simplify what is returned. Ideally this would compress to a single integer value. If necessary, this could be an array
- Stats ought to be a consequence of events in gcsim
- If I’m calculating DPS avg for each character, I would register a new charstat type struct with key “DpsAvg”
- DpsAvg would register an onDamage listener with core. The event handler would track the damage value and character index by calling charstat.push(charIdx, dmg)
- charstat.push should write to a [][]float64 prop, tracking damage instance values per character
- At the end of a sim_iter, DpsAvg can run a pre-aggregation to determine the avg dps per character for this sim_iter. We do not need to keep each individual value in memory or return it.
- To calc the final dps avg, we simply take the avg of the avgs (see the sketch after this list)
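A minimal sketch of what such a DpsAvg accumulator could look like. The event wiring (the onDamage listener, core.addStat) is assumed to exist elsewhere; the type and method names here are illustrative only, not gcsim's actual API.

// per-character DPS average via per-iteration pre-aggregation
type dpsAvg struct {
    current   [][]float64 // damage instances per character, current iteration only
    sumOfAvgs []float64   // running sum of per-iteration averages per character
    iters     int
}

func newDpsAvg(numChars int) *dpsAvg {
    return &dpsAvg{
        current:   make([][]float64, numChars),
        sumOfAvgs: make([]float64, numChars),
    }
}

// push would be called from the onDamage event handler
func (d *dpsAvg) push(charIdx int, dmg float64) {
    d.current[charIdx] = append(d.current[charIdx], dmg)
}

// endIteration compresses the iteration down to one value per character so
// the individual damage instances never have to be kept or returned
func (d *dpsAvg) endIteration() {
    for i, vals := range d.current {
        if len(vals) > 0 {
            var sum float64
            for _, v := range vals {
                sum += v
            }
            d.sumOfAvgs[i] += sum / float64(len(vals))
        }
        d.current[i] = d.current[i][:0] // reuse the backing array next iteration
    }
    d.iters++
}

// result is the avg of the per-iteration avgs
func (d *dpsAvg) result() []float64 {
    out := make([]float64, len(d.sumOfAvgs))
    if d.iters == 0 {
        return out
    }
    for i, s := range d.sumOfAvgs {
        out[i] = s / float64(d.iters)
    }
    return out
}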
So what I had in mind was probably slightly different. Rather than registering a new structure, I was thinking more of something like the following:
type StatCollector interface {
    Init() error           // to be called at setup (if needed)
    Aggregate() SomeStruct // to be called when the sim finishes
}
Then in the core, similar to the character function here: https://github.com/genshinsim/gcsim/blob/5f577ba277e2d029dab170070cf3fc1e0b5a35f6/pkg/core/register.go#L14
We could have something like:
type NewStatCollector func(core *Core) error

func RegisterStatCollector(name string, f NewStatCollector) {
    //do stuff
}
In this way, we can create individual packages that fulfill the StatCollector interface and register themselves with the core (just like how characters and weapons do it). When the core gets initialized, it can then call all the StatCollectors that have been registered, passing in itself. Each StatCollector, now having access to *core.Core, can then use the event system to register listeners for whatever events it is interested in and handle data collection in whatever format it so chooses.
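As a rough illustration of that self-registration pattern, a hypothetical collector package might look like the following. RegisterStatCollector and the constructor signature come from the snippet above; the event name, the Events.Subscribe signature, and the layout of the event args are assumptions for illustration only.

package totaldmg

import "github.com/genshinsim/gcsim/pkg/core"

// placeholder for whatever Aggregate() ends up returning
type result struct {
    TotalDamage float64
}

func init() {
    core.RegisterStatCollector("total-damage", newCollector)
}

type collector struct {
    total float64
}

// matches the proposed NewStatCollector signature; how the sim later gets
// hold of this instance to call Aggregate() is left open in the proposal
func newCollector(c *core.Core) error {
    col := &collector{}
    // subscribe to whatever events this collector cares about; the event
    // name, Subscribe signature, and args layout are assumed here
    c.Events.Subscribe(core.OnDamage, func(args ...interface{}) bool {
        col.total += args[2].(float64) // assumed position of the damage value
        return false
    }, "stats-total-damage")
    return nil
}

func (c *collector) Init() error { return nil }

func (c *collector) Aggregate() result {
    return result{TotalDamage: c.total}
}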
Then finally, when the sim finishes, the sim can call the Aggregate() function, which will return some sort of structure that contains the data the StatCollector has collected. I'm not really sure what this would look like though...
Note that in the above I have the StatCollector registering itself to each simulation effectively. There's no reason why the same StatCollector can't be used for multiple runs though. Since the implementation is left up to each collector, each collector just has to make sure it has some way of tracking and handling different sims potentially running at the same time.
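One possible way to fill in the //do stuff placeholder above, assuming the registry only stores constructors and each core/sim instance creates its own collectors (which is what keeps concurrent runs from interfering). This is a sketch, not the actual gcsim implementation.

// the registry only holds constructors; nothing sim-specific lives here
var statRegistry = map[string]NewStatCollector{}

func RegisterStatCollector(name string, f NewStatCollector) {
    if _, ok := statRegistry[name]; ok {
        panic("duplicate stat collector registered: " + name)
    }
    statRegistry[name] = f
}

// called once per core/sim instance during initialization; because every
// sim gets its own collector instances, concurrent runs do not share state
func (c *Core) initStatCollectors() error {
    for _, newCollector := range statRegistry {
        if err := newCollector(c); err != nil {
            return err
        }
    }
    return nil
}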
(From a discussion on discord)
thinking about the stat collection a bit more, here is what I'd probably lean towards doing (granted I have not familiarized myself with the code base still):
Create Stat objects that subscribe to events and will perform whatever calculation you want for a given iteration. These Stat objects are isolated/instantiated for each iteration. With this you have a generic way to output whatever stats you want for each iteration. IE: have some TotalDamageStat that'll subscribe to damage events and create a sum of all damage. A set of Stats running results in a "row" of stats data being output with each iteration.
From here you then have two options, each with their own pros/cons:
Option A: Define Aggregator classes that are instantiated at simulation start. As iterations complete, the stats output/row of data gets passed to your aggregators where you merge and reduce the results. IE: You can have an AvgTotalDamageAggregator which will take each iteration's TotalDamageStat and "merge" them into a single average value once all iterations are complete (there is a way with this approach you can define generic aggregators to maximize reusability, but that's an extra complexity I won't go into here).
Option B: Retain each row of stats for each iteration in your final simulation output. If you run 1000 iterations then you final output should have 1000 rows of stats data. Then whatever reads the simulation results can then run whatever calculations it wants on the stats dataset bundled with these results. IE: For average total damage, define some calc that just loops over all rows and averages the total damage.
Of these two options, the 2nd is the easier one to implement. It also has the added advantage of always having the raw data available. If you add more calculations down the line, as long as they do not depend on new Stats, you can run those calculations on existing results without having to re-simulate. The obvious major downside with this is that your sim results are now bloated with all this extra data, which scales with how many iterations are run. With the 1st you do not have this problem, just the added complexity of implementation, and results will be limited to whatever aggregators existed at the time of execution.
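A rough sketch of Option A with hypothetical names (Option B would simply skip the aggregator and append each iteration's row to the final sim output instead):

// one "row" of stats produced per iteration
type IterationStats map[string]float64

// TotalDamageStat lives for a single iteration only
type TotalDamageStat struct {
    sum float64
}

func (s *TotalDamageStat) OnDamage(dmg float64) { s.sum += dmg }

// Flush writes this iteration's value into the row
func (s *TotalDamageStat) Flush(row IterationStats) { row["total_damage"] = s.sum }

// AvgTotalDamageAggregator lives for the whole simulation and merges rows
type AvgTotalDamageAggregator struct {
    sum   float64
    count int
}

func (a *AvgTotalDamageAggregator) Add(row IterationStats) {
    a.sum += row["total_damage"]
    a.count++
}

func (a *AvgTotalDamageAggregator) Result() float64 {
    if a.count == 0 {
        return 0
    }
    return a.sum / float64(a.count)
}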
Ok so if I understand it correctly, each Stat object (what I called StatCollector above) would collect data per iteration into a basic row format as opposed to some kind of structured data. Obviously each column in the row would correspond to something that each Stat object (or StatCollector) would understand.
I guess my question is, is there any advantage to using a basic row format instead of some kind of structured data? The row format would not be understandable outside of each Stat object anyways.
Instead we can just have the Aggregator method take any as param and leave it up to each Stat object to deal with whatever format it's in?
I guess my question is, is there any advantage to using a basic row format instead of some kind of structured data?
Both a generic row or some structured output would work; it's really just a design decision on whatever is more maintainable relative to the rest of the codebase. Regardless, the output of all Stat/StatCollector instances should ideally be coalesced into a single entity, be it a generic row or some predefined struct. The Aggregator/calculations should be able to take the entire datum as input and not have any context/awareness of which Stat produced which values.
Aggregator/calc operations should be considered independent from the Stat/StatCollector. Changes to the Stat should not impact your Aggregator/calc in any way as long as the data and structure of the final output is maintained. Conversely, decisions on what to calculate should not be tied to how the data is produced. IE: I should be able to create a calc where I can determine the mean, min, max, pXX, etc of total damage per particle even if total_damage and particle_count were produced by different Stats.
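To illustrate that decoupling, such a calc might look something like the sketch below. It only sees the merged per-iteration rows and never references the Stats that produced total_damage or particle_count; the IterationStats row type is carried over from the earlier sketch and is an assumption.

// mean/min/max of total damage per particle, computed over retained rows
func damagePerParticle(rows []IterationStats) (mean, min, max float64) {
    for i, row := range rows {
        v := row["total_damage"] / row["particle_count"]
        if i == 0 || v < min {
            min = v
        }
        if i == 0 || v > max {
            max = v
        }
        mean += v
    }
    if len(rows) > 0 {
        mean /= float64(len(rows))
    }
    return
}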
Instead we can just have the Aggregator method take any as param and leave it up to each Stat object to deal with whatever format it's in?
This may be golang specific and I am not sure how golang handles dynamic types, but in theory that should be fine as long as 1) the Aggregator/calc has enough information to correctly read the data off the rows and 2) it does not require the Aggregator/calc to reference the Stat in order to extract the necessary information.
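On the Go side this would mean the Aggregator takes any (an alias for interface{} since Go 1.18) and uses a type assertion or type switch to read the datum. A hypothetical sketch, with the expected row shape being an assumption:

type Aggregator interface {
    Add(datum any)
}

type avgTotalDamage struct {
    sum   float64
    count int
}

func (a *avgTotalDamage) Add(datum any) {
    // a type assertion is how Go reads a value back out of any/interface{};
    // the expected shape (a map keyed by stat name) is an assumption here
    row, ok := datum.(map[string]float64)
    if !ok {
        return // skip data this aggregator does not understand
    }
    a.sum += row["total_damage"]
    a.count++
}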
@unleashurgeek is this done via #1359?
Closing this. The design here is basically stale. Can open new issue for any future improvements.