Data Collection Framework
Requirements
- Pluggable (similar to chars, artifacts, weapons)
  - gcsim currently takes an approach where we define a tmpl struct that includes default implementations of each routine needed
- WASMable
  - I believe this boils down to memory efficiency; we can't push too much data back to the WASM controller from the stats controller
  - There may also be type restrictions here, but I think if it can encode to JSON, we should be okay
- Memory efficient
Initial Draft/Thoughts:
- Register a new Stat type in core, core.addStat
- Stat types should always seek to store as little information as possible. Ideally each stat can be compressed with some sort of accumulator
- If a stat needs to track multiple values, the stat type should define a routine to run at the end of each iteration and simplify what is returned. Ideally this would compress to a single integer value. If necessary, this could be an array
- Stats ought to be a consequence of events in gcsim
- If I’m calculating DPS avg for each character, I would register a new charstat type struct with key “DpsAvg”
- DpsAvg would register an onDamage listener with core. The event handler would track the damage value and character index by calling charstat.push(charIdx, dmg)
- charstat.push should write to a [][]float64 prop, tracking damage instance values per character
- At the end of a sim_iter, DpsAvg can run a pre-aggregation to determine the avg dps per character for this sim_iter. We do not need to keep each individual value in memory or return it.
- To calc the final dps avg, we simply take the avg of the avgs (see the sketch after this list)
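A minimal sketch of what such a DpsAvg accumulator could look like. The event wiring (the onDamage listener, core.addStat) is assumed to exist elsewhere; the type and method names here are illustrative only, not gcsim's actual API.

// per-character DPS average via per-iteration pre-aggregation
type dpsAvg struct {
    current   [][]float64 // damage instances per character, current iteration only
    sumOfAvgs []float64   // running sum of per-iteration averages per character
    iters     int
}

func newDpsAvg(numChars int) *dpsAvg {
    return &dpsAvg{
        current:   make([][]float64, numChars),
        sumOfAvgs: make([]float64, numChars),
    }
}

// push would be called from the onDamage event handler
func (d *dpsAvg) push(charIdx int, dmg float64) {
    d.current[charIdx] = append(d.current[charIdx], dmg)
}

// endIteration compresses the iteration down to one value per character so
// the individual damage instances never have to be kept or returned
func (d *dpsAvg) endIteration() {
    for i, vals := range d.current {
        if len(vals) > 0 {
            var sum float64
            for _, v := range vals {
                sum += v
            }
            d.sumOfAvgs[i] += sum / float64(len(vals))
        }
        d.current[i] = d.current[i][:0] // reuse the backing array next iteration
    }
    d.iters++
}

// result is the avg of the per-iteration avgs
func (d *dpsAvg) result() []float64 {
    out := make([]float64, len(d.sumOfAvgs))
    if d.iters == 0 {
        return out
    }
    for i, s := range d.sumOfAvgs {
        out[i] = s / float64(d.iters)
    }
    return out
}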
So what I had in mind was probably slightly different. Rather than registering a new structure, I was thinking more of something like the following:
type StatCollector interface {
    Init() error           // to be called at setup (if needed)
    Aggregate() SomeStruct // to be called when the sim finishes
}
Then in the core, similar to the character function here: https://github.com/genshinsim/gcsim/blob/5f577ba277e2d029dab170070cf3fc1e0b5a35f6/pkg/core/register.go#L14
We could have something like:
type NewStatCollector func(core *Core) error

func RegisterStatCollector(name string, f NewStatCollector) {
    //do stuff
}
In this way, we can create individual packages that fulfill the StatCollector interface and register themselves with the core (just like how characters and weapons do it). When the core gets initialized, it can then call all the StatCollectors that have been registered, passing in itself. Each StatCollector, now having access to *core.Core, can then use the event system to register listeners for whatever events it is interested in and handle data collection in whatever format it so chooses.
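As a rough illustration of that self-registration pattern, a hypothetical collector package might look like the following. RegisterStatCollector and the constructor signature come from the snippet above; the event name, the Events.Subscribe signature, and the layout of the event args are assumptions for illustration only.

package totaldmg

import "github.com/genshinsim/gcsim/pkg/core"

// placeholder for whatever Aggregate() ends up returning
type result struct {
    TotalDamage float64
}

func init() {
    core.RegisterStatCollector("total-damage", newCollector)
}

type collector struct {
    total float64
}

// matches the proposed NewStatCollector signature; how the sim later gets
// hold of this instance to call Aggregate() is left open in the proposal
func newCollector(c *core.Core) error {
    col := &collector{}
    // subscribe to whatever events this collector cares about; the event
    // name, Subscribe signature, and args layout are assumed here
    c.Events.Subscribe(core.OnDamage, func(args ...interface{}) bool {
        col.total += args[2].(float64) // assumed position of the damage value
        return false
    }, "stats-total-damage")
    return nil
}

func (c *collector) Init() error { return nil }

func (c *collector) Aggregate() result {
    return result{TotalDamage: c.total}
}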
Then finally, when the sim finishes, the sim can call the Aggregate() function, which will return some sort of structure that contains the data the StatCollector has collected. I'm not really sure what this would look like though...
Note that in the above I have the StatCollector registering itself to each simulation effectively. There's no reason why the same StatCollector can't be used for multiple runs though. Since the implementation is left up to each collector, each collector just has to make sure it has some way of tracking and handling different sims potentially running at the same time.
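One possible way to fill in the //do stuff placeholder above, assuming the registry only stores constructors and each core/sim instance creates its own collectors (which is what keeps concurrent runs from interfering). This is a sketch, not the actual gcsim implementation.

// the registry only holds constructors; nothing sim-specific lives here
var statRegistry = map[string]NewStatCollector{}

func RegisterStatCollector(name string, f NewStatCollector) {
    if _, ok := statRegistry[name]; ok {
        panic("duplicate stat collector registered: " + name)
    }
    statRegistry[name] = f
}

// called once per core/sim instance during initialization; because every
// sim gets its own collector instances, concurrent runs do not share state
func (c *Core) initStatCollectors() error {
    for _, newCollector := range statRegistry {
        if err := newCollector(c); err != nil {
            return err
        }
    }
    return nil
}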
(From a discussion on discord)
thinking about the stat collection a bit more, here is what I'd probably lean towards doing (granted I have not familiarized myself with the code base still):
Create Stat objects that subscribe to events and will perform whatever calculation you want for a given iteration. These Stat objects are isolated/instantiated for each iteration. With this you have a generic way to output whatever stats you want for each iteration. IE: have some TotalDamageStat that'll subscribe to damage events and create a sum of all damage. A set of Stats running results in a "row" of stats data being output with each iteration.
From here you then have two options, each with their own pros/cons:
Option A: Define Aggregator classes that are instantiated at simulation start. As iterations complete, the stats output/row of data gets passed to your aggregators where you merge and reduce the results. IE: You can have an AvgTotalDamageAggregator which will take each iteration's TotalDamageStat and "merge" them into a single average value once all iterations are complete (there is a way with this approach you can define generic aggregators to maximize reusability, but that's an extra complexity I won't go into here).
Option B: Retain each row of stats for each iteration in your final simulation output. If you run 1000 iterations then you final output should have 1000 rows of stats data. Then whatever reads the simulation results can then run whatever calculations it wants on the stats dataset bundled with these results. IE: For average total damage, define some calc that just loops over all rows and averages the total damage.
Of these two options, the 2nd is the easier one to implement. It also has the added advantage of always having the raw data available. If you add more calculations down the line, as long as they do not depend on new Stats, you can run those calculations on existing results without having to re-simulate. The obvious major downside with this is that your sim results are now bloated with all this extra data, which scales with how many iterations are run. With the 1st you do not have this problem, just the added complexity of implementation, and results will be limited to whatever aggregators existed at the time of execution.
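A rough sketch of Option A with hypothetical names (Option B would simply skip the aggregator and append each iteration's row to the final sim output instead):

// one "row" of stats produced per iteration
type IterationStats map[string]float64

// TotalDamageStat lives for a single iteration only
type TotalDamageStat struct {
    sum float64
}

func (s *TotalDamageStat) OnDamage(dmg float64) { s.sum += dmg }

// Flush writes this iteration's value into the row
func (s *TotalDamageStat) Flush(row IterationStats) { row["total_damage"] = s.sum }

// AvgTotalDamageAggregator lives for the whole simulation and merges rows
type AvgTotalDamageAggregator struct {
    sum   float64
    count int
}

func (a *AvgTotalDamageAggregator) Add(row IterationStats) {
    a.sum += row["total_damage"]
    a.count++
}

func (a *AvgTotalDamageAggregator) Result() float64 {
    if a.count == 0 {
        return 0
    }
    return a.sum / float64(a.count)
}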
Ok so if I understand it correctly, each Stat object (what I called StatCollector above) would collect data per iteration into a basic row format as opposed to some kind of structured data. Obviously each column in the row would correspond to something that each Stat object (or StatCollector) would understand.
I guess my question is, is there any advantage to using a basic row format instead of some kind of structured data? The row format would not be understandable outside of each Stat object anyways.
Instead we can just have the Aggregator method take any as param and leave it up to each Stat object to deal with whatever format it's in?
I guess my question is, is there any advantage to using a basic row format instead of some kind of structured data?
Both a generic row or some structured output would work; it's really just a design decision on whatever is more maintainable relative to the rest of the codebase. Regardless, the output of all Stat/StatCollector instances should ideally be coalesced into a single entity, be it a generic row or some predefined struct. The Aggregator/calculations should be able to take the entire datum as input and not have any context/awareness of which Stat produced which values.
Aggregator/calc operations should be considered independent from the Stat/StatCollector. Changes to the Stat should not impact your Aggregator/calc in any way as long as the data and structure of the final output is maintained. Conversely, decisions on what to calculate should not be tied to how the data is produced. IE: I should be able to create a calc where I can determine the mean, min, max, pXX, etc of total damage per particle even if total_damage and particle_count were produced by different Stats.
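To illustrate that decoupling, such a calc might look something like the sketch below. It only sees the merged per-iteration rows and never references the Stats that produced total_damage or particle_count; the IterationStats row type is carried over from the earlier sketch and is an assumption.

// mean/min/max of total damage per particle, computed over retained rows
func damagePerParticle(rows []IterationStats) (mean, min, max float64) {
    for i, row := range rows {
        v := row["total_damage"] / row["particle_count"]
        if i == 0 || v < min {
            min = v
        }
        if i == 0 || v > max {
            max = v
        }
        mean += v
    }
    if len(rows) > 0 {
        mean /= float64(len(rows))
    }
    return
}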
Instead we can just have the Aggregator method take any as param and leave it up to each Stat object to deal with whatever format it's in?
This may be golang specific and I am not sure how golang handles dynamic types, but in theory that should be fine as long as 1) the Aggregator/calc has enough information to correctly read the data off the rows and 2) it does not require the Aggregator/calc to reference the Stat in order to extract the necessary information.
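On the Go side this would mean the Aggregator takes any (an alias for interface{} since Go 1.18) and uses a type assertion or type switch to read the datum. A hypothetical sketch, with the expected row shape being an assumption:

type Aggregator interface {
    Add(datum any)
}

type avgTotalDamage struct {
    sum   float64
    count int
}

func (a *avgTotalDamage) Add(datum any) {
    // a type assertion is how Go reads a value back out of any/interface{};
    // the expected shape (a map keyed by stat name) is an assumption here
    row, ok := datum.(map[string]float64)
    if !ok {
        return // skip data this aggregator does not understand
    }
    a.sum += row["total_damage"]
    a.count++
}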
@unleashurgeek is this done via #1359?
Closing this. The design here is basically stale. Can open new issue for any future improvements.