
Remove combo scaling from Aim and Speed from osu! performance calculation

apollo-dw opened this issue 3 years ago · 14 comments

This proposal removes combo scaling from the Aim and Speed skills in osu!, and overhauls their miss penalties. Currently, the performance calculator does not know where the player missed in a map - for example, whether it was on the hardest section or an easier one. Therefore, the fairest way to treat misses is to assume they are equal across the map, i.e. a 500x/1000x score with 1 miss and a 750x/1000x score with 1 miss should be weighted similarly.

Misses, in general, have been made harsher. The first miss now carries a noticeably harsher penalty, to differentiate between FCs and non-FCs. The scaling is no longer based on the total object count, but instead on the number of relevant difficult strains. This is measured by dividing each strain by the top strain, raising the result to the power of 4, and summing. As a result, maps with consistent difficulty are treated more leniently, and long maps with a short spike in difficulty are treated similarly to shorter maps.
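
To make the shape of this concrete, here's a minimal Python sketch - the constants and the exact penalty curve are illustrative, not necessarily the PR's final values:

```python
import math

def difficult_strain_count(strains):
    """Weight each section's strain against the hardest one; the power
    of 4 means only sections near peak difficulty count for much."""
    top = max(strains)
    return sum((s / top) ** 4 for s in strains)

def miss_penalty(miss_count, difficult_strains):
    """Illustrative penalty: capped below 1 so even the first miss
    separates FCs from non-FCs, and scaled by the number of relevant
    difficult strains rather than the total object count."""
    if miss_count == 0:
        return 1.0
    return 0.96 / (miss_count / (4 * math.log(difficult_strains + 1)) + 1)

# a long map with one short spike: most strains sit far below the top,
# so its difficult strain count lands close to a short map's
spike = [100] * 5 + [30] * 195
consistent = [100] * 200
print(difficult_strain_count(spike))        # ~6.6
print(difficult_strain_count(consistent))   # 200.0
print(miss_penalty(1, difficult_strain_count(spike)))       # ~0.85
print(miss_penalty(1, difficult_strain_count(consistent)))  # ~0.92
```

So under this shape, a single miss costs more on the spiky map than on the consistently difficult one, which is the intended behaviour.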

There have been concerns about consistency supposedly mattering less, and I don't think that's the case. Miss counts are a fine metric to rely on here, and again, using combo doesn't make much sense when we can't tell from the performance calculation side which parts of the map are hard, or where the player missed. I hope the strain count mechanism described in the miss penalty change above can at least put some of those concerns to rest :P

Values (and a very rough Q&A thingy) can be found at https://pp.huismetbenen.nl/rankings/players/apollo_visual

apollo-dw · Dec 29 '21 20:12

I don't think this makes much sense. It seems farming pp maps higher than your skill level is made easier. Very subjective opinion, maybe.

[image: screenshot of the scores in question]

PercyDan54 · Dec 30 '21 15:12

I don't think this makes much sense. It seems farming pp maps higher than your skill level is made easier. Very subjective opinion, maybe.

just for clarity, here are the effects the miss penalty is having on the maps here, compared to their FC values (no misses) - shown as FC pp -> pp with misses (absolute change, % of FC value retained):

make a move 163.4 -> 121.9 (-41.5, 74.6%)
hikari 176.6 -> 119.9 (-56.7, 67.9%)
stay 154.5 -> 110.9 (-43.6, 71.8%)

these plays are at least a full star above one of the FCs there (make a move is 6.11*, and harumachi clover expert is 4.74*), so you can judge for yourself whether those plays with misses are at a similar skill level to the harumachi clover FC. also worth noting that farm maps will be farm either way :P

apollo-dw · Dec 31 '21 00:12

Passer-by here, not really familiar with performance calculation algorithms, but here are some potential issues I see (just in case they've been missed):

(1) How would sliderbreak (sb) counts be evaluated? It seems that you use effectiveMissCount, but from what I see it returns the minimum number of combobreaks possible (if > number of misses), which may severely underestimate sb counts on slider-heavy maps (e.g. I may have a score with 550/1000 combo, 0xmiss and 10xsb, but effectiveMissCount returns 1 - see the behavioural sketch after point (3)). It might be possible to estimate sb count from the number of 100s, but it's another story to estimate how many 100s originate from sbs...

(2) Following (1), because effectiveMissCount is not a continuous function of max combo, the pp value may change abruptly (~10% on several maps I tried) with max combo changing by ±1 (again, assuming 0xmiss). This may cause some confusion and isn't really good imo (the miss penalty was usually buried in combo scaling before, so it wasn't this obvious).

(3) A hard threshold (66%) is used in calculating the strainCount, and I'm not sure how "hackable" this is (it might be possible to create a map with most strain values sitting around 70% of the maximum strain, which would then be considerably overrated; similarly for being underrated). A continuous weighting curve would be safer here, though idk how difficult it is to implement.
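
For point (1), here's a minimal sketch of how a combo-based miss estimate of this shape behaves - the 0.1-per-slider allowance and other constants are illustrative, not necessarily what the live calculator uses:

```python
def effective_miss_count(count_miss, score_max_combo,
                         beatmap_max_combo, slider_count):
    """Guess the minimum number of combo breaks from combo alone.
    With 0 misses and 550/1000 combo this yields ~1.8 (floored: 1),
    even if the player actually sliderbroke 10 times - exactly the
    underestimation described in point (1)."""
    combo_based = 0.0
    if slider_count > 0:
        # allow some dropped slider ends before assuming a combo break
        full_combo_threshold = beatmap_max_combo - 0.1 * slider_count
        if score_max_combo < full_combo_threshold:
            combo_based = full_combo_threshold / max(1.0, score_max_combo)
    return max(float(count_miss), combo_based)

print(effective_miss_count(0, 550, 1000, slider_count=100))  # 1.8
```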

Anyway it's great to see scores with unlucky combobreaks in the middle getting what they deserve. Good job!

cihe13375 · Jan 02 '22 21:01

(1) How would sliderbreak (sb) counts be evaluated? It seems that you use effectiveMissCount, but from what I see it returns the minimum number of combobreaks possible (if > number of misses), which may severely underestimate sb counts on slider-heavy maps (e.g. I may have a score with 550/1000 combo, 0xmiss and 10xsb, but effectiveMissCount returns 1). It might be possible to estimate sb count from the number of 100s, but it's another story to estimate how many 100s originate from sbs...

(2) Following (1), because effectiveMissCount is not a continuous function of max combo, the pp value may change abruptly (~10% on several maps I tried) with max combo changing by ±1 (again, assuming 0xmiss). This may cause some confusion and isn't really good imo (the miss penalty was usually buried in combo scaling before, so it wasn't this obvious).

These two issues are out of the scope of this pull request. This PR is about removing combo scaling; the issues you mention are already present in the live version, and this PR does not introduce them.

aticie · Jan 03 '22 09:01

(3) A hard threshold (66%) is used in calculating the strainCount, and I'm not sure how "hackable" this is (it might be possible to create a map with most strain values sitting around 70% of the maximum strain, which would then be considerably overrated; similarly for being underrated). A continuous weighting curve would be safer here, though idk how difficult it is to implement.

I've implemented this with massive support from @Luminiscental, thanks! Point 1 is unfortunately unsolvable without a sliderbreak metric, and @stanriders might be able to give some insight into point 2.

The strain count is now weighted against the top strain with a power, so that low strains contribute less weight - this should take care of the potential abuse cases now O_O

apollo-dw · Jan 04 '22 19:01

For a visualization and explanation of the formula, see: https://www.desmos.com/calculator/im16otcwnc.

Essentially, instead of using the arbitrary 0.66 value from the original implementation, the threshold is varied and we take the average over it; like the original, this weights low strain values harshly.
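
In code form, one way to realise "vary the threshold and average over it" is a smooth per-strain weight. A sketch - the sigmoid's steepness and midpoint here are picked arbitrarily, not taken from the PR (the actual curve is the one in the Desmos link):

```python
import math

def smooth_difficult_strain_count(strains):
    """Replace the hard 66%-of-top-strain cutoff with a sigmoid weight,
    so clustering a map's strains just above (or below) any particular
    threshold no longer pays off."""
    top = max(strains)
    return sum(1.0 / (1.0 + math.exp(-10.0 * (s / top - 0.66)))
               for s in strains)

# strains at ~70% of the top now count as ~0.6 of a strain each,
# instead of a full 1.0 under the old hard threshold
print(smooth_difficult_strain_count([100] + [70] * 50))  # ~30.9
```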

Luminiscental · Jan 04 '22 21:01

effectiveMissCount not being a continuous function can be easily fixed - right now it's being floored to keep misses simple (the concept of having 1.472 misses is a weird one, after all)
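
Concretely, using the combo-based estimate sketched earlier in the thread (constants still illustrative), the floor is what produces the abrupt steps:

```python
import math

# 1000 max combo, 100 sliders -> full-combo threshold of 990
for combo in (494, 496):
    estimate = 990 / combo
    print(combo, math.floor(estimate), round(estimate, 3))
# 494 -> floored: 2, continuous: 2.004
# 496 -> floored: 1, continuous: 1.996
# flooring makes pp step by a whole miss as combo crosses the boundary
```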

stanriders · Jan 05 '22 11:01

Spreadsheet of PP changes from this: https://docs.google.com/spreadsheets/d/1FQSmD00h1tJKm_KNN62-Ew_kEmCf_OUhAqdVCGMbNW4/edit#gid=1165396648

smoogipoo · Jan 06 '22 16:01

I am now using a per-note strain count rather than a per-section strain count. This should allow for things like using the difficulty strain count for the length bonus instead of the raw object count (which is what I'm intending to experiment with next).
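
As a hedged sketch of that experiment - the curve shape below mirrors the live length bonus as I understand it, and feeding it a strain count instead of the object count is the hypothetical part:

```python
import math

def length_bonus(amount):
    """Length bonus curve; `amount` is the raw object count today, or
    the difficulty strain count under the experiment described above."""
    bonus = 0.95 + 0.4 * min(1.0, amount / 2000.0)
    if amount > 2000:
        bonus += 0.5 * math.log10(amount / 2000.0)
    return bonus

# a 3000-object marathon whose difficulty is one short spike might only
# have a few hundred "difficult" strains, so it stops farming length pp
print(length_bonus(3000))  # ~1.44 (raw object count)
print(length_bonus(400))   # ~1.03 (hypothetical strain count)
```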

I spoke with emu regarding https://github.com/ppy/osu/pull/15035, since both PRs include (different) methods of counting strains - we decided that we can't combine attributes, because his sum formula needs to work in a specific way for its intended purpose. With that, the blocker can be taken off.

apollo-dw · Feb 15 '22 16:02

Apologies for the delay, here's a second spreadsheet of these changes: https://docs.google.com/spreadsheets/d/1zE50eTVDByGBhdy-lhtN9yUwlAAM15CbbnpsonJpCt0/edit#gid=1165396648

smoogipoo · Mar 07 '22 08:03

@smoogipoo this PR is considered approved by the osu! pp committee and is ready for merge.

stanriders · May 27 '24 16:05

@ppy/team-client pinging on request of bdach (sorry for the double ping smoogi, not intended) to make it clear that these changes are ready, but also to provide some direction.

This will require 2 new difficulty attributes as a bare minimum to deploy for lazer. I don't know if it's viable to add any new attributes to the database, but they're quite small, so I'm hoping it might be possible even if just for this rework.

If the added storage requirements aren't workable, then this will have to wait for real-time difficulty calculation, since there's no way to proceed without the new attributes.

To deploy to stable, it's going to require that the score performance processor in osu-queue-score-statistics is in use, to prevent the scenario where a lower-pp (but higher-score) play overwrites a better one, since the old infrastructure overwrites based on score. I know this isn't new information, I'm just reminding.

I'm hoping that, with this rework approved alongside a few other reworks stacked up for review, the focus on solving pp infra concerns and eventually running recalculations will be given higher priority than it has been so far. This rework is pretty huge in value and is resulting in large changes for a lot of users, so it would be great to see it deployed and recalculated sooner rather than later. I'm well aware the infra work is no small task, but it would be nice to see it worked on!

It'd also be nice to get some time estimate on how viable it is for these changes to be deployed and recalculated: if it's going to take a while, we may focus on getting a few other smaller reworks merged in the meantime. Equally, I'm keen to push these changes specifically to deployment and recalculation as soon as realistically possible.

tsunyoku · May 27 '24 16:05

to prevent the scenario where a lower-pp (but higher-score) play overwrites a better one, since the old infrastructure overwrites based on score

I think this is the blocker for the time being. @peppy and @bdach know the process by which scores are un-preserved and deleted better than I do at this point.

smoogipoo · May 27 '24 22:05

Deployment considerations as far as I can tell:

  • Two new difficulty attributes per beatmap/mod combo. This is:
    • 2 rows * 13 bytes per row
    • for 54 mod combos for each beatmap
    • for ~111700 ranked beatmaps (as per data.ppy.sh dumps)
    • totalling ~156 MB of extra storage overhead (arithmetic check below the list)
    • plus the need to do a full run of osu! server-side diffcalc for all beatmaps
  • For the "score overwriting" situation:
    • lazer scores are in a better situation, since right now - to the best of my knowledge - marking scores as non-preserved and deleting them still isn't online, and even if it were, it'd be doing the right thing.
    • stable is indeed the problem, as adding highest pp to the score preservation criteria probably requires changes to web-10 / osu-performance for the quickest outcome (lazer infra just imports whatever web-10 decides to preserve). Or just shutting off osu-performance, with all that that would involve. Would need @peppy's assessment on feasibility and/or willingness to do this.
    • Note that stable is already using the osu-queue-score-statistics processor for user total pp, which means it correctly handles multiple preserved plays (it'll pick the highest-pp one, not the highest-score one). To make this work for stable, we actually just need the highest-pp-but-not-highest-score play to not disappear at the web-10 end.
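
Quick arithmetic check on the storage figure above (beatmap count approximate, as noted):

```python
rows = 2             # new attribute rows per beatmap/mod combo
bytes_per_row = 13
mod_combos = 54
ranked_maps = 111_700

print(rows * bytes_per_row * mod_combos * ranked_maps)
# 156826800 bytes, i.e. ~156 MB
```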

bdach · May 28 '24 07:05

What's going on here? This still has conflicts. Should I look to resolve them or is someone going to handle that?

peppy · Oct 03 '24 13:10

To play devil's advocate, these conflicts 100% weren't there before. They're probably from https://github.com/ppy/osu/pull/29291.

bdach · Oct 03 '24 13:10

Yeah, those conflicts were fresh from recent refactors (not the first time these refactors have caused me issues 🫠). I've resolved them now.

tsunyoku · Oct 03 '24 15:10

!diffcalc

bdach · Oct 07 '24 11:10

Target: https://github.com/ppy/osu/pull/16280
Spreadsheet: https://docs.google.com/spreadsheets/d/1qi2pqXBmTC25OAcs6dIWtgKHQVPYQ86rMXxbmjEn0AU/edit

github-actions[bot] · Oct 08 '24 01:10