Dev diary: single-atom Beta power law
I was doodling and realized that while everything in the single-atom Ebisu case (v2 and v3) takes `pNow = p**(elapsed / t)` for an atom parameterized by `[a, b, t]` (where the probability of recall at time `t` is assumed to be a Beta(a, b) random variable, that is, probability of recall at time `t` ~ Beta(a, b)), there's nothing stopping us from changing this.

The `p**(elapsed / t)` exponentiation is why our Beta random variable decays via an exponential, and we can very very easily get a single Beta to exhibit power-law forgetting by saying `pNow = p**log2(1 + elapsed / t)`. Both these expressions share some nice properties:
- both `p**(elapsed / t)` and `p**log2(1 + elapsed / t)` are 0.5 when `elapsed == t` and `a == b`, i.e., `t` remains the halflife under the new expression
- both are 1.0 as `elapsed` → 0 and asymptotically approach 0 as `elapsed` grows very large.
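To spell out the first property above: at `elapsed == t`, `log2(1 + elapsed / t) == log2(2) == 1` (and likewise `elapsed / t == 1`), so `pNow == p**1 == p`, and the expected value of a Beta(a, b) random variable is `a / (a + b)`, which is 0.5 when `a == b`.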
The difference of course is that the power-law `p**log2(1 + elapsed / t)` decays muuuch slower than the exponential. It turns out that it's very easy to reuse the existing Ebisu v2 Beta+exponential library to do this power-law scheme, since in both cases `pNow = p**f(elapsed)`, i.e., the Beta random variable is raised to some power: `elapsed / t` for exponential decay, `log2(1 + elapsed / t)` for power-law decay.
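To make that concrete, here's a minimal sketch of the power-law prediction (my own reconstruction; the real implementation lives in the script linked below and may differ in details). It uses the fact that for `p ~ Beta(a, b)`, `E[p**c] = B(a + c, b) / B(a, b)`:

```python
from math import exp, log2
from scipy.special import betaln

def predictRecallPowerLaw(model, elapsed):
    """Expected recall E[p**log2(1 + elapsed/t)] for p ~ Beta(a, b), with model = (a, b, t)."""
    a, b, t = model
    c = log2(1 + elapsed / t)
    # E[p**c] = B(a + c, b) / B(a, b), computed in the log domain for stability
    return exp(betaln(a + c, b) - betaln(a, b))

print(predictRecallPowerLaw((2, 2, 10), 100))  # ≈ 0.170, matching the ipython transcript below
```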
I have a little script that demonstrates this: https://github.com/fasiha/ebisu/blob/v3-release-candidate/scripts/betapowerlaw.py
To run this:
- create a venv or Conda env,
- install dependencies: `python -m pip install numpy scipy pandas matplotlib tqdm ipython "git+https://github.com/fasiha/ebisu@v3-release-candidate"`,
- clone this repo and check out the release-candidate branch: `git clone https://github.com/fasiha/ebisu.git && cd ebisu && git fetch -a && git checkout v3-release-candidate`,
- download my Anki reviews database: collection-no-fields.anki2.zip, unzip it, and place `collection-no-fields.anki2` in the `scripts` folder so the script can find it,
- start ipython: `ipython`,
- run the script: `%run scripts/betapowerlaw.py`. This will produce some text/figures.
Now you can follow along:
```python
In [3]: predictRecall((2, 2, 10), 100)  # THIS IS THE NEWLY DEFINED FUNCTION IN betapowerlaw.py
Out[3]: 0.17014120906719243

In [4]: ebisu2.predictRecall((2, 2, 10), 100, exact=True)
Out[4]: 0.03846153846153846

In [5]: predictRecall((2, 2, 10), 1000)
Out[5]: 0.07175073430740214

In [6]: ebisu2.predictRecall((2, 2, 10), 1000, exact=True)
Out[6]: 0.0005711022272986858
```
Above we compare the predicted recall at 10 and 100 halflives:
- power-law decay: 17% and 7%, respectively
- exponential decay: 4% and 0.06%, respectively
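(Sanity-checking the exponential number by hand: for `p ~ Beta(2, 2)`, `E[p**c] = B(2 + c, 2) / B(2, 2)`, so at `c = 100 / 10 = 10` halflives this is `Gamma(12) * Gamma(4) / (Gamma(14) * Gamma(2)) = 6 / (13 * 12) = 1/26 ≈ 0.038`, matching `ebisu2.predictRecall` above.)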
Running the script above will generate this chart comparing a few models for a few hundred quizzes in terms of log-likelihood:
I have a very similar script for benchmarking the v3 ensemble-of-Betas algorithm; `%run scripts/analyzeHistory.py` will run it and generate this:
In the two charts above, higher is better (higher likelihood). Each point corresponds to the sum of log-likelihoods (product of raw likelihoods) over all quizzes for that flashcard. Flashcards are sorted from the worst log-likelihood to the best, and the 125 right-most points are flashcards for which I have no failures.
Looking at these side-by-side:
- the single-Beta power law algorithm is pretty damn good
- the Beta-ensemble is better though. Assuming the best model in both cases is close to optimal, the best v3-ensemble algorithm is 2-3 units of log-likelihood higher than the Beta-power-law algorithm's best scenario.
Both scripts also spit out a text file containing per-flashcard, per-quiz details of what likelihood each model assigned to the current quiz and its current halflife. Looking through these is really interesting because you can see how different models result in very different halflives after each quiz. This also emphasizes why benchmarking algorithms via log-likelihood (see https://github.com/fasiha/ebisu.js/issues/23) is tricky: an easy way to "cheat" is just to be overly optimistic, because in general failures are quite uncommon, and the penalty an algorithm incurs by being very wrong about occasional failures is more than made up for by the boost it gets by over-confidently predicting every quiz to be a success. This is really important: an algorithm/model that performs well in terms of sum-of-log-likelihoods isn't necessarily the best; we have to look at how it handles failures and how it grows halflives after quizzes, and whether those are reasonable.
So right now I'm not sure what to do, hence this dev diary: maybe writing things out will give me some ideas. I could try to see if there are better initial parameters that improve on these. I'm also going to investigate whether the halflives produced by the two algorithms are reasonable (since some apps will no doubt want to do the Anki thing and schedule reviews for when recall probability drops below a threshold).
If it turns out the single-atom Beta power law algorithm is good enough, should I scrap the Beta-ensemble model…?
> an easy way to "cheat" is just to be overly optimistic, because in general failures are quite uncommon, and the penalty an algorithm incurs by being very wrong about occasional failures is more than made up for by the boost it gets by over-confidently predicting every quiz to be a success.
We can use cross entropy to measure the performance of the model. It cannot be cheated by over-confident predictions, because it measures the difference between the prediction distribution and the real distribution.
> We can use cross entropy to measure the performance of the model. It cannot be cheated by over-confident predictions, because it measures the difference between the prediction distribution and the real distribution.
Oh nice! Is there somewhere I can read about this? Does this work for data that's real flashcards? Or do you need to simulate flashcards from a known probability distribution in order to obtain the K-L distance between the "true" prior and the distribution assumed by Ebisu/FSRS?
The FSRS optimizer uses binary cross-entropy as its loss function:
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/7d69d43d6b8c46fc572793e04716873fc395d366/src/fsrs_optimizer/fsrs_optimizer.py#L252
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/7d69d43d6b8c46fc572793e04716873fc395d366/src/fsrs_optimizer/fsrs_optimizer.py#L1355-L1358
> Or do you need to simulate flashcards from a known probability distribution in order to obtain the K-L distance between the "true" prior and the distribution assumed by Ebisu/FSRS?
It's very common to use cross entropy as the loss function in classification tasks: https://machinelearningmastery.com/cross-entropy-for-machine-learning/
Oh I see what you mean! Alas, in all my examples, what I call "log likelihood" is literally the cross-entropy from binary classification: per https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html, cross-entropy is `sum(y * log(p) + (1 - y) * log(1 - p) for y, p in zip(results, probabilities))` where `results` is a list of bools and `probabilities` is a list of floats (between 0 and 1), whereas I just call `scipy.stats.binom.logpmf` in https://github.com/fasiha/ebisu/blob/8ae2a954d3cb888e69f906a3b16dceec9be1efb4/scripts/analyzeHistory.py#L37-L39, which simplifies to exactly the same function in the Bernoulli case (binary quiz result, i.e., when `n=1`), before summing them. In the Bernoulli/binomial quiz case the "likelihood" is just the probability mass function, so the two are identical.
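A quick way to convince yourself of that equivalence (my own check, not part of the scripts):

```python
from math import log, isclose
from scipy.stats import binom

# For a binary quiz (n=1), the binomial log-PMF is the same per-quiz term that
# appears in binary cross-entropy.
for result, p in [(1, 0.8), (0, 0.8), (1, 0.3), (0, 0.3)]:
    assert isclose(binom.logpmf(result, 1, p), log(p if result else 1 - p))
```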
I've definitely noticed that this metric is vulnerable to being misled by overconfident models that assign overoptimistic predictions all the time, like I shared in https://github.com/fasiha/ebisu.js/issues/23#issuecomment-1818202898. Let's carry on discussing that there!
(disclaimer for the following: I am dyscalculic so I stumbled into this one more or less accidentally at 2am while making a combinatorial analogue of Anki that represents concepts latently and separately from their testing contexts. I have a rough mental image of what it's doing but I kind of just tried every single combination of "Math.log", "/", "*", etc. that I could think of and once it started spitting out good numbers I ran with it. I don't know how it works but I'm suspecting it's similar.)
```js
let decayEstimate = -0.5; // gradient over time
let flexibility = 0.095; // how much it freaks out and adjusts if it gets something wrong
let base = 1.168;

// -- learning --
// first entry is recorded
let $sinceLastSeen = 0.001; // in hours
let $accuracy = 1.0; // between 0 and 1

let impliedDecay = (Math.log($accuracy) / Math.log(base)) / $sinceLastSeen;
// impliedDecay: 0
// i.e., "wow! the user has a perfect memory and will never forget this ever"

// `average` is a weighted-average helper (not defined here)
let newDecay = average([decayEstimate, impliedDecay], { ratio: [flexibility, 1 - flexibility] });
// newDecay: -0.4525
decayEstimate = newDecay;

// ...
// typically after a few repetitions, you will get:
// decayEstimate: -0.35

// -- situation 1: user waited way too long after learning --
$sinceLastSeen = 24;
let expectedAccuracy = Math.pow(1.168, -0.35 * 24);
// expectedAccuracy: 0.27

// case A (did better than we expected):
let accuracy = 0.84; // usually calculated via Levenshtein distance, I don't think it matters too much
impliedDecay = (Math.log(0.84) / Math.log(1.168)) / 24;
// impliedDecay = -0.047
// ...
// -> newDecay = -0.33

// case B (did as well as expected) -> decay remains identical
// case C (did worse than expected) -> decay increases
```
All that gets stored per concept entry is `lastSeen` and `decay`. In retrospect, what I think this is implicitly doing is modelling maturity/interval, strength of belief updates, cross-entropy loss, and inherent difficulty, all using a single parameter. I don't know what the downsides of this approach are, but it was fast to calculate, and it held up surprisingly well across the ~20 people I tested it on (ranging from beginners to advanced learners who wanted daily practice). I wish I knew how to apply it to that test data.

edit: important thing to note, I optimized the hyperparams here for minimizing boredom, not maximizing accuracy.
@ckoshka thank you so much for sharing! I love these simple, battle-tested models, and I totally adore language learning apps that break the mold. Langwitch is such a fresh take, I look forward to experimenting with it!
Can you check if the following is right (I used Python, sorry, if you prefer, I can readily rewrite in JavaScript):
```python
from math import log

initDecay = -0.5
flexibility = 0.095
base = 1.168

def update(decay: float, sinceLastSeen: float, accuracy: float) -> float:
    impliedDecay = (log(accuracy) / log(base)) / sinceLastSeen
    return decay * flexibility + impliedDecay * (1 - flexibility)

def predict(decay: float, sinceLastSeen: float) -> float:
    return base**(decay * sinceLastSeen)

model = initDecay
for sinceLastSeen, accuracy in [(24, .2), (12, .6), (36, .3)]:
    prediction = predict(model, sinceLastSeen)
    model = update(model, sinceLastSeen, accuracy)
    print(f'{sinceLastSeen} hours: pred={prediction:.2f}. Quiz accuracy={accuracy} → new decay={model}')
```
That last bit is a little demo that starts with the initial decay and goes through three quizzes, 24 hours, then 12 hours, then 36 hours, updating after each quiz. At each quiz it prints out the prediction and the new decay. This is the output:
```
24 hours: pred=0.16. Quiz accuracy=0.2 → new decay=-0.4383049072144775
12 hours: pred=0.44. Quiz accuracy=0.6 → new decay=-0.2897170796140843
36 hours: pred=0.20. Quiz accuracy=0.3 → new decay=-0.22242283525112294
```
If I did the coding right, these numbers hopefully make sense?
I haven't dug enough to know if Langwitch actively prompts users to review a concept or if it's fully passive, but is there an accuracy (returned by `predict` above) that's "bad", indicating you should review this?
If the above code is right, it might be straightforward to plug this into the test frameworks @L-M-Sherlock and I both have, and see how this algorithm would have handled a bunch of flashcards I'm familiar with (the same ~380 test cases from my original post). Thanks again!!!
(Apologies for spamming those subscribed; please feel free to mute this issue, I'll be continuing to use it as a dev diary.)
I was thinking about how annoying it is that the sum-of-log-likelihoods under-penalizes overconfident models. For example, consider a flashcard for which the student got five successes followed by a failure. A model that predicted 90% for all six of these would score considerably better than a more conservative model that predicted 60% for all six:
```python
In [10]: from scipy.stats import bernoulli

In [16]: bernoulli.logpmf(1, 0.9) * 5 + bernoulli.logpmf(0, 0.9)
Out[16]: -2.829387671283177

In [17]: bernoulli.logpmf(1, 0.6) * 5 + bernoulli.logpmf(0, 0.6)
Out[17]: -3.4704188507041085
```
In the sum-of-log-likelihood metric, higher is better. (I'm being very extra by importing `bernoulli` just to evaluate its density; `logpmf = lambda result, p: log(p if result else 1-p)` does exactly the same thing of course.)
I feel like the conservative model that predicts 60% is "better" here and I'd prefer a metric that handled this.
So I was reading up about how machine learning folks handle unbalanced data, which is basically what we have here: failures being a lot less common than successes is what causes this problem, and thanks to @L-M-Sherlock's very kind pointer I was able to understand that what I called "sum-of-log-likelihoods" is called cross-entropy. A few years ago a user on the Data Science Stack Exchange pointed to a paper on arXiv introducing a focal loss metric, a modification to cross-entropy that appears to ameliorate the impact of unbalanced data in an intriguing way. Rather than the score of a single observation being

`log(p if result else 1-p)`

Lin et al. propose

`log(p**((1-p)**gamma)) if result else log((1-p)**(p**gamma))`

for some `gamma >= 0`. When gamma is 0, you get standard cross-entropy (also known as the Bernoulli distribution's log probability mass function). But for `gamma > 0` (they recommend `gamma=2`), this behaves quite interestingly: it ignores easily-predicted data and heavily focuses on "hard, misclassified examples".
Check it out:
```python
from math import log

def bernoulliLogProbabilityFocal(result: bool, p: float, gamma: float = 2) -> float:
    assert 0 <= p <= 1
    assert 0 <= gamma
    focalP = p**((1 - p)**gamma)
    focalQ = (1 - p)**(p**gamma)
    return log(focalP if result else focalQ)
```
and
```python
In [20]: bernoulliLogProbabilityFocal(True, 0.9, 2)*5 + bernoulliLogProbabilityFocal(False, 0.9, 2)
Out[20]: -1.8703619511080685

In [21]: bernoulliLogProbabilityFocal(True, 0.6, 2)*5 + bernoulliLogProbabilityFocal(False, 0.6, 2)
Out[21]: -0.7385251624874881
```
With the `gamma=2` focal loss, for the example we saw before (five successes and one failure):
- the overconfident model that predicted 90% for each has a score of -1.9 (previously -2.8 under cross-entropy)
- whereas the conservative model that predicted 60% has a score of -0.7 (previously -3.5 under cross-entropy)
The important thing to see here is that the rankings have switched. The conservative model is better (higher number). Hmmmm, this might be what we want!
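As a quick check of the claim above that `gamma=0` recovers plain cross-entropy (my own snippet; the function is just a copy of the one defined above):

```python
from math import log, isclose

def bernoulliLogProbabilityFocal(result: bool, p: float, gamma: float = 2) -> float:
    focalP = p**((1 - p)**gamma)
    focalQ = (1 - p)**(p**gamma)
    return log(focalP if result else focalQ)

# gamma=0 reduces to the Bernoulli log-PMF, i.e., the standard cross-entropy term
for result, p in [(True, 0.9), (False, 0.9), (True, 0.6), (False, 0.6)]:
    assert isclose(bernoulliLogProbabilityFocal(result, p, gamma=0),
                   log(p if result else 1 - p))
```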
So last time we compared a few parameterizations of the v3-ensemble algorithm (an ensemble of weighted Beta distributions over logarithmically-increasing halflives, all decaying exponentially) and the single-Beta power law algorithm. I wanted to find the "best" parameterization for both algorithms for my training sample of ~380 flashcards (~4700 reviews):
- for the v3-ensemble (see this branch): initial alpha/beta, initial halflife, and firstWeight
- for the Beta-power-law: initial alpha/beta and initial halflife
I did a quick 2D grid search varying alpha/beta and halflife for both models to see what parameters led to the best sum of focal loss for all flashcards, all reviews in my test set:
Beta power law:
v3-ensemble, for `firstWeight=0.5` (I also tried 0.7 and 0.9 but these were much worse than 0.5):
So these tell us the "optimal" parameters for each of these algorithms that yield the best focal loss summed over all flashcards, all quizzes. The quotes around "optimal" are very heavy because this is such a rough grid search, and who knows maybe focal loss isn't what we'll want to use anyway but let's see where it leads us.
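For the record, here's roughly what I mean by "2D grid search", as a self-contained sketch (my own toy version: it scores each candidate `(alpha=beta, halflife)` pair by summed focal loss on a made-up quiz history, and it skips the per-quiz Bayesian update that the real scripts of course perform):

```python
from math import exp, log, log2
from itertools import product
from scipy.special import betaln

def predictRecallPowerLaw(model, elapsed):
    a, b, t = model
    c = log2(1 + elapsed / t)
    return exp(betaln(a + c, b) - betaln(a, b))  # E[p**c] for p ~ Beta(a, b)

def focal(result, p, gamma=2.0):
    return log(p**((1 - p)**gamma)) if result else log((1 - p)**(p**gamma))

# Toy quiz history: (hours since last review, binary result). Purely illustrative.
history = [(24, 1), (80, 1), (200, 0), (50, 1), (300, 1)]

best = max(
    product([1.25, 1.5, 2.0, 3.0], [10.0, 50.0, 125.0, 500.0]),  # (alpha=beta, halflife) grid
    key=lambda abhl: sum(focal(r, predictRecallPowerLaw((abhl[0], abhl[0], abhl[1]), h))
                         for h, r in history),
)
print("best (alpha=beta, initial halflife):", best)
```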
Picking some parameters around the optima for both algorithms and remaking the charts similar to the first post above yields:
Umm, I'm not really sure how focal loss can go positive. It could be because I adapted it from its original Bernoulli-probability-mass-function setting to the binomial and fuzzy-binary probability distributions that Ebisu uses (I map my flashcard history from Anki's 1-4 ranks to binomial (4 = easy = 2 out of 2) and fuzzy-binary (2 = hard = `q0=0.2`)), and maybe that's causing some probabilities to become non-probabilities and exceed 1 (so their log goes positive). I'll dig into this more, but I am hoping the impact of any bugs I find is small.
But assuming this is reasonable, the two algorithms perform surprisingly similarly. The best parameterization for either doesn't really outstrip the other.
Of course I then wanted to see what halflives these algorithms output. For each of the 382 flashcards I can take the final model after the last quiz and compute its 50-percentile decay time (halflife) and 80-percentile, and plot it:
It's clear that the v3-ensemble algorithm behaves a lot better than the Beta-power-law algorithm when it comes to halflives. The v3-ensemble final halflives are between a few thousand and a few tens of thousands of hours, and the 80-percentile times are all uniformly a bit less than that. The Beta-power-law, however, has enormous halflives, up to 10^10 hours (one million years…): for a hundred out of 382 flashcards the Beta-power-law halflife is somewhat comparable to the v3-ensemble halflife, but the rest are considerably higher. The 80-percentile times are a lot more reasonable.
I don't think this is a major condemnation of the Beta-power-law model (though I might be biased, this is after all its dev diary). There is literally no serious bound on the power law's recall probability: `p**log(1 + elapsedTime)` decays verrrrrrry slowly for all time, whereas the v3-ensemble's recall probability is a power law up to the last atom's halflife (10,000 hours) and then collapses into an exponential. The halflife behavior here honestly makes sense.
In terms of usability, I think we could very reasonably build Anki-style quiz apps that schedule reviews when the recall probability drops to 80% or 70%? (I personally don't like such apps, I prefer apps that calculate which card to review now when I'm free, on my schedule, not on the app's schedule, so I could use some feedback).
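For what it's worth, here's a hedged sketch of how such an app might compute "time until predicted recall drops to X%" for the power-law model (`hoursToRecall` is a hypothetical helper name; the actual per-flashcard numbers come from `scripts/compareEnsemblePowerlawHalflives.py` below):

```python
from math import exp, log2
from scipy.optimize import brentq
from scipy.special import betaln

def predictRecallPowerLaw(model, elapsed):
    a, b, t = model
    c = log2(1 + elapsed / t)
    return exp(betaln(a + c, b) - betaln(a, b))  # E[p**c] for p ~ Beta(a, b)

def hoursToRecall(model, target=0.8):
    """Hours until predicted recall decays to `target`, via root-finding (hypothetical helper)."""
    return brentq(lambda h: predictRecallPowerLaw(model, h) - target, 1e-9, 1e15)

model = (1.25, 1.25, 125.0)  # illustrative initialization (alpha = beta = 1.25, halflife = 125 hours)
print(hoursToRecall(model, 0.8))  # time to 80% recall
print(hoursToRecall(model, 0.5))  # halflife
```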
And of course both algorithms output reasonable probability of recall for each flashcard's review. I'll be digging into the individual reviews next and seeing if I prefer one or the other algorithm, now that I have a metric (focal loss) that I think helps deal with unbalanced data.
The code to generate all these charts is in this branch. Setup instructions are at the top of this thread; then:

```
cd ebisu  # make sure you're in the right branch
python scripts/analyzeHistory.py  # saves the last plot above, for v3-ensemble
python scripts/betapowerlaw.py  # saves the second-to-last plot above, Beta-power-law
python scripts/compareEnsemblePowerlawHalflives.py  # generates halflife/80-percentile plots
```

If you run these in `ipython` you can zoom and analyze the plots. Both scripts `analyzeHistory.py` and `betapowerlaw.py` have a `GRID_MODE` boolean that you can set to `True` to generate the 2D heat maps.
> I was thinking about how annoying it is that the sum-of-log-likelihoods under-penalizes overconfident models. For example, consider a flashcard for which the student got five successes followed by a failure. A model that predicted 90% for all six of these would score considerably better than a more conservative model that predicted 60% for all six:
I figured it out. The main divergence between our perspectives is that Ebisu considers the probability of recall at the card level, but FSRS considers P(recall) at the review-history level. I built a dataset here: https://huggingface.co/datasets/open-spaced-repetition/fsrs-dataset. I hope it will be helpful to your development and research.
> The main divergence between our perspectives is that Ebisu considers the probability of recall at the card level, but FSRS considers P(recall) at the review-history level
@L-M-Sherlock yeah, this is a great observation. I still am not exactly sure how you convert Ebisu's `predictRecall` predictions for each card review to the benchmarks in https://github.com/fasiha/ebisu.js/issues/23#issuecomment-1817448306, but I'm sure you're taking care to avoid the problems with cross-entropy that I'm finding pop up when handling unbalanced data.
Continuing with the dev diary for this single-Beta power-law model. I found the reason the focal loss was going positive (since focal loss is in the log domain, this means a probability > 1!): I had a typo in the noisy-binary focal loss function, mixing up `p` vs `(1-p)` in the wrong way, and fixed it in https://github.com/fasiha/ebisu/commit/6d933d0b1b736f4992b3b002653116e7748e91b1 and https://github.com/fasiha/ebisu/commit/8cd4f0859e640d98146e88862a3339fe5efcb19a.
This changes the results of the grid search just a little, and I confirmed that if I map my Anki history to just binary (1 and 2 = fail, 3 and 4 = pass), the overall shape of this graph doesn't really change, giving us confidence that the extensions I've made taking the focal loss from modeling Bernoulli data to binomial/noisy-binary data are valid.
So it appears that the bug in focal loss hasn't destroyed our nice algorithm here!
I've been digging into the predictions and halflives generated by this algorithm and I like them. The v3-ensemble model (an ensemble of several Beta random variables at different halflives, each with exponential decay) produces different predictions and posterior halflives compared to this single-Beta power law when it comes to failures, early successes, and surprise long successes, but I think the power law reacts more quickly? In the table below (sorry for how big it is, if possible please read it on desktop), I have the results for flashcard 1300038030806 (in my Anki database, linked in the first post). The results for each quiz are 1-4:
- "fail", a binary quiz, 0 out of 1 point
- "hard", a noisy-binary quiz, result=1, q1 = 1, q0 = 0.2
- "good", a binary quiz, 1 out of 1 point
- "easy", a binomial quiz, 2 out of 2 points.
For each quiz, I show:
- quiz number,
- quiz result,
- how many hours elapsed since it was last studied,
- the power law's (initial alpha=beta=1.25 and initial halflife = 125 hours) probability of recall at the time of the quiz,
- the power law's posterior halflife after the update,
- the power law's posterior time to 80% probability of recall,
- the power law's focal loss for this quiz (higher is better),
- the cumulative sum of focal loss,
- and similarly for the v3-ensemble (`firstHalflife=50, firstWeight=0.5, initialAlphaBeta=1.25, lastHalflife=10e3`) algorithm: its predicted probability of recall before seeing the quiz,
- v3 posterior halflife after the update,
- v3 posterior time to 80% recall probability,
- focal loss,
- and cumulative focal loss.
| # | result | time | plaw pRecall | plaw hl | plaw 80% hl | plaw floss | plaw Σfloss | v3 pRecall | v3 hl | v3 80% hl | v3 floss | v3 Σfloss |
|---|--------|------|--------------|---------|-------------|------------|-------------|------------|-------|-----------|----------|-----------|
| 1 | 2 (hard) | 0.3h | 1.00 | 125.4 | 25.1 | -0.0 | -0.0 | 1.00 | 132.5 | 25.7 | -0.0 | -0.0 |
| 2 | 2 (hard) | 22.7h | 0.81 | 150.0 | 28.8 | -0.0 | -0.0 | 0.82 | 177.6 | 34.1 | -0.0 | -0.0 |
| 3 | 1 (fail) | 70.3h | 0.65 | 70.6 | 17.2 | -0.4 | -0.4 | 0.68 | 57.2 | 14.8 | -0.5 | -0.5 |
| 4 | 1 (fail) | 34.1h | 0.67 | 46.4 | 12.3 | -0.5 | -0.9 | 0.63 | 31.9 | 9.2 | -0.4 | -0.9 |
| 5 | 3 (good) | 13.7h | 0.78 | 51.1 | 13.4 | -0.0 | -0.9 | 0.72 | 37.2 | 10.6 | -0.0 | -1.0 |
| 6 | 3 (good) | 43.8h | 0.54 | 65.1 | 16.6 | -0.1 | -1.1 | 0.45 | 57.1 | 15.9 | -0.2 | -1.2 |
| 7 | 3 (good) | 148.0h | 0.30 | 104.6 | 24.9 | -0.6 | -1.7 | 0.25 | 169.2 | 42.6 | -0.8 | -2.0 |
| 8 | 3 (good) | 204.3h | 0.35 | 162.9 | 35.5 | -0.4 | -2.1 | 0.45 | 384.1 | 94.8 | -0.2 | -2.2 |
| 9 | 3 (good) | 357.9h | 0.35 | 267.6 | 51.2 | -0.5 | -2.5 | 0.52 | 733.7 | 186.6 | -0.2 | -2.4 |
| 10 | 3 (good) | 541.1h | 0.38 | 449.3 | 72.5 | -0.4 | -2.9 | 0.58 | 1194.4 | 309.9 | -0.1 | -2.5 |
| 11 | 4 (easy) | 1032.0h | 0.38 | 1444.9 | 142.1 | -0.7 | -3.7 | 0.54 | 2781.5 | 731.8 | -0.3 | -2.7 |
| 12 | 1 (fail) | 2545.4h | 0.44 | 916.0 | 113.4 | -0.1 | -3.8 | 0.52 | 1835.7 | 516.3 | -0.2 | -2.9 |
| 13 | 4 (easy) | 49.1h | 0.89 | 1043.6 | 121.9 | -0.0 | -3.8 | 0.98 | 1881.5 | 529.1 | -0.0 | -2.9 |
| 14 | 4 (easy) | 192.4h | 0.74 | 1492.4 | 147.6 | -0.0 | -3.8 | 0.92 | 2061.7 | 579.3 | -0.0 | -2.9 |
| 15 | 1 (fail) | 693.5h | 0.59 | 961.1 | 118.1 | -0.3 | -4.1 | 0.77 | 1401.0 | 408.1 | -0.9 | -3.8 |
| 16 | 4 (easy) | 47.8h | 0.90 | 1064.5 | 125.0 | -0.0 | -4.1 | 0.97 | 1432.9 | 417.2 | -0.0 | -3.8 |
| 17 | 4 (easy) | 119.3h | 0.81 | 1310.7 | 139.9 | -0.0 | -4.2 | 0.94 | 1513.3 | 440.0 | -0.0 | -3.8 |
| 18 | 3 (good) | 359.8h | 0.67 | 1611.5 | 155.9 | -0.0 | -4.2 | 0.83 | 1637.0 | 474.9 | -0.0 | -3.8 |
| 19 | 3 (good) | 747.6h | 0.59 | 2155.5 | 180.5 | -0.1 | -4.3 | 0.71 | 1904.5 | 549.7 | -0.0 | -3.8 |
| 20 | 3 (good) | 94.9h | 0.87 | 2343.4 | 188.0 | -0.0 | -4.3 | 0.96 | 1939.6 | 559.4 | -0.0 | -3.8 |
| 21 | 1 (fail) | 1847.5h | 0.53 | 1592.5 | 156.5 | -0.2 | -4.5 | 0.51 | 1490.9 | 443.0 | -0.2 | -4.0 |
| 22 | 3 (good) | 27.0h | 0.95 | 1632.5 | 158.5 | -0.0 | -4.5 | 0.99 | 1497.9 | 445.0 | -0.0 | -4.0 |
| 23 | 3 (good) | 96.9h | 0.85 | 1755.2 | 164.5 | -0.0 | -4.5 | 0.95 | 1523.4 | 452.3 | -0.0 | -4.0 |
| 24 | 4 (easy) | 236.9h | 0.75 | 2289.7 | 187.9 | -0.0 | -4.5 | 0.89 | 1651.2 | 488.7 | -0.0 | -4.0 |
| 25 | 3 (good) | 809.3h | 0.62 | 2934.1 | 211.7 | -0.1 | -4.6 | 0.70 | 1884.1 | 554.0 | -0.0 | -4.0 |
| 26 | 3 (good) | 713.2h | 0.66 | 3701.3 | 235.9 | -0.0 | -4.7 | 0.75 | 2106.7 | 615.4 | -0.0 | -4.1 |
We see that by the 11th quiz, after a run of successes with the last one 1000 hours after the previous review, the v3-ensemble (right-most columns) has exploded the halflife to almost 2800 hours, while the power-law algorithm is still conservative and predicts a roughly 1500-hour halflife. Anki chose to schedule the 12th quiz 2500 hours later, which failed. There are three runs of successes, and in each the power law chooses to strengthen its halflife in a different way than the v3 algorithm. Both choices seem defensible, and the probability of recall predicted by each is not bad.
The main issue that authors of Anki-style apps will see is that the time to reach 80% recall probability is extremely conservative for the power-law algorithm. Given that this quiz schedule (picked by Anki) results in the power-law algorithm assigning the quizzes 60-ish percent predicted recall probability, quiz apps might use the time to 60% (or 70%) recall to schedule reviews? I don't have good instincts when it comes to designing apps like this (as I said, I much prefer apps that don't schedule cards but rather check which are most at risk of being forgotten when I am ready to spend time reviewing, on my schedule), but it seems like there's a risk when the time-to-80% is too soon but time-to-70% is too far? Is this inevitable when scheduling reviews in the future, and should I not worry about it?
For apps that I tend to write, of course, that don't schedule reviews, the fact that the power-law model predicts probability of recall reasonably is very exciting. We now have a few different algorithms that solve the core issue raised in #43: we have an ensemble of Gamma random variables, an ensemble of Beta random variables (reusing Ebisu2), and also potentially this single-Beta power-law algorithm (also leveraging Ebisu2).
> I still am not exactly sure how you convert Ebisu's predictions for each card review to the benchmarks
The code is here: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/0da0a9adeecb4d354ce3f8933fc2d5ee300a9bb6/other.py#L234-L264
> I'm sure you're taking care to avoid the problems with cross-entropy that I'm finding pop up when handling unbalanced data.
I think the unbalanced data is not a big deal for FSRS, because FSRS can predict the probability of recall accurately:
Source: https://www.reddit.com/r/Anki/comments/15mab6e/fsrs_explained_part_2_accuracy/
> The main issue that authors of Anki-style apps will see is that the time to reach 80% recall probability is extremely conservative for the power-law algorithm. Given that this quiz schedule (picked by Anki) results in the power-law algorithm assigning the quizzes 60-ish percent predicted recall probability, quiz apps might use the time to 60% (or 70%) recall to schedule reviews?
FSRS allows users to set their desired retention. FSRS uses this formula to schedule the interval based on the desired retention and stability:
$I(r,S) = 9 \cdot S \cdot \left(\cfrac{1}{r} - 1\right)$
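(A quick worked example, assuming the formula above: at desired retention $r = 0.9$, $I = 9 \cdot S \cdot (1/0.9 - 1) = S$, i.e., the scheduled interval equals the stability; at $r = 0.8$, $I = 2.25 \cdot S$.)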