
Better understand probability-based schedules to pick initial a=b

Open fasiha opened this issue 5 years ago • 8 comments

@mustafa0x sent this: https://svelte.dev/repl/23a76045aa884384b8de8b737d682a7f?version=3.15.0

I made a small enhancement: https://svelte.dev/repl/67711abbe3784fccb67dc6b3f614a6d6?version=3.15.0

Per my email:

For a=b=4, it takes 5 or 6 quizzes (all passes) to get to 10x the initial halflife (assuming you quiz exactly at the halflife, of course). But if you fail the first quiz, it takes 7 or 8 quizzes to get to 10x.

For a=b=2, it takes 4 quizzes (all passes) to get to 10x the initial halflife. But failing the first quiz and passing the rest, it takes 6 or 7 quizzes.

At a=b=4, quiz # 12 (of all-passes) has the same cumulative elapsed time since learning as quiz # 15 (of first-fail-then-pass). That's like three extra quizzes out of fifteen that you had to do to overcome that initial failure.

At a=b=2, quiz # 9 or # 10 (of all-passes) has the same cumulative elapsed time as quiz # 15. That initial failure results in more than five extra quizzes out of fifteen.

fasiha avatar Nov 28 '19 01:11 fasiha

Thanks Ahmed!

I'm still doing quite a bit of analysis, but I feel that failing should have less of an impact (which is SM-2's biggest flaw, IMO). I'm currently doing that by artificially inflating the elapsed time I pass to updateRecall when the quiz was failed.
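A minimal sketch of that trick (the wrapper and the `inflate` knob are made up for illustration; `update_fn` stands in for ebisu's updateRecall, whatever its signature in your version):

```python
def update_soften_failure(update_fn, model, success, elapsed, inflate=2.0):
    """On a failed quiz, pretend more time elapsed than really did, so the
    algorithm is less surprised by the failure and penalizes the halflife
    less. `inflate` is a made-up tuning knob, not part of the Ebisu API;
    `update_fn` stands in for ebisu.updateRecall."""
    t = elapsed * inflate if not success else elapsed
    return update_fn(model, success, t)
```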

mustafa0x avatar Nov 29 '19 14:11 mustafa0x

I've been thinking about your comment, and two things that I do might help.

First is that, if you fail when probability of recall is low, then the halflife will barely change, since the algorithm isn't surprised you failed after a long time without studying. And very often I review when recall probability is low, so failure isn't something I'm afraid of (which might be a good thing or bad thing—some research does show that the more you struggle to recall, the better your retention gets; but personally the joy I get from remembering the answer after a long struggle is much less than finishing my reviews and moving on with my life).

Sub-note 1. By default I initialize new cards with halflife of 0.25 hours, so in a 30–60 minute review session I'll probably review several newly learned cards; but invariably there are several hours when I can't review due to work or sleep, so those cards have very low probability of recall after eight hours. When I pass them, there's a big boost in halflife; if I fail them, there's barely a blip.

Sub-note 2. I also find it's very useful to allow users to scale the initial halflife. Often I have cards I already know well, so I'll scale the initial halflife by 10x, 100x, or 1000x (for roughly a two-hour, one-day, or one-week halflife). I really hate reviewing cards that I know well.

Second, how much of an impact does failing really have? I made some changes to the Svelte Ebisu playground you made:

https://svelte.dev/repl/9e31d30ac2da4746bb0f405987876a66?version=3.15.0

to allow you to set the probability of recall to quiz at (instead of fixing it at 50%), to tie a and b to the same value, and to set the halflife to 1.0 (arbitrary units, since everything scales). There are two columns of buttons; the left three columns of numbers correspond to the first column of buttons, and the right three columns of numbers to the second column of buttons.

  • At a=b=4 and quizzing at 50%,
    • assuming you pass your first two quizzes, the time to next quiz (i.e., time to pRecall going to 50%) is 1.19x the previous one.
    • In contrast, if you fail your first quiz, and pass the second one, the time to next quiz (TTQ) is 1.16x.

As you can see, over a stretch of successful quizzes, the ratio of TTQs slowly decreases towards 1.0: although each quiz is still scheduled farther out than the last, the rate of growth in TTQ is slowing down. This is in contrast to SM2/Anki, where the ratio of time between quizzes is constant (assuming a Good rating), i.e., the quizzes continue to be spaced exponentially apart.
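You can reproduce these TTQ ratios from scratch: here is a pure-Python sketch of the Bernoulli-success update and halflife solver, following the moment-matching derivation in the Ebisu README (this is not the library's implementation; in practice use ebisu.updateRecall and ebisu.modelToPercentileDecay):

```python
from math import lgamma, exp

def _betaln(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def update_on_success(model, tnow):
    """Posterior (a', b', tnow) after one successful quiz at elapsed time
    tnow, given prior Beta(a, b) on recall probability at elapsed time t."""
    a, b, t = model
    d = tnow / t  # recall probability at tnow is p**d for p ~ Beta(a, b)
    # Posterior moments of recall probability at tnow, given one success:
    # m_n = B(a + (n + 1) * d, b) / B(a + d, b)
    m1 = exp(_betaln(a + 2 * d, b) - _betaln(a + d, b))
    m2 = exp(_betaln(a + 3 * d, b) - _betaln(a + d, b))
    var = m2 - m1 * m1
    # Moment-match a Beta(a', b') to this mean and variance
    nu = m1 * (1 - m1) / var - 1
    return (m1 * nu, (1 - m1) * nu, tnow)

def halflife(model, percentile=0.5):
    """Elapsed time at which expected recall decays to `percentile`
    (cf. ebisu.modelToPercentileDecay), found by bisection."""
    a, b, t = model
    recall = lambda h: exp(_betaln(a + h / t, b) - _betaln(a, b))
    lo, hi = 1e-9, t
    while recall(hi) > percentile:  # widen bracket until recall dips below
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if recall(mid) > percentile:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Sanity check: for (a, b, t) = (4, 4, 1) the halflife is exactly 1; a success exactly at the halflife is the conjugate update, giving Beta(5, 4), whose halflife is about 1.19, matching the 1.19x ratio above.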

This is actually the first time I realized this, and I found it so curious I had to prove that the ratio of TTQs for a string of passes converges to 1.0†. I actually like this a lot: in Anki, if you see a card after ten years and you click Good, it can schedule it for ~thirteen years in the future, which always made me uncomfortable because there's a limit to human memory (e.g., research showing doctors tend to forget what they learned in med school after five, ten years, in the absence of reviews). There's nothing in the Ebisu algorithm's design that sought to limit the growth of quiz intervals, but I don't think it's hard to explain: each successive successful quiz result gives more information about the memory model, but this doesn't force the halflife (or time-to-quiz TTQ at any given percentile) to go to infinity. The small chance of failure (encoded by initial b>1, or provided by an actual quiz failure) prevents the ratio of TTQs from growing without bound.

Well, maybe it's a little hard to explain, that doesn't seem very satisfying.

†The only case where the ratio of TTQs doesn't converge to 1 is b=1, where it instead settles at a constant greater than 1. If you start with a=b=1, that's a flat probability distribution over your recall probability at the initial halflife, and each successive success multiplies the TTQ (and the halflife) by the same fixed factor. Try setting a=b=1 and sliding the quiz-probability slider around, and you'll get Anki/SM2-like behavior with exponential growth of TTQ and halflife alike, rather than Ebisu's damped exponential, which eventually becomes linear.

Anyway, getting back to the topic at hand: by playing with initial a=b>1 and the quiz probability, you'll see that even if you fail the first quiz, the ratio of time-to-quiz at each subsequent quiz gets close to the all-pass case within a handful of quizzes. With initial a=b=4 and probability-of-quiz at 50%, by quiz # 9 the TTQ ratio is 1.18x versus 1.15x, i.e., the intervals differ by only 3%, even though nine successful quizzes take you to 20.1 total time units since learning, whereas one failure and eight successes take you to only 12.7 total time units.

Reducing the probability-to-quiz to 10%, as I suggested above, greatly increases the ratio of TTQs: you're under-reviewing, so the algorithm gains a lot of confidence that you know the flashcard. In this setting, quiz # 9 has a TTQ ratio of 1.61x for all-passes, whereas the fail+pass case grows more slowly at 1.55x, but that's still only a 4% difference.

Lowering initial a=b strengthens this effect: you're making the algorithm initially less confident in the halflife you give it, so it responds more aggressively to quiz results. For initial a=b=2 and probability-to-quiz of 10%, quiz # 9's two TTQ ratios are 2.57x for all-pass and 2.13x for fail+pass, a 20% difference. In this regime, I can sympathize with someone annoyed that the initial failure causes this much extra reviewing.


I'll think about a nice way to adjust the memory model and add it to the library. Here's a simple thing you can do in the meantime: given a model, you can convert it to a symmetric, fully-balanced one like this:

model = [4.4, 3.3, 2.2] # for example
hl = ebisu.modelToPercentileDecay(model) # compute halflife
balancedModel = ebisu.updateRecall(model, True, 0.00001, False, hl)
# (3.336557408068802, 3.336553824064345, 2.783719390318511)

The new model is "balanced" because a=b (very close at least, since above we faked a quiz), which means the t component (third element of the new model tuple) is the halflife.
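A quick check of that claim, using the mean-recall formula from the Ebisu derivation: for a model (a, b, t), the expected recall probability after elapsed time τ is

```latex
\mathbb{E}[p_\tau] = \frac{B(a + \tau/t,\; b)}{B(a,\, b)},
\qquad\text{so at } \tau = t:\quad
\mathbb{E}[p_t] = \frac{B(a+1,\, b)}{B(a,\, b)} = \frac{a}{a+b} = \tfrac{1}{2} \text{ when } a = b.
```

Since modelToPercentileDecay solves E[p_τ] = 1/2 for τ, a balanced model's t is exactly its halflife.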

Now if your user is annoyed at how often they are seeing some flashcard, you can compute the balancedModel and set that flashcard's model to something like [balancedModel[0], balancedModel[1], balancedModel[2] * 1.5] to keep the same shape of the distribution but push the halflife out 1.5x. If your app has a way for the user to scale the initial halflife, you can maybe use the same UI to let them scale the halflife of any flashcard if they feel it's quizzed too frequently or infrequently.
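That scaling step, written out as a tiny helper (the function name is hypothetical, not part of the Ebisu API; it is just the [a, b, t * 1.5] trick above):

```python
def scale_halflife(balanced_model, factor):
    """Stretch the halflife of a balanced (a, b, t) model (a ≈ b, so t is
    the halflife) by `factor`, keeping the Beta shape unchanged."""
    a, b, t = balanced_model
    return (a, b, t * factor)
```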

I'll make a function, something like rebalanceModel that'll do the above step exactly (without faking a quiz) so you can do this more easily.

Sorry for the lengthy post, your Svelte playground and comment have been very thought-provoking (I may have stumbled on a couple of improvements to the algorithm to make the runtime faster). Thank you for your continued interest and support!

fasiha avatar Dec 05 '19 05:12 fasiha

@mustafa0x a heads up & request—I've pushed a branch to this repo with a new API for updateRecall:

def updateRecall(prior, k: int, n: int, tnow: float)

The old boolean is gone; instead you supply two integers: k is the number of times this flashcard was successfully quizzed during this review session, out of n total times. In Anki and most quiz apps, n is fixed at 1, so k is 0 or 1 depending on whether the user was successful the one time the flashcard was presented. But Duolingo, for example, has a broader concept of "review session": during a single session, multiple flashcards can each be quizzed potentially multiple times, so we need this more flexible API.

Now, this can be (ab)used by a quiz app to model ease of recall, or to potentially ameliorate the impact of a failed quiz. E.g.,

  • even if the user sees the flashcard only once, maybe they claim that they got it "3 out of 3" times (n=3, k=3), which causes the updater to more aggressively update the model than if k=n=1.
  • Or, if the user failed the quiz, maybe they can say they got it "1 out of 3" times (n=3, k=1) so the model isn't penalized as heavily as the n=1, k=0 case.
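One hypothetical way a quiz app could wire these choices to rating buttons (the rating names and the (k, n) pairs are illustrative choices, not part of the Ebisu API):

```python
# Hypothetical mapping from a quiz app's self-rating buttons to the (k, n)
# arguments of the new updateRecall; names and numbers are illustrative.
RATING_TO_KN = {
    "confident": (3, 3),  # claim 3-of-3: aggressive boost for an easy recall
    "pass":      (1, 1),  # ordinary single-quiz success
    "soft_fail": (1, 3),  # claim 1-of-3: a failure penalized less harshly
    "fail":      (0, 1),  # ordinary single-quiz failure
}
```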

The math behind this assumes each of the n reviews of this flashcard is probabilistically independent of the others (i.e., that the review session is a binomial random variable, rather than a Bernoulli random variable). If the quiz app doesn't provide any feedback during the review session, then this might be a reasonable model, but I think even Duolingo tells you if you correctly conjugated a verb or not after each sentence it asks you to produce, so in that case the mathematical model is unrealistic.
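Concretely, the independence assumption means the session's likelihood is a plain binomial; a tiny helper makes the contrast with the n=1 Bernoulli case explicit:

```python
from math import comb

def session_likelihood(k, n, p):
    """Binomial probability of k successes in n independent quizzes of a
    card whose instantaneous recall probability is p -- the within-session
    model the new updateRecall assumes. If the app gives feedback between
    quizzes, the trials aren't truly independent and this is only an
    approximation."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)
```

For instance, session_likelihood(1, 1, p) is just p, while session_likelihood(1, 3, p) spreads one success over three assumed trials.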

I mention this because from my informal tests, the new updateRecall can be very aggressive with large values of n. For example, assuming an initial halflife of a day, the following shows the halflife after a review session five days later:

In [50]: model = ebisu.defaultModel(1.0, 3.)

In [51]: ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 1, 1, 5.0))
Out[51]: 2.3213221295470494

In [52]: ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 3, 3, 5.0))
Out[52]: 4.924110315681589

In [53]: ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 10, 10, 5.0))
Out[53]: 14.01871178521391

In [54]: ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 1, 3, 5.0))
Out[54]: 1.8122585776163989

In [55]: ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 5, 10, 5.0))
Out[55]: 3.5496017440233127

With all that as background, my request is, if you have the time and inclination, try out this new updateRecall. I feel you might have some unusual ways of using Ebisu so any feedback would be most useful to help build some folk wisdom about the right way this feature ought to be used.

You can find the updated Python file at https://github.com/fasiha/ebisu/blob/binomial-quiz/ebisu/ebisu.py in the binomial-quiz branch. I'll update the README and publish to PyPI in hopefully a couple of days in case that's more useful.

(Hat tip to #23 for prodding me.)

fasiha avatar Mar 08 '20 04:03 fasiha

> I'll update the README and publish to PyPI in hopefully a couple of days in case that's more useful.

Pushed 2.0.0 Python version. I'll try to update the JavaScript & Java versions sometime this week 🤞

fasiha avatar Mar 09 '20 01:03 fasiha

@fasiha Wonderful work, thank you! I hope to soon take an in-depth look at this change.

mustafa0x avatar Mar 09 '20 16:03 mustafa0x

@mustafa0x Did our two Svelte examples break 😢?

  • https://svelte.dev/repl/23a76045aa884384b8de8b737d682a7f?version=3.15.0
  • https://svelte.dev/repl/67711abbe3784fccb67dc6b3f614a6d6?version=3.15.0

They seem to render nothing.

fasiha avatar Jul 31 '20 21:07 fasiha

They were made before v2, so updateRecall is missing the 3rd parameter 😅. I'll update my sample now!

mustafa0x avatar Jul 31 '20 22:07 mustafa0x

Dohhh, thanks @mustafa0x, I didn't even think to check whether svelte.dev had version-pinned dependencies or fetched latest. Thanks! Updated mine too!

fasiha avatar Aug 01 '20 00:08 fasiha

Closed this because Ebisu v3 is moving away from Beta priors on recall to Gamma priors on half-life. Please feel free to reopen or add comments!

fasiha avatar Jan 14 '23 04:01 fasiha