
Reproducibility issues even after setting model seed

Open ZechyW opened this issue 4 years ago • 3 comments

Hi, thank you for all your work on this amazing library!

I'm running into a strange issue with reproducibility: Even after setting the model seed, I'm still sometimes getting different LDA results with the same documents (a processed subset of the BBC news dataset).

My code is very simple: it reads from a text file, where each line represents a single document with space-separated tokens, and trains an LDAModel over the data. I've also disabled parallel processing to rule out any nondeterminism from threading.

import tomotopy as tp

with open("docs.txt", "r", encoding="utf8") as fp:
    model = tp.LDAModel(k=5, seed=123456789)
    for line in fp:
        model.add_doc(line.split())

for i in range(0, 1000, 100):
    model.train(100, workers=1, parallel=tp.ParallelScheme.NONE)
    print(f"Iteration: {i + 100} LL: {model.ll_per_word:.5f}")

When I run the code, I usually get the following output:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88406
Iteration: 400 LL: -7.86940
Iteration: 500 LL: -7.85939
Iteration: 600 LL: -7.84511
Iteration: 700 LL: -7.84116
Iteration: 800 LL: -7.83339
Iteration: 900 LL: -7.83029
Iteration: 1000 LL: -7.82927

But about 30% of the time I get the following output instead, where the log-likelihoods diverge starting at iteration 300:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88715
Iteration: 400 LL: -7.87158
Iteration: 500 LL: -7.86242
Iteration: 600 LL: -7.84669
Iteration: 700 LL: -7.84028
Iteration: 800 LL: -7.82794
Iteration: 900 LL: -7.82512
Iteration: 1000 LL: -7.82317

The results switch randomly between these two possibilities (I haven't seen any other variations turn up), but I just can't figure out where the indeterminacy is coming from. I'd appreciate any advice or help you could provide!
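(For anyone hitting similar symptoms: one generic source of this kind of run-to-run variation is CPython's string hash randomization, PEP 456. This is not tomotopy-specific, just a minimal sketch showing that, without a fixed PYTHONHASHSEED, each interpreter process hashes the same string differently, so any code that iterates a set or dict of string keys can see a different order per run.)

```python
import os
import subprocess
import sys

# Run several child interpreters with hash randomization enabled
# (PYTHONHASHSEED unset) and collect the hash of the same string.
env = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
hashes = {
    subprocess.run(
        [sys.executable, "-c", "print(hash('document token'))"],
        capture_output=True, text=True, env=env,
    ).stdout.strip()
    for _ in range(5)
}
# With randomized hashing, the five processes almost certainly disagree.
print(len(hashes))
```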

Attached: docs.txt

ZechyW avatar Jun 10 '20 07:06 ZechyW

Update: I've discovered that PYTHONHASHSEED seems to be affecting the results --

Invoking the script as: PYTHONHASHSEED=429467291 python lda.py always gives the first set of results, and invoking it as: PYTHONHASHSEED=429467292 python lda.py always gives the second set of results.

I wonder if it would be possible to have the algorithm give stable results across different hash seeds?
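(A minimal sketch, not tomotopy-specific, confirming the behaviour described above: when PYTHONHASHSEED is pinned to the same value, string hashes -- and anything whose iteration order depends on them -- are identical across interpreter runs.)

```python
import os
import subprocess
import sys

# Pin the hash seed for child interpreters; with the same PYTHONHASHSEED,
# hash('token') is identical in every run.
env = dict(os.environ, PYTHONHASHSEED="123456789")
runs = [
    subprocess.run(
        [sys.executable, "-c", "print(hash('token'))"],
        capture_output=True, text=True, env=env,
    ).stdout.strip()
    for _ in range(2)
]
print(runs[0] == runs[1])  # True: a fixed seed gives reproducible hashes
```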

ZechyW avatar Jun 10 '20 10:06 ZechyW

Thank you for reporting a potential bug. I'll examine your code and data and figure out why PYTHONHASHSEED affects the results.

bab2min avatar Jun 12 '20 11:06 bab2min

Update: After a long, fruitless wild-goose chase trying to track down the source of the indeterminacy, something seems to have fixed it, and I now only get the second set of results no matter what PYTHONHASHSEED is set to. Chalk it up to the oddest of heisenbugs. I'll be setting PYTHONHASHSEED for my project from now on to be safe, but I don't think I can reproduce this anymore.

As far as I can tell, it was clearing the Windows 10 Prefetch cache that did it, so on the off chance that someone else on Windows runs into the same kind of behaviour, that's something worth trying!

ZechyW avatar Jun 15 '20 01:06 ZechyW