tomotopy
Reproducibility issues even after setting model seed
Hi, thank you for all your work on this amazing library!
I'm running into a strange issue with reproducibility: even after setting the model seed, I sometimes get different LDA results for the same documents (a processed subset of the BBC News dataset).
My code is very simple: it reads a text file in which each line is a single document of space-separated tokens, and trains an LDAModel over the data. I've disabled parallel processing to rule out any nondeterminism from that side as well.
import tomotopy as tp

model = tp.LDAModel(k=5, seed=123456789)
with open("docs.txt", "r", encoding="utf8") as fp:
    for line in fp:
        model.add_doc(line.split())

for i in range(0, 1000, 100):
    model.train(100, workers=1, parallel=tp.ParallelScheme.NONE)
    print(f"Iteration: {i + 100} LL: {model.ll_per_word:.5f}")
When I run the code, I usually get the following output:
Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88406
Iteration: 400 LL: -7.86940
Iteration: 500 LL: -7.85939
Iteration: 600 LL: -7.84511
Iteration: 700 LL: -7.84116
Iteration: 800 LL: -7.83339
Iteration: 900 LL: -7.83029
Iteration: 1000 LL: -7.82927
But about 30% of the time I get the following output instead, where the log-likelihoods start to diverge at iteration 300:
Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88715
Iteration: 400 LL: -7.87158
Iteration: 500 LL: -7.86242
Iteration: 600 LL: -7.84669
Iteration: 700 LL: -7.84028
Iteration: 800 LL: -7.82794
Iteration: 900 LL: -7.82512
Iteration: 1000 LL: -7.82317
The results switch randomly between these two possibilities (I haven't seen any other variations turn up), but I can't figure out where the indeterminacy is coming from. I'd appreciate any advice or help you could provide!
Attached: docs.txt
I've discovered that PYTHONHASHSEED seems to be affecting the results. Invoking the script as:
PYTHONHASHSEED=429467291 python lda.py
always gives the first set of results, and invoking it as:
PYTHONHASHSEED=429467292 python lda.py
always gives the second set of results.
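For context, a possible mechanism (my assumption only; I haven't confirmed this is what tomotopy or the surrounding code does): CPython salts string hashes with a per-process random value unless PYTHONHASHSEED is fixed, so any step that iterates over a set or dict of tokens can visit them in a different order from run to run. A minimal sketch of the effect, using a child process per hash seed:

```python
# Sketch: string hash salting in CPython. With PYTHONHASHSEED fixed,
# hash-dependent orderings are stable across runs; with it unset (or
# changed), they can differ from process to process.
import os
import subprocess
import sys

# Order three tokens by their (salted) string hashes in a child process.
CODE = "print(sorted({'alpha', 'beta', 'gamma'}, key=hash))"

def run_with_seed(seed):
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CODE],
        env=env, capture_output=True, text=True,
    )
    return result.stdout.strip()

# Two runs with the same fixed seed always agree...
print(run_with_seed("1") == run_with_seed("1"))  # True
# ...while different seeds may order the tokens differently.
print(run_with_seed("1"))
print(run_with_seed("2"))
```

If anything between tokenization and `add_doc` depends on such an ordering, the same model seed could still yield different trajectories under different hash seeds.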
I wonder if it would be possible to have the algorithm give stable results across different hash seeds?
Thank you for reporting a potential bug. I'll examine your code and data and figure out why PYTHONHASHSEED affects the results.
Update: after a long, fruitless wild-goose chase trying to track down the source of the indeterminacy, something seems to have fixed it, and I now get only the second set of results no matter what PYTHONHASHSEED is set to.
Chalk it up to the oddest of heisenbugs. I'll be setting PYTHONHASHSEED for my project from now on to be safe, but I don't think I can reproduce this anymore.
As far as I can tell, it was clearing the Windows 10 Prefetch cache that did it, so on the off chance that someone else on Windows runs into the same kind of behaviour, that's something that might help!
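In case it helps anyone else pinning PYTHONHASHSEED: the variable only takes effect at interpreter startup, so a script can't change it for itself at runtime, but it can at least detect a mismatch and warn. A small sketch (the function name, message, and seed value are my own illustration, not part of tomotopy):

```python
# Sketch: guard against forgetting to set PYTHONHASHSEED. Since the
# env var is read at interpreter startup, the best a running script
# can do is check it and warn (or refuse to continue).
import os
import sys

def check_hash_seed(expected="123456789"):  # example seed value
    """Return True if the process was started with the expected hash seed."""
    actual = os.environ.get("PYTHONHASHSEED")
    if actual != expected:
        print(
            f"warning: PYTHONHASHSEED is {actual!r}, expected {expected!r}; "
            f"restart as: PYTHONHASHSEED={expected} python lda.py",
            file=sys.stderr,
        )
        return False
    return True

check_hash_seed()
```

Calling this at the top of the training script makes it obvious when a run was launched without the pinned seed, rather than silently producing unreproducible results.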