horizont
Ability to continue model fitting
For sufficiently large vocabularies/corpora, fitting the model can take quite a while. If the model hasn't converged after the chosen number of iterations, one must call `fit` (or `fit_transform`) again with a larger number of iterations, essentially repeating all the previous iterations before continuing to improve the model fit. This change adds a `continue_fitting` method to the `LDA` class that allows one to continue where previous fitting left off. To accomplish this I did the following (sketched in code after the list):
- Split the `_fit` method into `_fit` and `_run_fitting_iterations` (which is called by `_fit`).
- Do not delete the `WS`, `DS`, and `ZS` members after fitting (I added a `delete_temp_vars` method to explicitly do that, if desired).
- Added the `continue_fitting` method (which also calls `_run_fitting_iterations`).
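To make the shape of the change concrete, here is a minimal sketch of the split. This is not the actual patch: helper names such as `_initialize` and `_sample_topics` are placeholders for whatever the implementation already uses internally.

```python
class LDA:
    def _fit(self, X):
        """Build the sampler state (WS, DS, ZS, count matrices) and run n_iter sweeps."""
        self._initialize(X)                       # allocates WS, DS, ZS and the counts
        self._run_fitting_iterations(self.n_iter)
        return self

    def _run_fitting_iterations(self, n_iter):
        """Run Gibbs sampling sweeps against the already-initialized state."""
        for _ in range(n_iter):
            self._sample_topics()                 # one full sweep over all tokens

    def continue_fitting(self, n_iter):
        """Resume sampling where the previous fit (or continue_fitting) left off."""
        if not hasattr(self, "ZS"):
            raise ValueError("no retained state; call fit first and do not "
                             "call delete_temp_vars beforehand")
        self._run_fitting_iterations(n_iter)
        return self

    def delete_temp_vars(self):
        """Explicitly free the large WS/DS/ZS arrays once no more fitting is needed."""
        del self.WS, self.DS, self.ZS
```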
Lastly, I set the `random_state` member in `LDA.__init__`. Take a look at that specifically and let me know if that change would adversely affect anything. I'm not sure what the intent of the following two lines was:

```python
self.random_state = random_state
rng = sklearn.utils.check_random_state(random_state)
```

When the `random_state` argument is `None`, it appears that the random seed will be initialized twice: once when `rng` is set (in the second line above) and again when `self.random_state` is used (since it is still equal to `None` at that point). Note that my changes still have the seed set twice (though I could easily change that); the difference is that the `random_state` member is explicitly initialized in the constructor now.
These are good ideas! I still think HCA or Mallet is probably the right tool for serious fitting, but I'm happy to implement these changes. For simplicity's sake and to keep to sklearn's philosophy, I'd like to have the interface be as simple as possible. (One imagined audience is researchers in the humanities and social sciences with small corpora.) I like deleting `WS` and `DS` because they seem likely to confuse people if they are undocumented, and they are also easy to recover. `ZS` can be recovered, with some effort, from `ndz` and `nzw`. `WS`, `DS`, and `ZS` are big in terms of memory as well, so it's useful to delete them.
What about adding a `resume=True` parameter to `self.fit()` and having one recover `ZS` from the `ndz`? That would only add a tiny number of lines, I think. It also might be useful to look at how `gensim` handles resuming, if it does.
Let me add some code comments to the random_state business. Essentially, I'm worried about people being able to see the random seed they originally passed. For example, if the seed is set to `5` and I do something like `self.random_state = sklearn.utils.check_random_state(random_state)`, I've lost the 5. There might be a better way to do this. I need to look at how sklearn handles this.
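One possible way to keep both pieces of information (a sketch only, not what sklearn or this package actually does) would be to store the raw argument and the generator under different names:

```python
import sklearn.utils

class LDA:
    def __init__(self, n_topics, random_state=None):
        self.n_topics = n_topics
        # remember exactly what the user passed (e.g. 5) so it can be inspected later
        self.random_state = random_state
        # build the actual generator separately; never overwrite the stored seed
        self._rng = sklearn.utils.check_random_state(random_state)
        self._rands = self._rng.rand(1000)  # random numbers that are reused
```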
I thought about adding options to `fit`, but it seemed that it would be more confusing rather than less. First, continuing only makes sense for the same corpus, so either the `X` value becomes an irrelevant argument in that case or there is a danger of a user passing in a different corpus (`X`) when they try to continue. So I did it the way in the pull request so that `fit` would keep its current interface and avoid any ambiguities.
With regard to the temporary values, I see your point. I'm not sure how better to handle that. One option is to accept a keyword to `fit` (e.g., `enable_continue` or something similar) that would cause the temp variables to be retained. If they don't pass that keyword, then they wouldn't be able to continue fitting.
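A rough sketch of that option, written as it might look on the `LDA` class (the keyword name and default are placeholders):

```python
def fit(self, X, y=None, enable_continue=False):
    """Fit the model to X; keep WS/DS/ZS only if the caller plans to resume."""
    self._fit(X)
    if not enable_continue:
        self.delete_temp_vars()   # default behaviour: free the large arrays
    return self
```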
Regarding `random_state`, I'm not sure I follow. In the original `__init__` code (quoted above), it looks like the user will be unable to get the initial `random_state` when the default is used. That is, when the user doesn't pass `random_state` as an arg, then `self.random_state` is set to `None` before `rng` is initialized. It seems like it would be better to do this:

```python
self.random_state = sklearn.utils.check_random_state(random_state)
# random numbers that are reused
self._rands = self.random_state.rand(1000)
```

That is, just eliminate the temporary `rng` variable. But maybe I'm misunderstanding you. If the intent is to make the run reproducible, wouldn't passing a value of `5` work with the code above, since `self.random_state` would be set in a deterministic way from the value given?
Re: `random_state`: the issue is that you want to record the seed passed somewhere so someone can look at it (in case they forgot it) and then rerun the model to get the exact same results. I realize it's a little confusing. There's probably a clearer way.
You make some good points. Can I think about this more in a week or so when I have more time? In principle, I'd really like to stick to the sklearn interface as closely as possible. Perhaps they have a method for resuming fits? I trust you're comfortable using your fork for the immediate future.
For example, I'd really like it if one could just drop in LDA as a replacement for NMF: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html I think it would also be great if the documentation were as terse and as welcoming.
Take your time. I'm content using my fork in the meantime.
Understood regarding the sklearn interface. But one difference I see (maybe I'm missing something) is that the NMF interface lets the user specify stopping criteria (in addition to `max_iter`). I suppose if there were a stopping criterion in the LDA model, then one could just set `n_iter` to a huge number and the model would stop when fitted. But currently, it always runs for the specified number of iterations. If that number is too high, one has to wait a long time, and if it's too low, then the process has to be repeated with a higher `n_iter`, which becomes a lengthy trial-and-error process.
Regarding `random_state`, I think we're saying the same thing. If one currently uses the default `random_state` arg, `self._rands` is set before `self.random_state` is set to a non-None state. So there is no way for the user to get the initial random state to reproduce the data. Perhaps `self.random_state` could be initialized via `check_random_state` in `LDA.__init__`, and setting `self._rands` could be deferred until `fit` is called. That would give the user the opportunity to access the `random_state` member before it is actually used to generate values.
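In other words, something along these lines (a sketch, not a tested patch; the 1000 simply matches the size of `_rands` in the code quoted earlier):

```python
import sklearn.utils

class LDA:
    def __init__(self, n_topics, random_state=None):
        self.n_topics = n_topics
        # normalize right away so the member can be inspected/reused before fitting
        self.random_state = sklearn.utils.check_random_state(random_state)

    def fit(self, X, y=None):
        # defer drawing the reusable random numbers until they are actually needed
        self._rands = self.random_state.rand(1000)
        # ... initialization and sampling would follow here ...
        return self
```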
Maybe one could do something innovative here. If one ran two threads in parallel, one could test for convergence, I think. `n_iter` could become `min_iter` and `max_iter`, but one could also set those two independently.
Wait... I thought I was the one who was supposed to be breaking the sklearn paradigm. :wink:
Just kidding.
There are likely a few issues with running multiple threads, one of which is doubling the processing requirements. There must be some commonly used stopping criteria for learning topic models. Perhaps a tolerance on the change in `log(perplexity)`?
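A stopping rule of that kind might look roughly like this. It is purely illustrative: it assumes the `continue_fitting` method from this pull request and a `loglikelihood()`-style method returning the complete log-likelihood of the current state; the tolerance and check interval are arbitrary.

```python
def fit_until_converged(model, X, max_extra_iter=2000, check_every=100, tol=1e-4):
    """Keep sampling until the change in log(perplexity) between checks drops below tol."""
    n_tokens = X.sum()
    model.fit(X)                                   # initial n_iter iterations
    prev = -model.loglikelihood() / n_tokens       # log(perplexity)
    for _ in range(0, max_extra_iter, check_every):
        model.continue_fitting(check_every)
        cur = -model.loglikelihood() / n_tokens
        if abs(prev - cur) < tol:
            break
        prev = cur
    return model
```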
Processing requirements will never be a problem. Everyone has more than one core these days; one can release the GIL in Cython. `WS` and `DS` can be shared by both threads; only `ZS` will differ, so there is only modestly more memory use.
Here's the idea:
- Two threads with random initialization; test for convergence.
- Alternatively, one thread with random initialization, another thread initialized with NMF; test for convergence.

If this allows users to simply set `max_iter`, then it would move us closer to the simplicity of sklearn. That, at least, is my thinking.

Test for convergence by calculating the Rhat of the complete log-likelihood of the two chains. Perhaps one could do this every 100 iterations.
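A sketch of that check, using the standard Gelman-Rubin statistic on the two chains' log-likelihood traces (variable names and the 1.1 threshold are illustrative):

```python
import numpy as np

def rhat(chain_a, chain_b):
    """Gelman-Rubin potential scale reduction factor for two equal-length chains."""
    chains = np.vstack([chain_a, chain_b])                    # shape (m=2, n)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    b = n / (m - 1) * np.sum((chain_means - chain_means.mean()) ** 2)  # between-chain
    w = chains.var(axis=1, ddof=1).mean()                              # within-chain
    var_plus = (n - 1) / n * w + b / n
    return np.sqrt(var_plus / w)

# Every 100 iterations, append each chain's complete log-likelihood to a list
# and stop once rhat(ll_chain_1, ll_chain_2) falls below, say, 1.1.
```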
I've had another thought. What if one passed data from the existing model (e.g., the phi and the theta) as initial values to a fresh fit? It would be continuation in all but name. How does that sound?
I think that would work. Do you mean calling `fit` on the same model but with phi and theta as arguments, or creating a new model with phi and theta as arguments to `__init__`? The interface might be a little messier with the first interpretation. If you meant the second, there could be a static method to create a new model from an existing model (or model parameters). I.e., there could be a static method that is called like

```python
model2 = LDA.from_model(<args>)
```

where `<args>` would either be the first model or the learned parameters from the first model.

In the runs that I did, there seemed to be a significant amount of time spent building the initial internal model data arrays, so it would be nice if one could set a flag to not delete them (so they could be easily reused in the second model), but I understand your concern about keeping those parameters around unnecessarily.
Regarding model convergence, unless there is no other reasonable option, I would avoid trying to run two models concurrently to test for convergence. It seems like there could be situations where they appear to converge but actually have not. More importantly, I think it would be better to use the multiple cores for solving the "main" model (i.e., utilize multithreading when solving the model). Otherwise, even with multithreading, the effective number of cores would always be cut in half to run the second model.
I mean the second. One would add new parameters like `initial_theta` and `initial_phi` (or `initial_ndz`/`initial_nzw`) to `__init__`.
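A sketch of that interface. Parameter and attribute names such as `initial_ndz`, `initial_nzw`, `ndz_`, and `nzw_` are placeholders here, and `from_model` is just the convenience wrapper suggested above:

```python
class LDA:
    def __init__(self, n_topics, n_iter=1000, random_state=None,
                 initial_ndz=None, initial_nzw=None):
        # when the initial_* count matrices are given, fitting would start from
        # them instead of from a random topic assignment
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        self.initial_ndz = initial_ndz
        self.initial_nzw = initial_nzw

    @classmethod
    def from_model(cls, other, **kwargs):
        """Build a new model warm-started from a previously fitted one."""
        return cls(n_topics=other.n_topics,
                   initial_ndz=other.ndz_,
                   initial_nzw=other.nzw_,
                   **kwargs)
```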
Multiple cores is a great idea. Someday I'll get around to this. It looks like one could do something with the atomic operations that are in recent versions of GCC. `hca` does this very well.