
Ability to continue model fitting

Open tboggs opened this issue 10 years ago • 12 comments

For sufficiently large vocabularies/corpora, fitting the model can take quite a while. If the model hasn't converged after the chosen number of iterations, one must call fit (or fit_transform) again with a larger number of iterations, essentially repeating all the previous iterations before continuing to improve the model fit. This change adds a continue_fitting method to the LDA class that allows one to continue where previous fitting left off. To accomplish this I did the following:

  • Split the _fit method into _fit and _run_fitting_iterations (which is called by _fit).
  • Do not delete the WS, DS, and ZS members after fitting (I added a delete_temp_vars method to explicitly do that, if desired).
  • Added the continue_fitting method (which also calls _run_fitting_iterations).
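In outline, the split described above might look like the following runnable toy. The Gibbs sweep and initialization are stand-ins here, and everything except the method names listed above is hypothetical:

```python
import numpy as np

class LDA:
    """Runnable toy of the proposed split; the real Gibbs sweep lives in Cython."""

    def __init__(self, n_iter=100):
        self.n_iter = n_iter

    def _fit(self, X):
        # initialization builds WS, DS, ZS and the count matrices ndz, nzw
        self.ZS = np.zeros(10, dtype=np.intc)  # stand-in for topic assignments
        self.iterations_run = 0
        self._run_fitting_iterations(self.n_iter)
        # note: WS, DS, ZS are intentionally no longer deleted here

    def _run_fitting_iterations(self, n_iter):
        for _ in range(n_iter):
            self.iterations_run += 1  # stand-in for one Gibbs sweep over ZS

    def continue_fitting(self, n_iter):
        """Resume sampling from the retained state instead of starting over."""
        if not hasattr(self, 'ZS'):
            raise ValueError("temporary variables were deleted; cannot continue")
        self._run_fitting_iterations(n_iter)

    def delete_temp_vars(self):
        """Explicitly free the large per-token arrays when no longer needed."""
        for name in ('WS', 'DS', 'ZS'):
            if hasattr(self, name):
                delattr(self, name)
```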

Lastly, I set the random_state member in LDA.__init__. Take a look at that specifically and let me know if that change would adversely affect anything. I'm not sure what the intent of the following two lines was:

    self.random_state = random_state
    rng = sklearn.utils.check_random_state(random_state)

When the random_state argument is None, it appears that the random seed will be initialized twice: once when rng is set (in the line above) and again when self.random_state is used (since it is still equal to None above). Note that my changes still have the seed set twice (though I could easily change that) - the difference is that the random_state member is explicitly initialized in the constructor now.

tboggs avatar Aug 01 '14 03:08 tboggs

These are good ideas! I still think HCA or Mallet is probably the right tool for serious fitting, but I'm happy to implement these changes. For simplicity's sake, and to keep to sklearn's philosophy, I'd like the interface to be as simple as possible. (One imagined audience is researchers in the humanities and social sciences with small corpora.) I like deleting WS and DS because they seem likely to confuse people if they are undocumented, and they are also easy to recover. ZS can be recovered, with some effort, from ndz and nzw. WS, DS, and ZS are also big in terms of memory, so it's useful to delete them.

What about adding a resume=True parameter to self.fit() and having it recover ZS from ndz? That would only add a tiny number of lines, I think. It also might be useful to look at how gensim handles resuming, if it does.
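One sketch of what recovering ZS from ndz could look like (assuming ndz is the document-topic count matrix with tokens grouped by document; as noted, a faithful recovery would also need to respect nzw):

```python
import numpy as np

def recover_ZS(ndz):
    """Rebuild a topic-assignment vector from the document-topic counts.

    The exact token-to-topic mapping is lost, so this only produces *some*
    assignment consistent with ndz; a faithful recovery would also need to
    respect the topic-word counts nzw.
    """
    ZS = []
    for d in range(ndz.shape[0]):
        for k in range(ndz.shape[1]):
            ZS.extend([k] * int(ndz[d, k]))
    return np.array(ZS, dtype=np.intc)
```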

Let me add some code comments to the random_state business. Essentially, I'm worried about people being able to see the random seed they originally passed. For example, if the seed is set to 5 and I do something like self.random_state = sklearn.utils.check_random_state(random_state), I've lost the 5. There might be a better way to do this. I need to look at how sklearn handles this.

ariddell avatar Aug 01 '14 11:08 ariddell

I thought about adding options to fit but it seemed that it would be more confusing rather than less. First, continuing only makes sense for the same corpus so either the X value becomes an irrelevant argument in that case or there is a danger of a user passing in a different corpus (X) when they try to continue. So I did it the way in the pull request so that fit would keep its current interface and avoid any ambiguities.

With regard to the temporary values, I see your point. I'm not sure how better to handle that. One option is to accept a keyword to fit (e.g., enable_continue or something similar) that would cause the temp variables to be retained. If they don't pass that keyword, then they wouldn't be able to continue fitting.

Regarding random_state, I'm not sure I follow. In the original __init__ code (quoted above), it looks like the user will be unable to get the initial random_state when the default is used. That is, when the user doesn't pass random_state as an arg, then self.random_state is set to None before rng is initialized. It seems like it would be better to do this:

    self.random_state = sklearn.utils.check_random_state(random_state)
    # random numbers that are reused
    self._rands = self.random_state.rand(1000)

That is, just eliminate the temporary rng variable. But maybe I'm misunderstanding you. If the intent is to make the run reproducible, wouldn't passing a value of 5 work with the code above since self.random_state would be set in a deterministic way from the value given?
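For what it's worth, for an integer seed check_random_state just wraps it in a RandomState, so a fixed seed does make the reusable numbers deterministic:

```python
import numpy as np

# sklearn.utils.check_random_state(5) returns np.random.RandomState(5),
# so two models constructed with the same seed draw identical _rands
rands_a = np.random.RandomState(5).rand(1000)
rands_b = np.random.RandomState(5).rand(1000)
```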

tboggs avatar Aug 01 '14 12:08 tboggs

Re: random_state, the issue is that you want to record the seed that was passed somewhere so someone can look it up (in case they forgot it) and then rerun the model to get the exact same results. I realize it's a little confusing. There's probably a clearer way.

ariddell avatar Aug 01 '14 14:08 ariddell

You make some good points. Can I think about this more in a week or so when I have more time? In principle, I'd really like to stick to the sklearn interface as closely as possible. Perhaps they have a method for resuming fits? I trust you're comfortable using your fork for the immediate future.

For example, I'd really like it if one could just drop in LDA as a replacement for NMF: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html I think it would also be great if the documentation were as terse and as welcoming.

ariddell avatar Aug 01 '14 14:08 ariddell

Take your time. I'm content using my fork in the meantime.

Understood regarding the sklearn interface. But one difference I see (maybe I'm missing something) is that the NMF interface lets the user specify stopping criteria (in addition to max_iter). I suppose if there were a stopping criterion in the LDA model, then one could just set n_iter to a huge number and the model will stop when fitted. But currently, it always runs for the specified number of iterations. If that number is too high, one has to wait a long time and if it's too low, then the process has to be repeated with a higher n_iter, which becomes a lengthy trial & error process.

Regarding random_state, I think we're saying the same thing. If one currently uses the default random_state arg, self._rands is set before self.random_state is set to a non-None state. So there is no way for the user to get the initial random state to reproduce the data. Perhaps self.random_state could be initialized via check_random_state in LDA.__init__ and setting self._rands could be deferred until fit is called. That would give the user the opportunity to access the random_state member before it is actually used to generate values.
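A minimal sketch of that deferral (names hypothetical; a plain np.random.RandomState stands in for sklearn.utils.check_random_state):

```python
import numpy as np

class LDA:
    """Sketch of deferring _rands until fit; names hypothetical."""

    def __init__(self, random_state=None):
        # sklearn.utils.check_random_state would normally do this conversion;
        # storing the RandomState here lets the user inspect it before any use
        if isinstance(random_state, np.random.RandomState):
            self.random_state = random_state
        else:
            self.random_state = np.random.RandomState(random_state)
        self._rands = None  # drawn lazily in fit, not in the constructor

    def fit(self, X=None):
        if self._rands is None:
            self._rands = self.random_state.rand(1000)
        # ... run the sampler using self._rands ...
        return self
```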

tboggs avatar Aug 01 '14 14:08 tboggs

Maybe one could do something innovative here. If one ran two chains in parallel threads, one could test for convergence, I think. n_iter could become min_iter and max_iter, which could also be set independently.

ariddell avatar Aug 01 '14 14:08 ariddell

Wait... I thought I was the one who was supposed to be breaking the sklearn paradigm. :wink:

Just kidding.

There are likely a few issues with running multiple threads, one of which is doubling the processing requirements. There must be some commonly used stopping criteria for learning topic models. Perhaps a tolerance on the change in log(perplexity)?
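One hypothetical shape for such a criterion, checking the relative change in the complete log-likelihood every few sweeps (step_loglike is an assumed callback; log-perplexity is a monotone transform of the log-likelihood, so either statistic works):

```python
def fit_until_converged(step_loglike, max_iter=2000, check_every=10, tol=1e-4):
    """Keep sampling until the complete log-likelihood stops moving.

    step_loglike is an assumed callback that runs one Gibbs sweep and
    returns the current complete log-likelihood. Returns the number of
    iterations actually run.
    """
    prev = None
    for it in range(1, max_iter + 1):
        ll = step_loglike()
        if it % check_every == 0:
            if prev is not None and abs(ll - prev) < tol * abs(prev):
                return it  # converged: relative change below tolerance
            prev = ll
    return max_iter
```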

tboggs avatar Aug 01 '14 14:08 tboggs

Processing requirements will never be a problem. Everyone has more than 1 core these days; one can release the thread in Cython. WS and DS can be shared by both threads, only ZS will differ, so only modestly more memory.

Here's the idea:

  • two threads with random initialization, test for convergence.
  • alternatively, one thread with random initialization, another thread initializes with NMF and test for convergence.

If this allows users to simply set max_iter then it would move us closer to the simplicity of sklearn. That, at least, is my thinking.

ariddell avatar Aug 01 '14 15:08 ariddell

Test for convergence by calculating the Rhat of the complete log-likelihood of the two chains. Perhaps one could do this every 100 iterations.
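For reference, the Gelman-Rubin potential scale reduction factor (Rhat) over equal-length chains of log-likelihood values could be computed like this (standard formula; not part of the package):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (Rhat) for equal-length chains.

    chains: array of shape (m, n), one row per chain. Values near 1
    suggest the chains have mixed; large values indicate non-convergence.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)
```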

ariddell avatar Aug 01 '14 15:08 ariddell

I've had another thought. What if one passed data from the existing model (e.g., the phi and the theta) as initial values to a fresh fit? It would be continuation in all but name. How does that sound?

ariddell avatar Aug 14 '14 19:08 ariddell

I think that would work. Do you mean calling fit on the same model but with phi and theta as arguments or creating a new model with phi and theta as arguments to __init__? The interface might be a little messier with the first interpretation. If you meant the second, there could be a static method to create a new model from an existing model (or model parameters). i.e., there could be a static method that is called like

    model2 = LDA.from_model(<args>)

where <args> would either be the first model or the learned parameters from the first model.
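Such a constructor-style method might look like this sketch (everything here, including the parameter names, is hypothetical):

```python
import numpy as np

class LDA:
    """Sketch only; all parameter names here are hypothetical."""

    def __init__(self, n_topics, n_iter=100, initial_ndz=None, initial_nzw=None):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.ndz = initial_ndz  # would seed the sampler's count matrices
        self.nzw = initial_nzw

    @classmethod
    def from_model(cls, model, n_iter=100):
        """Create a new model initialized from a fitted model's counts."""
        return cls(model.n_topics, n_iter=n_iter,
                   initial_ndz=model.ndz, initial_nzw=model.nzw)
```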

In the runs that I did, there seemed to be a significant amount of time spent building the initial internal model data arrays so it would be nice if one could set a flag to not delete them (so they could be easily reused in the second model) but I understand your concern about keeping those parameters around unnecessarily.

Regarding model convergence, unless there is no other reasonable option, I would avoid trying to run two models concurrently to test for convergence. It seems like there could be situations where they appear to converge but actually have not. More importantly, I think it would be better to use the multiple cores for solving the "main" model (i.e., utilize multithreading when solving the model). Otherwise, even with multithreading, the effective number of cores would always be cut in half to run the second model.

tboggs avatar Aug 14 '14 20:08 tboggs

I mean the second. One would add new parameters like initial_theta and initial_phi (or initial_ndz and initial_nzw) to __init__.

Multiple cores is a great idea. Someday I'll get around to this. It looks like one could do something with atomic operations that are in recent versions of GCC. hca does this very well.

ariddell avatar Aug 14 '14 20:08 ariddell