
How can I make it calculate faster?

Open A11en0 opened this issue 3 years ago • 8 comments

The calculation is a little bit slow; is there some method to speed it up? Can I use a GPU instead of the CPU?

A11en0 avatar Oct 26 '21 03:10 A11en0

That depends a lot on how you actually use it.

The main bottleneck is reading the index. Using a GPU instead of a CPU most probably does not help since there are no expensive matrix operations :wink:

Do you have a large set of topics that you would like to evaluate at once?

MichaelRoeder avatar Oct 26 '21 08:10 MichaelRoeder

Yes, I embedded it into my training code to evaluate C_V every 10 epochs, and it apparently slows down my training speed.

A11en0 avatar Oct 26 '21 08:10 A11en0

By the way, I set my number of topics to 20 and topic words to 15. I guess calculating all 20 topics at once instead of calculating them one by one would boost the speed, since it just needs to read the index file once. Can this be achieved?

A11en0 avatar Oct 26 '21 08:10 A11en0

  1. You add an additional step to your training that tries to evaluate your topics based on big statistics that it has to gather. So it is expected that it will need more time :wink: However, I understand that the longer training time is annoying.
  2. I am not fully clear about your setup. I assume that you have a Python program and that you run the Palmetto web service in parallel. Is that right? Or do you use Palmetto on the command line? :thinking:

MichaelRoeder avatar Oct 26 '21 12:10 MichaelRoeder

I use the palmetto-py API palmetto.get_coherence() in my Python training code, evaluating every 10 epochs. In my opinion, it loads the index file once each time I call the API, but it can only evaluate one topic per call. So, when I evaluate the whole topic distribution (K topics), I need to call it K times! That takes too much time.

So, I suggest adding a new API that can calculate all K topics in a single call, as it would only need to load the index file once.
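
For illustration, a minimal sketch of the per-topic calling pattern described above (assuming the palmetto-py `Palmetto` client and its `get_coherence` method; the endpoint URL and topic word lists here are placeholders):

```python
from palmettopy.palmetto import Palmetto

# Client pointing at a locally running Palmetto web service (placeholder URL).
palmetto = Palmetto("http://localhost:8080/palmetto-webapp/service/")

# K topics, each represented by its top words (placeholder data).
topics = [
    ["cake", "apple", "banana", "cherry", "chocolate"],
    ["computer", "keyboard", "monitor", "mouse", "laptop"],
]

# One call (and one index search on the server) per topic -> K calls in total.
coherences = [palmetto.get_coherence(words, coherence_type="cv") for words in topics]
```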

A11en0 avatar Oct 27 '21 07:10 A11en0

Thanks for clarifying your setup. Your assumption is not correct. You start the web service only once, at the very beginning of your program (at least I assume so). The calls to the web service will simply always cause a search on the index; it doesn't matter how many topics you send at once. So although I think the extension of the API might be a good idea, it won't change the runtime.

The only change that I can think of at the moment with respect to runtime might be the implementation of a cache. The cache could be implemented as a decorator of the WindowSupportingLuceneCorpusAdapter class and cache the result of the requestDocumentsWithWord method. However, it may consume a lot of memory. Apart from that, I simply won't have time during the next months to look into that. Feel free to create a Pull Request or ask for some guidance if you want to give it a try as it is not as trivial as it might seem.

I may have two suggestions that could improve the runtime (I guess you already thought of them):

  1. Try to avoid unnecessary calls. You can store the coherence values that have been calculated in previous epochs. If a topic has the same top words as in the previous evaluation run, you don't have to evaluate it again (see the sketch after this list). Depending on the topic coherence, the order of the top words may not have an influence on the coherence value (e.g., for C_V, the order of the words doesn't matter).
  2. Depending on how far along you already are with your approach, you may want to think about using a "cheaper" topic coherence. Most of the coherences in our paper use a window-based approach and are pretty costly. However, the UMass coherence is quite fast since it does not make use of the positions of words within the documents. So for checking whether your overall approach works, this might be an easy alternative to get some fast, first results. However, the quality of the coherence results is not as good.
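
A minimal sketch of the first suggestion (assuming the palmetto-py `Palmetto` client and a coherence such as C_V where word order does not matter; the URL is a placeholder):

```python
from palmettopy.palmetto import Palmetto

palmetto = Palmetto("http://localhost:8080/palmetto-webapp/service/")  # placeholder URL
coherence_cache = {}  # maps a set of top words to its previously computed C_V value

def cached_coherence(top_words):
    # frozenset ignores word order; use tuple(top_words) instead if the chosen
    # coherence is sensitive to the order of the top words.
    key = frozenset(top_words)
    if key not in coherence_cache:
        coherence_cache[key] = palmetto.get_coherence(top_words, coherence_type="cv")
    return coherence_cache[key]
```

Topics whose top words did not change since the previous evaluation run are then served from the cache instead of causing another web service call.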

MichaelRoeder avatar Oct 27 '21 11:10 MichaelRoeder

Thanks for your careful reply! I get your advice, but it probably requires changing the Java code in the project, and I'm afraid I don't have much time to do that. I just use this wonderful tool to test my own topic model, so I can't spend too much time on the coherence calculation methods.

I have a new problem: the Python interface often gives me an "endpoint down" error when I run the backend server locally. I built the Tomcat-based server following the instructions at https://github.com/dice-group/Palmetto/wiki/How-Palmetto-can-be-used. Does the problem come from the Python interface or the Java backend? I have no idea.

A11en0 avatar Oct 27 '21 12:10 A11en0

Yes, I can understand that. Seems like nobody has a lot of time these days :wink:

You can increase the time the Python client waits by setting the timeout attribute.
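
For example (a sketch; the exact constructor arguments may differ between palmetto-py versions, and the URL is a placeholder):

```python
from palmettopy.palmetto import Palmetto

palmetto = Palmetto("http://localhost:8080/palmetto-webapp/service/")  # placeholder URL
palmetto.timeout = 120  # seconds to wait before the client reports the endpoint as down
```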

As an alternative, you could also run Palmetto from the command line and read the result from the command-line output. This would ensure that your program waits, and you would get rid of the HTTP-based communication. However, I am not sure how much effort it is to implement that within Python. So maybe it is just another weird idea :smile:
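
A rough sketch of that idea (the jar file name, the index directory, the argument order, and the output format are assumptions here and should be checked against the wiki page linked above):

```python
import subprocess

# One topic per line, top words separated by spaces (placeholder data).
topics = [["cake", "apple", "banana"], ["computer", "keyboard", "mouse"]]
with open("topics.txt", "w") as f:
    for words in topics:
        f.write(" ".join(words) + "\n")

# Blocking call: the Python program waits until Palmetto has written its results.
result = subprocess.run(
    ["java", "-jar", "palmetto-jar-with-dependencies.jar",
     "wikipedia_bd", "C_V", "topics.txt"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # parse the per-topic coherence values from this output
```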

MichaelRoeder avatar Oct 27 '21 16:10 MichaelRoeder