janome Experimental N-Best results #3

I read this article https://jetbead.hatenablog.com/entry/20160119/1453139047 , and implemented.

Oct 28 '19 10:10 nakagami

Sorry for my late response, I had no time to respond here. Thanks for the nice patch! I have concerns about performance degradation and memory consumption when N is large. Maybe we should limit the maximum size of N to 2 or 3, but it's just a guess so some benchmark tests will be needed.

Nov 19 '19 15:11 mocobeta

It seems TravisCI failed because of Java 8 installation failure. The installation part has been disabled in the master branch, can you please merge the latest master?

Nov 19 '19 16:11 mocobeta

merged

Nov 20 '19 00:11 nakagami

@nakagami Sorry, seems there are some conflicts because the master branch drops python 2 support (#65). Could you resolve conflicts if you are still interested in this issue?

Jan 17 '20 15:01 mocobeta

merged

Jan 17 '20 22:01 nakagami

Hi @nakagami, again, sorry for the delay...... but I looked at the code closely.

The backward_aster() looks fine to me as an experimental feature - but I think it should not be merged into the Tokenizer class as-is, since this is incompatible with current Tokenier API.

Tokenizer#tokenize() method returns a nested list with the n_best parameter instead of a flat list or generator of tokens, this would be confusing for many users.
If the text length exceeds Tokenizer.MAX_CHUNK_SIZE, n_best parameter is silently ignored. I understand there is resource leak concern with current nBest implementation, but this implicit limitation is not intuitive.
Associated with the above point, n_best parameter doesn't work with stream parameter.

I'd propose we create a new class for tracking nBest path, say NBestTokenizer, for now. Would it be possible? Once we solve the conflicts and resource leak concerns I mentioned above, we will be able to add nBest to the first-class feature for the (default) Tokenizer class. Let me know if you have any thoughts / proposals about it.

Additionally, I think it would be really nice if the nBest feature supports dotfile option so that users can visualize the multiple paths with graphviz, though this does not limit us to proceed this issue. :slightly_smiling_face:

Feb 23 '20 13:02 mocobeta

I hope someone's (including @mocobeta ) work. I think I don't understand detailed implementation of janone, and I don't have enogh time to work.

And sorry, I don't have good idea that n-best ignore stream parameter ignored.

Feb 23 '20 22:02 nakagami

@nakagami Sure thing, the internal of Tokenizer class is one of the most complicated stuff, since this has been evolved through the past releases. I will try to find the way to get this in to the master for future updates. Anyway thank you for working on this!

Feb 24 '20 01:02 mocobeta