Experimental N-Best results #3
I read this article https://jetbead.hatenablog.com/entry/20160119/1453139047 , and implemented.
Sorry for my late response, I had no time to respond here. Thanks for the nice patch! I have concerns about performance degradation and memory consumption when N is large. Maybe we should limit the maximum size of N to 2 or 3, but it's just a guess so some benchmark tests will be needed.
It seems TravisCI failed because of Java 8 installation failure. The installation part has been disabled in the master branch, can you please merge the latest master?
merged
@nakagami Sorry, seems there are some conflicts because the master branch drops python 2 support (#65). Could you resolve conflicts if you are still interested in this issue?
merged
Hi @nakagami, again, sorry for the delay...... but I looked at the code closely.
The backward_aster() looks fine to me as an experimental feature - but I think it should not be merged into the Tokenizer class as-is, since this is incompatible with current Tokenier API.
- Tokenizer#tokenize() method returns a nested list with the
n_bestparameter instead of a flat list or generator of tokens, this would be confusing for many users. - If the text length exceeds Tokenizer.MAX_CHUNK_SIZE,
n_bestparameter is silently ignored. I understand there is resource leak concern with current nBest implementation, but this implicit limitation is not intuitive. - Associated with the above point,
n_bestparameter doesn't work withstreamparameter.
I'd propose we create a new class for tracking nBest path, say NBestTokenizer, for now. Would it be possible? Once we solve the conflicts and resource leak concerns I mentioned above, we will be able to add nBest to the first-class feature for the (default) Tokenizer class.
Let me know if you have any thoughts / proposals about it.
Additionally, I think it would be really nice if the nBest feature supports dotfile option so that users can visualize the multiple paths with graphviz, though this does not limit us to proceed this issue. :slightly_smiling_face:
I hope someone's (including @mocobeta ) work. I think I don't understand detailed implementation of janone, and I don't have enogh time to work.
And sorry, I don't have good idea that n-best ignore stream parameter ignored.
@nakagami Sure thing, the internal of Tokenizer class is one of the most complicated stuff, since this has been evolved through the past releases. I will try to find the way to get this in to the master for future updates. Anyway thank you for working on this!