kuromoji
Debug graph for multi-tokenization
Overview
Adds the function debugMultiTokenize(), similar to the existing debugTokenize(), but with support for multi-tokenization. The function generates a graph in DOT format.
Details
Each tokenization corresponds to a path in the graph. We assign a color to each such path and color the edges accordingly. If an edge is included in more than one path it will have more than one color.
This feature also adds a legend to the graph to show which path corresponds to which color.
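As a rough illustration of the edge coloring described above, here is a minimal sketch of how a multi-colored edge can be written in DOT. It relies on Graphviz's color list syntax, where several colors joined with ':' are drawn as parallel lines on one edge. The class name, method name, and node ids are hypothetical and are not taken from this PR; the actual output of debugMultiTokenize() may be structured differently.

```java
import java.util.List;

public class EdgeColorSketch {
    // Hypothetical helper: builds one DOT edge statement.
    // Several colors are joined with ':' so Graphviz draws one parallel line per color.
    static String edge(String from, String to, String label, List<String> colors) {
        return String.format("  %s -> %s [label=\"%s\", color=\"%s\"];",
                from, to, label, String.join(":", colors));
    }

    public static void main(String[] args) {
        System.out.println("digraph sketch {");
        // Edge shared by two paths: gets both path colors.
        System.out.println(edge("n0", "n1", "shared", List.of("green", "blue")));
        // Edge used by a single path: gets only that path's color.
        System.out.println(edge("n1", "n2", "single", List.of("green")));
        System.out.println("}");
    }
}
```

Piping the printed text through `dot -Tpng` should render the shared edge as two parallel strokes.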
Screenshots
Possible Issues
There are a few issues that I would be happy to get opinions on.
Colors
The colors are generated by selecting equidistant angles in the HSB color model, starting from the green color which was previously used in the debugTokenize() function.
Pros
- It is very easy to generate colors in this way and it can be done for any number of paths.
- If the number of paths is small, the colors are easy to tell apart.
Cons
- Colors are not constant in the sense that "path 1" will have a different color if the graph contains 2 paths than if it contains 3 paths.
- If the number of paths is large, the last path will have a color very similar to the first path.
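The hue selection described above could look roughly like the following sketch. It assumes the starting green is pure green (hue 120 degrees, i.e. 1/3 of the color circle) with full saturation and brightness; the exact green used by debugTokenize() may differ. The class and method names are hypothetical.

```java
import java.awt.Color;

public class PathColorSketch {
    // Assumed starting hue: pure green at 1/3 of the hue circle.
    static final float START_HUE = 1f / 3f;

    // Returns one color per path, with hues spaced evenly around the circle.
    static Color[] pathColors(int pathCount) {
        Color[] colors = new Color[pathCount];
        for (int i = 0; i < pathCount; i++) {
            // Step by 1/pathCount of the circle and wrap around at 1.0.
            float hue = (START_HUE + (float) i / pathCount) % 1f;
            colors[i] = Color.getHSBColor(hue, 1f, 1f);
        }
        return colors;
    }

    // Formats a color as the "#rrggbb" string that DOT accepts.
    static String toDotColor(Color c) {
        return String.format("#%02x%02x%02x", c.getRed(), c.getGreen(), c.getBlue());
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 3; n++) {
            Color[] colors = pathColors(n);
            for (int i = 0; i < n; i++) {
                System.out.println(n + " paths, index " + i + ": " + toDotColor(colors[i]));
            }
        }
    }
}
```

With this zero-based indexing, the path at index 0 always keeps the starting hue, while every later index shifts when the path count changes, which is the instability noted in the cons. The hue step of 1/pathCount also shrinks as the count grows, so neighboring paths become harder to tell apart.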
Legend
As far as I know, DOT does not have a simple way to make legends. The one being used right now is built as a custom subgraph cluster. Because DOT decides the positions and lengths of the edges, I think the legend ends up unnecessarily wide. Maybe there is a better way to create it.
The legend is placed in the bottom left, which might not be ideal.
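One possible alternative to the cluster-based legend, offered only as a sketch and not as what this PR implements: a single plaintext node whose label is a Graphviz HTML-like table with one colored row per path. Since the legend contains no edges, its width depends only on the table contents. The class name, method name, and the node id `legend` are hypothetical, and this does not by itself control where the legend is placed.

```java
import java.util.List;

public class LegendSketch {
    // Hypothetical helper: emits a DOT node whose HTML-like label is a small table,
    // one row per path, with the row background set to that path's color.
    static String legendNode(List<String> hexColors) {
        StringBuilder sb = new StringBuilder();
        sb.append("  legend [shape=plaintext, label=<\n");
        sb.append("    <TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\">\n");
        for (int i = 0; i < hexColors.size(); i++) {
            sb.append(String.format("      <TR><TD BGCOLOR=\"%s\">path %d</TD></TR>%n",
                    hexColors.get(i), i + 1));
        }
        sb.append("    </TABLE>>];\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("digraph sketch {");
        System.out.print(legendNode(List.of("#00ff00", "#0000ff", "#ff0000")));
        System.out.println("}");
    }
}
```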