fugashi
Support accessing more fields in MeCab node
This PR may resolve https://github.com/polm/fugashi/issues/76.
Here's an example:
from fugashi import Tagger

tagger = Tagger()
text = "麩菓子は、麩を主材料とした日本の菓子。"
node_list = tagger.parseToNodeList(text)
print("surface\twcost\tcost")
for node in node_list:
    surface = node.surface
    wcost = node.wcost
    cost = node.cost
    print("{}\t{}\t{}".format(surface, wcost, cost))
output:
surface wcost cost
麩 5477 12007
菓子 3841 19459
は -904 19947
、 -7508 16317
麩 5727 26362
を -1331 24371
主材 5114 33181
料 5720 38627
と -286 41321
し 2517 45155
た 3045 46329
日本 -2903 47649
の -294 48738
菓子 3841 55480
。 -3217 55216
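To show how the two new fields relate, here is a minimal sketch that derives the per-node connection cost from a few rows of the output above. It assumes MeCab's usual convention that `cost` is the cumulative best-path cost (previous node's `cost` plus connection cost plus this node's `wcost`) and that the stripped BOS node has a cost of 0. The helper `connection_costs` is hypothetical and not part of fugashi.

```python
# Rows copied from the output above: (surface, wcost, cost).
rows = [
    ("麩", 5477, 12007),
    ("菓子", 3841, 19459),
    ("は", -904, 19947),
    ("、", -7508, 16317),
]


def connection_costs(rows, bos_cost=0):
    """Return (surface, connection_cost) pairs.

    Assumes cost == previous cost + connection cost + wcost,
    and that the BOS node (not in the list) has cost `bos_cost`.
    """
    result = []
    prev_cost = bos_cost
    for surface, wcost, cost in rows:
        result.append((surface, cost - prev_cost - wcost))
        prev_cost = cost
    return result


conn = connection_costs(rows)
for surface, c in conn:
    print("{}\t{}".format(surface, c))
```

If the assumption holds, the remainder for each node is the transition cost between it and its predecessor, which cannot be computed for the first node unless the BOS cost is known.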
I just modified parseToNodeList() and nbestToNodeList() to keep BOS/EOS in the node list, because I think they could be useful in visualizing and analyzing the results.
Thank you for the PR. The code to add the fields looks fine, though there should be tests for it, even trivial ones. I can add them if you're not sure how to.
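As a rough idea of what a trivial test for the new fields might look like, the sketch below checks that every node exposes integer `wcost` and `cost` values. The function name `check_cost_fields` and the stand-in nodes are hypothetical; this is not the test that was eventually added to the repository. With fugashi installed, the result of `Tagger().parseToNodeList(text)` could presumably be passed in place of the stand-ins.

```python
from types import SimpleNamespace


def check_cost_fields(nodes):
    """Trivial check: each node has integer wcost and cost attributes."""
    for node in nodes:
        assert isinstance(node.wcost, int), node.surface
        assert isinstance(node.cost, int), node.surface
    return len(nodes)


# Stand-in nodes using values from the PR description (assumption: real
# fugashi nodes expose the same attribute names once this PR is merged).
fake_nodes = [
    SimpleNamespace(surface="麩", wcost=5477, cost=12007),
    SimpleNamespace(surface="菓子", wcost=3841, cost=19459),
]

checked = check_cost_fields(fake_nodes)
print(checked)
```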
For the BOS/EOS nodes, returning those by default would break existing code that expects them to be removed and is not OK. We can put the functionality behind a parameter that is off by default.
> though there should be tests for it, even trivial ones. I can add them if you're not sure how to.
I'm not sure how to add tests, so it would be great if you could demonstrate that :)
> For the BOS/EOS nodes, returning those by default would break existing code that expects them to be removed and is not OK. We can put the functionality behind a parameter that is off by default.
I just added a strip parameter to parseToNodeList() and nbestToNodeList(). It is True by default, meaning BOS/EOS nodes are stripped unless the caller passes strip=False.
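The backward-compatible behaviour described here can be sketched without MeCab. The snippet below is an illustration only, not the PR's implementation: `FakeNode` and `to_node_list` are hypothetical stand-ins, and the status codes follow the `MECAB_BOS_NODE` and `MECAB_EOS_NODE` values from `mecab.h`. BOS/EOS surfaces are left empty here, which is an assumption.

```python
from dataclasses import dataclass

# Node status codes as defined in mecab.h.
NOR_NODE, BOS_NODE, EOS_NODE = 0, 2, 3


@dataclass
class FakeNode:
    """Hypothetical stand-in for a MeCab node."""
    surface: str
    stat: int = NOR_NODE


def to_node_list(nodes, strip=True):
    """Drop BOS/EOS nodes unless the caller opts in with strip=False."""
    if strip:
        return [n for n in nodes if n.stat not in (BOS_NODE, EOS_NODE)]
    return list(nodes)


raw = [
    FakeNode("", BOS_NODE),
    FakeNode("麩"),
    FakeNode("菓子"),
    FakeNode("", EOS_NODE),
]

# Default call: same result existing callers already rely on.
default_surfaces = [n.surface for n in to_node_list(raw)]
# Opt-in call: boundary nodes are kept.
all_stats = [n.stat for n in to_node_list(raw, strip=False)]
print(default_surfaces, all_stats)
```

Because the default keeps the old behaviour, code that never passes `strip` sees no change, and only callers that explicitly ask for the boundary nodes receive them.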
Apologies for taking so long to get to this, but I've added some tests. I am still not sure about the BOS/EOS thing, especially giving them surfaces, so I'll think about it a little more.