fugashi: Support accessing more fields in the MeCab node

This PR may resolve https://github.com/polm/fugashi/issues/76.

Here's an example:

from fugashi import Tagger

tagger = Tagger()
text = "麩菓子は、麩を主材料とした日本の菓子。"
node_list = tagger.parseToNodeList(text)

# Print each node's surface, wcost, and cost as tab-separated columns.
print("surface\twcost\tcost")
for node in node_list:
    print(f"{node.surface}\t{node.wcost}\t{node.cost}")

Output:

surface	wcost	cost
麩	5477	12007
菓子	3841	19459
は	-904	19947
、	-7508	16317
麩	5727	26362
を	-1331	24371
主材	5114	33181
料	5720	38627
と	-286	41321
し	2517	45155
た	3045	46329
日本	-2903	47649
の	-294	48738
菓子	3841	55480
。	-3217	55216
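
As a rough illustration of how the two new fields relate: in MeCab, `wcost` is the cost of the word itself and `cost` is the best accumulated cost from BOS up to that node, so the difference between consecutive `cost` values, minus `wcost`, should be the connection cost between the two nodes. The sketch below works only on the first rows of the output above and does not call fugashi. The helper name `implied_connection_costs` and the assumption that the BOS node has cost 0 are illustrative, not part of this PR.

```python
# First rows of the output above: (surface, wcost, cost).
rows = [
    ("麩", 5477, 12007),
    ("菓子", 3841, 19459),
    ("は", -904, 19947),
    ("、", -7508, 16317),
]


def implied_connection_costs(rows, bos_cost=0):
    """Derive the connection cost into each node from wcost and cost.

    Assumes cost(node) = cost(previous) + connection cost + wcost(node),
    with the BOS node starting at bos_cost (assumed to be 0 here).
    """
    previous_cost = bos_cost
    result = []
    for surface, wcost, cost in rows:
        result.append((surface, cost - previous_cost - wcost))
        previous_cost = cost
    return result


for surface, connection in implied_connection_costs(rows):
    print(f"{surface}\t{connection}")
```

The same loop could take `(node.surface, node.wcost, node.cost)` tuples built from a real node list, provided BOS is accounted for as the starting point.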

Sep 22 '23 15:09 sophiefy

I just modified parseToNodeList() and nbestToNodeList() to keep the BOS/EOS nodes in the node list, because I think they could be useful for visualizing and analyzing the results.

Sep 23 '23 06:09 sophiefy

Thank you for the PR. The code to add the fields looks fine, though there should be tests for it, even trivial ones. I can add them if you're not sure how to.

For the BOS/EOS nodes, returning those by default would break existing code that expects them to be removed and is not OK. We can put the functionality behind a parameter that is off by default.

Sep 23 '23 11:09 polm

> though there should be tests for it, even trivial ones. I can add them if you're not sure how to.

I'm not sure how to add tests, so it would be great if you could demonstrate that :)

> For the BOS/EOS nodes, returning those by default would break existing code that expects them to be removed and is not OK. We can put the functionality behind a parameter that is off by default.

I just added a strip parameter to parseToNodeList() and nbestToNodeList(). It is True by default, which means the BOS/EOS nodes are stripped, so existing behavior is unchanged.
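
The idea of a backward-compatible `strip` flag can be sketched without fugashi at all. The code below is not the implementation in this PR: `FakeNode` and `to_node_list` are hypothetical stand-ins, and only the status codes are taken from MeCab's `mecab.h`.

```python
from dataclasses import dataclass

# MeCab node status codes, as defined in mecab.h.
MECAB_NOR_NODE = 0
MECAB_UNK_NODE = 1
MECAB_BOS_NODE = 2
MECAB_EOS_NODE = 3


@dataclass
class FakeNode:
    """Hypothetical stand-in for a MeCab node; only the fields used here."""
    surface: str
    stat: int


def to_node_list(nodes, strip=True):
    """Return the nodes as a list, dropping BOS/EOS unless strip is False."""
    if not strip:
        return list(nodes)
    return [n for n in nodes if n.stat not in (MECAB_BOS_NODE, MECAB_EOS_NODE)]


lattice = [
    FakeNode("", MECAB_BOS_NODE),
    FakeNode("麩", MECAB_NOR_NODE),
    FakeNode("菓子", MECAB_NOR_NODE),
    FakeNode("", MECAB_EOS_NODE),
]

print(len(to_node_list(lattice)))               # default: BOS/EOS removed
print(len(to_node_list(lattice, strip=False)))  # opt in to keeping BOS/EOS
```

Because the default keeps the old behavior, callers that never pass `strip` see the same list as before, which is the compatibility concern raised above.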

Sep 23 '23 12:09 sophiefy

Apologies for taking so long to get to this, but I've added some tests. I am still not sure about the BOS/EOS thing, especially giving them surfaces, so I'll think about it a little more.

Apr 15 '24 13:04 polm
