
Python wrapper: surface text garbled in first call to parseToNode

Open GoogleCodeExporter opened this issue 9 years ago • 3 comments

What steps will reproduce the problem?

    $ python
    Python 2.7.3 (default, Aug  1 2012, 05:14:39)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> result = ""
    >>> import MeCab
    >>> t = MeCab.Tagger()
    >>> n = t.parseToNode("結晶系は正方晶系。")
    >>> result = ""
    >>> while n is not None:
    ...     result += n.surface
    ...     n = n.next
    ...
    >>> assert result == "結晶系は正方晶系。", repr(result)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError: '\x01rf\xff\xff\xff\xff\xff\xff\xff'
    >>>

What is the expected output? What do you see instead?

    The assertion should succeed (no exception thrown).

What version of the product are you using? On what operating system?

    MeCab version 0.996 on Ubuntu Precise.

Please provide any additional information below.

    On my machine the above code always reproduces the problem,
    but other code structures, such as assigning the text to a
    variable before parsing or moving the test code into a function
    definition, cause the test to run correctly.

    This bug only affects the initial call to a tagger and only if
    the call is parseToNode. The following incantation is a reliable
    workaround:

    >>> t = MeCab.Tagger()
    >>> t.parse("")

    The tagger can then be used as normal.
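
A minimal sketch of the workaround wrapped in a helper, in case it is useful to anyone hitting this. The name `make_primed_tagger` and the `tagger_factory` parameter are made up for illustration and are not part of the MeCab API; the factory is passed in only so the helper can be tried without MeCab installed.

```python
def make_primed_tagger(tagger_factory, *args):
    """Create a tagger and make one throwaway parse call before returning it.

    tagger_factory is expected to be MeCab.Tagger. Taking it as a parameter
    is an assumption of this sketch, not something MeCab requires.
    """
    tagger = tagger_factory(*args)
    tagger.parse("")  # throwaway first call, as in the workaround above
    return tagger


# Intended use (requires the MeCab Python bindings):
#     import MeCab
#     t = make_primed_tagger(MeCab.Tagger)
#     n = t.parseToNode("結晶系は正方晶系。")
```

This only hides the symptom described above; it does not change anything inside MeCab.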


Original issue reported on code.google.com by [email protected] on 18 Mar 2013 at 1:03

GoogleCodeExporter, Mar 14 '15 15:03

I've had a look at the source, and I think I've tracked this down to a memory 
bug in mecab itself.

LatticeImpl::set_sentence uses has_request_type() to determine whether it 
should allocate new memory for the sentence or just reuse the memory passed as 
its `sentence` argument. However, the various TaggerImpl::parse* methods all 
call lattice->set_sentence *before* they properly set the request type in the 
lattice (via TaggerImpl::initRequestType()). This means that on each call to a 
tagger parse method the lattice uses the previous call's request type. On the 
first call to a tagger parse method the lattice uses whatever its request_type_ 
is initialised to.

The end result is that when calling the tagger parse methods sometimes the 
lattice incorrectly reuses the memory it has been passed instead of allocating 
new memory. The python wrapper or python runtime may subsequently reallocate 
that memory for other uses and it may get overwritten with new data. Then the 
nodes returned by parseToNode no longer point to the surface text of the 
sentence.

The fix should be to call set_sentence after the request type has been set. 
I've attached a patch against the 0.996 source download for mecab. It fixes the 
behaviour in this bug report.
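
To illustrate the ordering problem and the proposed fix, here is a toy model in Python. It is not MeCab's code: `ToyLattice`, `ToyTagger`, and the `COPY_SENTENCE` flag are hypothetical stand-ins, and "reusing the caller's memory" is simulated with a `memoryview` over a `bytearray`.

```python
COPY_SENTENCE = 1  # hypothetical flag meaning "make a private copy of the input"


class ToyLattice:
    def __init__(self):
        self.request_type = 0  # whatever the lattice happens to start with
        self.sentence = None

    def set_sentence(self, buf):
        # Copy or borrow, depending on the request type set *right now*.
        if self.request_type & COPY_SENTENCE:
            self.sentence = bytes(buf)       # private copy
        else:
            self.sentence = memoryview(buf)  # borrows the caller's memory


class ToyTagger:
    def __init__(self):
        self.lattice = ToyLattice()
        self.request_type = COPY_SENTENCE  # what the tagger actually wants

    def init_request_type(self):
        self.lattice.request_type = self.request_type

    def parse_buggy(self, buf):
        self.lattice.set_sentence(buf)  # sees the stale request type
        self.init_request_type()
        return self.lattice.sentence

    def parse_fixed(self, buf):
        self.init_request_type()
        self.lattice.set_sentence(buf)  # sees the intended request type
        return self.lattice.sentence


def surface_after_overwrite(parse):
    """Parse a buffer, overwrite the buffer, then read the stored sentence."""
    buf = bytearray(b"abc")
    sentence = parse(buf)
    buf[0:3] = b"xyz"  # the caller's memory gets reused for something else
    return bytes(sentence)
```

In this model the first `parse_buggy` call on a fresh tagger returns the overwritten bytes, while a second call on the same tagger, or any `parse_fixed` call, keeps the original text. That matches the observation that only the initial call is affected and that a throwaway `parse("")` works around it.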

Original comment by [email protected] on 19 Mar 2013 at 3:46

GoogleCodeExporter, Mar 14 '15 15:03

This change causes another issue:

If initRequestType() is called before set_sentence(), then set_sentence() resets lattice->theta_ to its default value, because it calls clear().
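
A toy Python sketch of that side effect, assuming initRequestType() is the step that sets theta on the lattice and that clear() restores defaults. `ToyLattice`, `theta_seen_by_parser`, and the `DEFAULT_THETA` value are placeholders for illustration, not taken from the MeCab source.

```python
DEFAULT_THETA = 0.75  # placeholder default, not MeCab's actual value


class ToyLattice:
    def __init__(self):
        self.theta = DEFAULT_THETA
        self.sentence = None

    def clear(self):
        # Per-call state goes back to defaults, including theta.
        self.theta = DEFAULT_THETA
        self.sentence = None

    def set_sentence(self, sentence):
        self.clear()
        self.sentence = sentence


def theta_seen_by_parser(set_sentence_first, theta=0.5):
    """Return the theta left on the lattice for each call ordering."""
    lattice = ToyLattice()
    if set_sentence_first:
        lattice.set_sentence("abc")
        lattice.theta = theta        # stands in for initRequestType()
    else:
        lattice.theta = theta        # stands in for initRequestType()
        lattice.set_sentence("abc")  # clear() discards the theta just set
    return lattice.theta
```

With the original ordering the configured theta survives; with the reordered calls the lattice ends up back at the default.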

Note this issue was fixed by https://github.com/taku910/mecab/pull/24 in 2016.

polm, Nov 30 '21 06:11