parglare Unicode handling

parglare version: parglare (0.4.1)
Python version: Python 2.7.14 (default, Jan 5 2018, 10:41:29) [GCC 7.2.1 20171224] on linux2
Operating System: ARCH Linux

Description

Non-ASCII rules are not supported under Python 2.7. In Python 3 they work fine. Simplest example:

# coding: utf-8
from parglare import Grammar
from parglare import Parser

grammar = Grammar.from_file("names.pg")
parser = Parser(grammar)
inp = 'МИША МЫЛ РАМУ'
print(inp)
result = parser.parse(inp)
print(result)

grammar:

LINE: FIO|SYMBOL;
FIO: /'МИША'|'САША'/;
SYMBOL: /\w+/;

What I Did

Result in Python 2.7:

python ./names.py 
МИША МЫЛ РАМУ
Traceback (most recent call last):
  File "./names.py", line 9, in <module>
    result = parser.parse(inp)
  File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 206, in parse
    position)
  File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 480, in _skipws
    while position < in_len and input_str[position] in self.ws:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Mar 15 '18 08:03 survivorm

Hi, you must use unicode strings for Python 2.x. The easiest way is to import unicode_literals from __future__.

Here is a fully working test:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
from parglare import Grammar
from parglare import Parser


def test_grammar_with_unicode():
    this_folder = os.path.dirname(__file__)
    grammar = Grammar.from_file(os.path.join(this_folder, "names.pg"))
    parser = Parser(grammar)
    inp = 'МИША МЫЛ РАМУ'
    result = parser.parse(inp)
    assert result

You also have an error in the grammar. The correct version is:

LINE: FIO|SYMBOL;
FIO: /МИША|САША/;
SYMBOL: /\w+/;

Notice that there are no quotes around МИША and САША in the FIO rule regex.
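To see why the quotes matter, here is a minimal sketch using Python's `re` module directly. It assumes the body between the slashes is treated as an ordinary Python regular expression, so quote characters inside it are literal characters that must appear in the input. The variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

# Pattern as written in the original grammar: the quotes are part of the regex.
quoted = re.compile(r"'МИША'|'САША'")
# Pattern from the corrected grammar: no quotes.
unquoted = re.compile(r"МИША|САША")

inp = 'МИША МЫЛ РАМУ'

# The quoted pattern needs a literal ' at the current position, so it fails.
quoted_match = quoted.match(inp)
# The unquoted pattern matches the first word.
unquoted_match = unquoted.match(inp)

print(quoted_match)
print(unquoted_match.group(0))
```

The quoted pattern would only match input that itself contains the quote characters, for example the text `'МИША'` including the apostrophes.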

Mar 16 '18 11:03 igordejanovic

Thank you for your comment. Still, this behaviour with regular strings is hardly expected. Is it explained somewhere in the docs? Also, thanks for the grammar correction. That was just an example from an argument about whether there ARE unicode Tomita parsers in Python. Later, though, it turned out that the person meant not Tomita parsers in general, but the Yandex implementation specifically.

Mar 19 '18 12:03 survivorm

Actually, it is the usual behaviour. In Python 3, text written in quotes is interpreted as a unicode string, while in Python 2.x it is an array of bytes. In 2.x, if you want a unicode string you have to write u'...'. That is another way to make it work in Python 2.x. parglare is fully unicode, meaning that it parses unicode strings or utf-8 encoded text files. Actually, if you define your own recognizers, parglare can parse anything. :)

So, the issue you were dealing with is more related to the change in treating unicode strings in Python 2.x vs 3.x. It is not specific to parglare.
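The difference can be reproduced without parglare at all. The sketch below is written to run on Python 3 and simulates the Python 2 situation by encoding the text to UTF-8 bytes explicitly, which is roughly what an unprefixed literal holds in a UTF-8 source file on 2.x. Decoding those bytes as ASCII fails on the first byte, as in the traceback from the original report. Variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

text = 'МИША МЫЛ РАМУ'        # unicode text: 13 characters
raw = text.encode('utf-8')    # the same text as UTF-8 bytes

print(len(text))  # characters
print(len(raw))   # bytes: each Cyrillic letter takes two bytes in UTF-8

# Mixing byte strings with unicode on Python 2 triggers an implicit
# ASCII decode; doing it explicitly shows the same failure.
decode_error = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    decode_error = exc

print(decode_error)

# Decoding with the right codec recovers the original text.
assert raw.decode('utf-8') == text
```

On Python 2.x, either `from __future__ import unicode_literals` or the `u'...'` prefix ensures the parser receives text like `text` above rather than bytes like `raw`.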

Mar 19 '18 16:03 igordejanovic

And yes, you are right, that should probably be documented. I think it is not at the moment.

Mar 19 '18 16:03 igordejanovic

I am not using unicode, but I have a similar issue with things working in Python 3 and not in Python 2. In my case, when I try to declare a custom action in the grammar_actions.py file like this:

def open_bracket(context, node):
    context.extra.nest_bracket()
    return node
action('open_brace')(open_bracket)

I get this error when running with python2:

  File "parglare/parglare/common.py", line 140, in decorator
    name = f.__name__
AttributeError: 'str' object has no attribute '__name__'

unless I put this at the top of my grammar_actions file:

from __future__ import unicode_literals

I think it's due to this at the top of parglare/parglare/common.py:

if sys.version < '3':
    text = unicode  # NOQA
else:
    text = str

Definitely confusing to a new user, especially since I am not trying to do anything with unicode.
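For illustration only, here is a guess at how a decorator of this kind can produce that error. This is NOT parglare's actual code: the `action` function below is a hypothetical reconstruction, written for Python 3, where `bytes` stands in for a Python 2 plain `str`. The assumption is that the decorator uses an `isinstance(..., text)` check to tell a name apart from a function, so a non-text name falls through to the "it's a function" branch.

```python
text = str  # stand-in for the alias in common.py (unicode on Python 2)


def action(name_or_f):
    """Hypothetical decorator usable as @action or @action('name')."""
    def decorator(f):
        # If a text name was given, use it; otherwise use the function's name.
        name = name_or_f if isinstance(name_or_f, text) else f.__name__
        f.action_name = name
        return f

    if isinstance(name_or_f, text):
        # Called with a name: return the real decorator.
        return decorator
    # Anything else is assumed to be the function being decorated.
    return decorator(name_or_f)


def open_bracket(context, node):
    return node


# A text name works as expected.
registered = action('open_brace')(open_bracket)
print(registered.action_name)

# A byte-string name is mistaken for a function and fails on __name__.
try:
    action(b'open_brace')
    raised = False
except AttributeError as exc:
    raised = True
    print(exc)
```

If the real decorator works along these lines, a plain `'open_brace'` on Python 2 without `unicode_literals` is a byte string, fails the text check, and hits `f.__name__`, which would match the traceback above.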

Jul 26 '18 04:07 jwcraftsman

Yup, this case seems odd. Could you please make a PR with a failing test so I can investigate?

Jul 26 '18 09:07 igordejanovic

Will do.

Jul 26 '18 13:07 jwcraftsman
