parglare
Unicode handling
- parglare version: parglare (0.4.1)
- Python version: Python 2.7.14 (default, Jan 5 2018, 10:41:29) [GCC 7.2.1 20171224] on linux2
- Operating System: ARCH Linux
Description
Non-ASCII rules are not supported under Python 2.7. In Python 3 they work fine. Simplest example:
# coding: utf-8
from parglare import Grammar
from parglare import Parser
grammar = Grammar.from_file("names.pg")
parser = Parser(grammar)
inp = 'МИША МЫЛ РАМУ'
print(inp)
result = parser.parse(inp)
print(result)
grammar:
LINE: FIO|SYMBOL;
FIO: /'МИША'|'САША'/;
SYMBOL: /\w+/;
What I Did
result in python 2.7
python ./names.py
МИША МЫЛ РАМУ
Traceback (most recent call last):
  File "./names.py", line 9, in <module>
    result = parser.parse(inp)
  File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 206, in parse
    position)
  File "/usr/lib/python2.7/site-packages/parglare/parser.py", line 480, in _skipws
    while position < in_len and input_str[position] in self.ws:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
Hi,
You must use unicode strings in Python 2.x. The easiest way is to import unicode_literals from __future__.
Here is a fully working test:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
from parglare import Grammar
from parglare import Parser
def test_grammar_with_unicode():
    this_folder = os.path.dirname(__file__)
    grammar = Grammar.from_file(os.path.join(this_folder, "names.pg"))
    parser = Parser(grammar)
    inp = 'МИША МЫЛ РАМУ'
    result = parser.parse(inp)
    assert result
You also have an error in the grammar. The correct version is:
LINE: FIO|SYMBOL;
FIO: /МИША|САША/;
SYMBOL: /\w+/;
Notice that there are no quotes around МИША and САША in the FIO rule regex.
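To illustrate why the quotes matter, here is a minimal sketch using Python's re module directly. It assumes that parglare hands the text between the slashes to re more or less as-is, which I have not verified against the parglare source. Inside a regex a quote is an ordinary character, so the quoted pattern only matches input that literally contains the quotes.

```python
# -*- coding: utf-8 -*-
import re

# Pattern as written in the original grammar: the quotes are literal characters.
quoted = re.compile(r"'МИША'|'САША'")
# Pattern from the corrected grammar: plain alternation of the two names.
unquoted = re.compile(r"МИША|САША")

inp = "МИША МЫЛ РАМУ"

quoted_match = quoted.match(inp)      # input has no quote characters
unquoted_match = unquoted.match(inp)  # matches the first word

print(quoted_match)             # None
print(unquoted_match.group(0))  # МИША
```

The sketch is written for Python 3. On Python 2 the same literals would need the u prefix or unicode_literals.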
Thank you for your comment. But still, the behaviour with regular strings is hardly what one would expect. Is it explained somewhere in the docs? Also, thanks for the grammar correction. That was just an example from an argument about whether there ARE unicode tomita-parsers in Python. Though it was later said that the person meant not the regular tomita-parser, but specifically its Yandex implementation.
Actually, this is the usual behaviour. In Python 3, text in quotes is interpreted as a unicode string, while in Python 2.x it is an array of bytes. In 2.x, if you want a unicode string you have to write u'...'. That is another way to make it work in Python 2.x. parglare is fully unicode, meaning that it parses unicode strings or utf-8 encoded textual files. Actually, if you define your own recognizers, parglare can parse anything. :)
So, the issue you were dealing with is more related to the change in treating unicode strings in Python 2.x vs 3.x. It is not specific to parglare.
And yes, you are right, that should probably be documented. I think it is not at the moment.
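For readers who hit the same error, here is a small sketch of the bytes versus text distinction. It is written for Python 3, where the two types are separate, and the helper name to_text is hypothetical (it is not part of parglare). It shows that the input from the report starts with byte 0xd0 when UTF-8 encoded, which is the byte named in the traceback, and that decoding gives back a four-character text string.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding it if it is a byte string.

    Hypothetical helper for illustration; not a parglare function.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# What a plain 'МИША' literal holds on Python 2: the UTF-8 encoded bytes.
raw = "МИША".encode("utf-8")

print(raw[:1])            # b'\xd0'
print(len(raw))           # 8 (two bytes per Cyrillic letter)
print(len(to_text(raw)))  # 4 (four characters)
```

Passing the decoded value to the parser, rather than the raw bytes, is equivalent in effect to writing u'...' or importing unicode_literals.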
I am not using unicode, but I have a similar issue with things working in Python 3 and not in Python 2. In my case, when I try to declare a custom action in the grammar_actions.py file like this:
def open_bracket(context, node):
    context.extra.nest_bracket()
    return node
action('open_brace')(open_bracket)
I get this error when running with python2:
  File "parglare/parglare/common.py", line 140, in decorator
    name = f.__name__
AttributeError: 'str' object has no attribute '__name__'
unless I put this at the top of my grammar_actions file:
from __future__ import unicode_literals
I think it's due to this at the top of parglare/parglare/common.py:
if sys.version < '3':
    text = unicode # NOQA
else:
    text = str
Definitely confusing to a new user, especially since I am not trying to do anything with unicode.
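Here is a sketch of how such a text alias could produce this error. This is a hypothetical reconstruction, not parglare's actual code: make_action, registry and the isinstance check are my assumptions about the general shape of a decorator that accepts either a name or a function. To simulate Python 2 on Python 3, a bytes literal stands in for a plain Python 2 'open_brace' string.

```python
# Hypothetical reconstruction for illustration, NOT parglare's actual source.

def make_action(registry, text_type):
    """Build an `action` decorator that treats `text_type` instances as names."""
    def action(name_or_f):
        if isinstance(name_or_f, text_type):
            # Used as action('name')(func): return the real decorator.
            def decorator(f):
                registry[name_or_f] = f
                return f
            return decorator
        # Otherwise the argument is assumed to be the function itself.
        registry[name_or_f.__name__] = name_or_f
        return name_or_f
    return action


def open_bracket(context, node):
    return node


registry = {}
action = make_action(registry, text_type=str)

# A text name passes the check and the function is registered under it.
action('open_brace')(open_bracket)

# A byte-string name (what a plain literal is on Python 2) fails the check
# and is mistaken for the function, so reading __name__ raises.
try:
    action(b'open_brace')(open_bracket)
    error = None
except AttributeError as exc:
    error = exc

print(error)
```

If the real decorator works along these lines, importing unicode_literals helps simply because the name then passes the text check.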
Yup, this case seems odd. Could you please make a PR with failing test so I could investigate?
Will do.