astroid
astroid copied to clipboard
Unicode characters count as three columns instead of one
Steps to reproduce
- Change the line https://github.com/PyCQA/astroid/blob/492a4bfde15fdd4e14de7a93a160a9fafa6ecf46/tests/unittest_nodes_lineno.py#L963-L963 to e.g.
(replacing the firstf"Hello Worl’: {42.1234:02d}" #@dwith the unicode character’). - Run
$ pytest tests/unittest_nodes_lineno.py
Current behavior
The test TestLinenoColOffset.test_end_lineno_string fails as follows:
================================================ FAILURES =================================================
_______________________________ TestLinenoColOffset.test_end_lineno_string ________________________________
@staticmethod
def test_end_lineno_string() -> None:
"""FormattedValue, JoinedStr."""
code = textwrap.dedent(
"""
f"Hello Worl’: {42.1234:02d}" #@
f"Hello: {name=}" #@
"""
).strip()
ast_nodes = builder.extract_node(code)
assert isinstance(ast_nodes, list) and len(ast_nodes) == 2
s1 = ast_nodes[0]
assert isinstance(s1, nodes.JoinedStr)
assert isinstance(s1.values[0], nodes.Const)
assert (s1.lineno, s1.col_offset) == (1, 0)
> assert (s1.end_lineno, s1.end_col_offset) == (1, 29)
E assert (1, 31) == (1, 29)
E At index 1 diff: 31 != 29
E Use -v to get more diff
tests/unittest_nodes_lineno.py:974: AssertionError
========================================= short test summary info =========================================
Expected behavior
That the unit test passes.
python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output
2.13.0-dev0, on 492a4bfde15fdd4e14de7a93a160a9fafa6ecf46 with Python 3.10.5.
Thank you for opening the issue. This could be related to the python ast directly, but I did not investigate in detail.
I did a little digging after being triggered by https://github.com/PyCQA/pylint/issues/8072#issuecomment-1385692465. Astroid uses ast.parse from the stdlib, and the returned ast.expr instances use byte offsets, as described in the Python docs (emphasis mine):
[...] the col_offset and end_col_offset are the corresponding UTF-8 byte offsets of the first and last tokens that generated the node.
So that rules out any bug in ast.parse of the stdlib, and explains the behaviour in the opening post.
But I'm not sure what Astroid / Pylint expects, byte offsets or character offsets. As an end-user of Pylint, I would expect character offsets.
Thank you for the investigation, it might be a bug in pylint/astroid after all, we do need char offset as you say.
Hi everyone, Any news about this bug?
Hi @Turiok , no one worked on it yet, all contributions are welcome :)
I don't really understand the issue? If you change the character to a character to a character with a different byte offset the offset changes. ast uses bytes so astroid also does so. I don't think there is much to do for us here? This seems like a limitation in the stdlib.
Hi,
I thought pylint reported the column of the error in number of characters, not its size.
We report the column of the error as it is given to us from the stdlib of CPython/PyPy. We don't do any location determination ourselves.