astroid icon indicating copy to clipboard operation
astroid copied to clipboard

Unicode characters count as three columns instead of one

Open teake opened this issue 3 years ago • 8 comments

Steps to reproduce

  1. Change the line https://github.com/PyCQA/astroid/blob/492a4bfde15fdd4e14de7a93a160a9fafa6ecf46/tests/unittest_nodes_lineno.py#L963-L963 to e.g.
            f"Hello Worl’: {42.1234:02d}"  #@
    
    (replacing the first d with the unicode character ).
  2. Run $ pytest tests/unittest_nodes_lineno.py

Current behavior

The test TestLinenoColOffset.test_end_lineno_string fails as follows:

================================================ FAILURES =================================================
_______________________________ TestLinenoColOffset.test_end_lineno_string ________________________________

    @staticmethod
    def test_end_lineno_string() -> None:
        """FormattedValue, JoinedStr."""
        code = textwrap.dedent(
            """
        f"Hello Worl’: {42.1234:02d}"  #@
        f"Hello: {name=}"  #@
        """
        ).strip()
        ast_nodes = builder.extract_node(code)
        assert isinstance(ast_nodes, list) and len(ast_nodes) == 2
    
        s1 = ast_nodes[0]
        assert isinstance(s1, nodes.JoinedStr)
        assert isinstance(s1.values[0], nodes.Const)
        assert (s1.lineno, s1.col_offset) == (1, 0)
>       assert (s1.end_lineno, s1.end_col_offset) == (1, 29)
E       assert (1, 31) == (1, 29)
E         At index 1 diff: 31 != 29
E         Use -v to get more diff

tests/unittest_nodes_lineno.py:974: AssertionError
========================================= short test summary info =========================================

Expected behavior

That the unit test passes.

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.13.0-dev0, on 492a4bfde15fdd4e14de7a93a160a9fafa6ecf46 with Python 3.10.5.

teake avatar Aug 18 '22 09:08 teake

Thank you for opening the issue. This could be related to the python ast directly, but I did not investigate in detail.

Pierre-Sassoulas avatar Aug 18 '22 11:08 Pierre-Sassoulas

I did a little digging after being triggered by https://github.com/PyCQA/pylint/issues/8072#issuecomment-1385692465. Astroid uses ast.parse from the stdlib, and the returned ast.expr instances use byte offsets, as described in the Python docs (emphasis mine):

[...] the col_offset and end_col_offset are the corresponding UTF-8 byte offsets of the first and last tokens that generated the node.

So that rules out any bug in ast.parse of the stdlib, and explains the behaviour in the opening post.

But I'm not sure what Astroid / Pylint expects, byte offsets or character offsets. As an end-user of Pylint, I would expect character offsets.

teake avatar Jan 20 '23 14:01 teake

Thank you for the investigation, it might be a bug in pylint/astroid after all, we do need char offset as you say.

Pierre-Sassoulas avatar Jan 20 '23 14:01 Pierre-Sassoulas

Hi everyone, Any news about this bug?

Turiok avatar May 09 '23 12:05 Turiok

Hi @Turiok , no one worked on it yet, all contributions are welcome :)

Pierre-Sassoulas avatar May 09 '23 12:05 Pierre-Sassoulas

I don't really understand the issue? If you change the character to a character to a character with a different byte offset the offset changes. ast uses bytes so astroid also does so. I don't think there is much to do for us here? This seems like a limitation in the stdlib.

DanielNoord avatar May 10 '23 15:05 DanielNoord

Hi,

I thought pylint reported the column of the error in number of characters, not its size.

usinelogicielle avatar May 30 '23 13:05 usinelogicielle

We report the column of the error as it is given to us from the stdlib of CPython/PyPy. We don't do any location determination ourselves.

DanielNoord avatar May 30 '23 13:05 DanielNoord