
Pydocstyle crashes on literal strings in pyproject.toml

Open codejedi365 opened this issue 2 years ago • 0 comments

Problem

When an unrelated tool's configuration (e.g. semantic_release) in pyproject.toml contains a multi-line literal string denoted by ''', such as a regular expression, pydocstyle throws a toml.decoder.TomlDecodeError for an unterminated string. This likely does not happen with every literal string, but it does cause errors when there is a single quote inside the regexp.

My offending config:

# pyproject.toml

[tool.semantic_release]
version_pattern = [
    # regular expression to find version value in `_version.py` file
    '''src/pkg1/_version.py:__version__[ ]*[:=][ ]*["'](\d+\.\d+\.\d+)["']'''
]

[tool.pydocstyle]
convention = 'pep257'

Log

(venv) $ pydocstyle scripts/prepare.py

Traceback (most recent call last):
  File "/workspaces/py-rpm/venv/bin/pydocstyle", line 8, in <module>
    sys.exit(main())
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/cli.py", line 75, in main
    sys.exit(run_pydocstyle())
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/cli.py", line 41, in run_pydocstyle
    for (
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 288, in get_files_to_check
    config = self._get_config(os.path.abspath(name))
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 369, in _get_config
    config = self._get_config_by_discovery(node)
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 318, in _get_config_by_discovery
    config = self._get_config(parent_dir)
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 369, in _get_config
    config = self._get_config_by_discovery(node)
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 312, in _get_config_by_discovery
    config_file = self._get_config_file_in_folder(path)
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 555, in _get_config_file_in_folder
    if config.read(full_path) and cls._get_section_name(config):
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 70, in read
    self._config.update(toml.load(fp))
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/toml/decoder.py", line 156, in load
    return loads(f.read(), _dict, decoder)
  File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/toml/decoder.py", line 362, in loads
    raise TomlDecodeError("Unterminated string found."
toml.decoder.TomlDecodeError: Unterminated string found. Reached end of file. (line 121 column 1 char 2619)

Investigation

This seems to be a limitation of the parser implementation and of the TOML standard version it targets. I looked at the dependency tree of semantic_release and found that it uses the tomlkit library instead of toml, because tomlkit supports v1.0.0 of the TOML standard rather than v0.5.0. Under the hood, there seem to be a few flaws in the toml parser (which targets TOML v0.5.0), since I can vary the regular expression and get different but not obvious or expected results. One such oddity: inside the triple single quotes ''', two double quotes " anywhere in the string cause an Unterminated string error, but a single one is fine. Another variation that shouldn't work but does is escaping the double quotes (i.e. \").

I also found that the toml library itself is stale and has not received any updates since Oct 2020, whereas tomlkit and its competitor tomli both received updates in the first half of 2022. Furthermore, the Python 3.11 docs also highlight these two frontrunners as the ideal libraries to read/write TOML. Maybe a year or so from now you could use the Python 3.11 built-in library tomllib, but relying on it alone would clearly rule out older Python versions for a few years.
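One way to bridge the gap until tomllib can be assumed everywhere is a conditional import. The following is only a minimal sketch, not pydocstyle's actual code: `read_tool_section` is a hypothetical helper, and the sketch assumes tomli is installed on interpreters older than 3.11.

```python
try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli is installed as a backport


def read_tool_section(path, tool_name):
    """Return the [tool.<tool_name>] table of a TOML file, or {} if absent.

    Hypothetical helper, for illustration only.
    """
    # Both tomllib.load() and tomli.load() expect a binary file object.
    with open(path, "rb") as fp:
        data = tomllib.load(fp)
    return data.get("tool", {}).get(tool_name, {})
```

Calling `read_tool_section("pyproject.toml", "pydocstyle")` would then return the pydocstyle table as a plain dict, regardless of which of the two parsers was imported.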

Additional discussion on TOML support for raw/literal strings: https://github.com/toml-lang/toml/issues/80

Recommendation

Switch toml dependency to tomlkit or tomli.

I have tested both tomli==2.0.1 and tomlkit==0.10.2, and both parse my pyproject.toml configuration file (as provided above), regex included, correctly and without error. tomlkit does seem to be leading in popularity, but the tomli documentation is a bit better. Also of note, tomli.load() requires the file to be opened for reading in binary mode rather than in text mode with a specified encoding.
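For anyone who wants to repeat the check, here is a small sketch that parses the configuration above from a string and confirms the regex survives intact. It uses tomllib, falling back to tomli on older Pythons (assumed to be installed). Splitting the entry on the first ":" into a path and a pattern is my own assumption about how the value is meant to be read, and the sample `__version__` line is made up for the demonstration.

```python
import re

try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli is installed on older Pythons

# The configuration from this issue, embedded as a raw Python string.
PYPROJECT = r"""
[tool.semantic_release]
version_pattern = [
    # regular expression to find version value in `_version.py` file
    '''src/pkg1/_version.py:__version__[ ]*[:=][ ]*["'](\d+\.\d+\.\d+)["']'''
]

[tool.pydocstyle]
convention = 'pep257'
"""

config = tomllib.loads(PYPROJECT)

# The literal string comes back unchanged, quotes and backslashes included.
entry = config["tool"]["semantic_release"]["version_pattern"][0]
path, _, version_regex = entry.partition(":")  # assumed "path:pattern" layout

# Made-up sample line, used only to show the regex still works after parsing.
match = re.search(version_regex, '__version__ = "1.2.3"')
print(path)
print(match.group(1))
print(config["tool"]["pydocstyle"])
```

When reading from disk instead, the same parsers expect the file to be opened with `open(path, "rb")` and passed to `load()`.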

Related: #599

codejedi365, Jul 01 '22 22:07