python-docs-pl

Use pospell

Open m-aciek opened this issue 10 months ago • 11 comments

https://pypi.org/project/pospell/

There are Polish dictionaries available for hunspell (which pospell uses); we could leverage them to improve the quality of the translation. It would require some configuration (an extra custom dictionary and skipping code blocks). We could look at the other languages' setups.

% pospell --language pl tutorial/*.po
…
tutorial/stdlib2.po:701:heappop
tutorial/stdlib2.po:778:wywnioskowując
tutorial/stdlib2.po:778:Decimal
tutorial/stdlib2.po:791:modulo
tutorial/venv.po:35:Pythonowe
tutorial/venv.po:146:bash
tutorial/venv.po:187:deaktywować
tutorial/venv.po:199:pragramu
tutorial/venv.po:210:podkomend
tutorial/venv.po:210:install
tutorial/venv.po:210:uninstall
tutorial/venv.po:210:freeze
tutorial/venv.po:239:podajac
tutorial/whatnow.po:43:tutorial
tutorial/whatnow.po:77:Szegółowe
tutorial/whatnow.po:100:Cheese
tutorial/whatnow.po:111:Cookbook
tutorial/whatnow.po:111:Wydawnicto
tutorial/whatnow.po:111:Reilly
tutorial/whatnow.po:130:Scientific
tutorial/whatnow.po:172:Cheese
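
To decide which reported words belong in a custom dictionary (names such as Cheese or heappop) and which are real typos (pragramu), it can help to tally the reports. This is only a minimal sketch: it assumes each report has the `path:line:word` shape shown above, and `count_flagged_words` is a hypothetical helper, not part of pospell.

```python
from collections import Counter


def count_flagged_words(pospell_output: str) -> Counter:
    """Count how often each word appears in `path:line:word` report lines."""
    counts = Counter()
    for line in pospell_output.splitlines():
        parts = line.strip().split(":", 2)
        # Skip the shell prompt, the ellipsis, and anything else that is
        # not a three-part report with a numeric line number.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        counts[parts[2]] += 1
    return counts


# A few lines copied from the report above.
sample = """\
tutorial/venv.po:199:pragramu
tutorial/whatnow.po:100:Cheese
tutorial/whatnow.po:111:Cookbook
tutorial/whatnow.po:172:Cheese
"""

for word, hits in count_flagged_words(sample).most_common():
    print(hits, word)
```

Words that recur across many files are likely candidates for a shared dictionary; one-off reports are more likely to be typos worth fixing in the translation.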

m-aciek, Apr 30 '25 12:04

It looks good, though it may be annoying with Polonized words like Pythonowe and words like heappop. I will look into the other repos.

StanFromIreland, Apr 30 '25 20:04

python-docs-es has a nice solution: a Python script that merges, at runtime, a base dictionary (with words common to all docs) and a per-doc dictionary. This reduces duplication if you want one dictionary file per doc, and it avoids a single huge dictionary file.
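
The merge idea could look roughly like the sketch below. It is not the python-docs-es script: the function names and file names are made up for illustration, and it assumes each dictionary is a plain-text file with one word per line.

```python
import tempfile
from pathlib import Path


def read_words(path: Path) -> set[str]:
    """Return the non-empty lines of a word list; a missing file is empty."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def merge_dictionaries(base: Path, per_doc: Path, merged: Path) -> list[str]:
    """Write the union of both word lists to `merged`, one word per line."""
    words = sorted(read_words(base) | read_words(per_doc))
    merged.write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


# Demo with throwaway files; the names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp, "base.txt")
    per_doc = Path(tmp, "venv.txt")
    merged = Path(tmp, "merged.txt")
    base.write_text("heappop\nPythonowe\n", encoding="utf-8")
    per_doc.write_text("bash\nfreeze\nheappop\n", encoding="utf-8")
    print(merge_dictionaries(base, per_doc, merged))
```

The merged file would then be handed to the spell checker for that one document, so shared words live only in the base file.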

rffontenelle, May 19 '25 16:05

Bigger issue: pospell crashes on a code block. How do we exclude them?

<rst-doc>:7: (ERROR/3) Unexpected indentation. while parsing: class Parrot:
    def __init__(self):
        self._voltage = 100000
    @property
    def voltage(self):
        """Uzyska aktualne napięcie.""
        return self._voltage
Traceback (most recent call last):
<rst-doc>:3: (ERROR/3) Unexpected indentation. while parsing: # punkt to dwukrotka (x, y)
match point:
    case (0, 0):
        print("Początek")
    case (0, y):
        print(f"Y={y}")
    case (x, 0):
        print(f"X={x}")
    case (x, y):
        print(f"X={x}, Y={y}")
    case _:
        raise ValueError("Nie punkt")
  File "/opt/hostedtoolcache/Python/3.13.3/x64/bin/pospell", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/opt/hostedtoolcache/Python/3.13.3/x64/lib/python3.13/site-packages/pospell.py", line 480, in main
    errors = spell_check(
        args.po_file,
    ...<4 lines>...
        args.jobs,
    )
  File "/opt/hostedtoolcache/Python/3.13.3/x64/lib/python3.13/site-packages/pospell.py", line 384, in spell_check
    errors = flatten(
        pool.map(
    ...<2 lines>...
        )
    )
  File "/opt/hostedtoolcache/Python/3.13.3/x64/lib/python3.13/site-packages/pospell.py", line 342, in flatten
    return [element for a_list in list_of_lists for element in a_list]
                                                               ^^^^^^
TypeError: 'int' object is not iterable
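
The final TypeError can be reproduced in isolation. The sketch below copies only the list comprehension visible in the traceback; the assumption (not confirmed by the log) is that one of the per-file results was an integer rather than a list of errors.

```python
def flatten(list_of_lists):
    # Same comprehension as the one shown in the traceback.
    return [element for a_list in list_of_lists for element in a_list]


# Every item is a list, so flattening works.
print(flatten([["heappop"], ["bash", "freeze"]]))

# Hypothetical failure case: one item is an int instead of a list.
try:
    flatten([["heappop"], 1])
except TypeError as exc:
    print(exc)
```

So the rst parsing errors and the crash are probably two separate symptoms: the parser complains about the code block, and something then yields a non-list result that `flatten` cannot iterate over.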

StanFromIreland, May 29 '25 09:05

I believe Sphinx should be adding a code-block flag to msgids made from code blocks in the gettext builder. Then pospell should let us exclude those msgids from checking.
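
If such a flag existed, the filtering step could be sketched as follows. Everything here is an assumption: the flag name `code-block` is hypothetical (Sphinx does not emit it today), the sample entries are invented, and the parser is deliberately naive, relying on PO entries being separated by blank lines and flags living on `#,` comment lines.

```python
def split_entries(po_text: str) -> list[str]:
    """Split PO text into entries, assuming blank lines separate them."""
    return [block for block in po_text.strip().split("\n\n") if block.strip()]


def has_flag(entry: str, flag: str) -> bool:
    """Return True if any `#,` flag line of the entry lists `flag`."""
    for line in entry.splitlines():
        if line.startswith("#,"):
            flags = [item.strip() for item in line[2:].split(",")]
            if flag in flags:
                return True
    return False


def entries_to_check(po_text: str, skip_flag: str = "code-block") -> list[str]:
    """Keep only the entries that do not carry the (hypothetical) flag."""
    return [e for e in split_entries(po_text) if not has_flag(e, skip_flag)]


# Invented sample: the second entry carries the hypothetical flag.
sample_po = '''\
#: tutorial/example.rst:10
msgid "Get the current voltage."
msgstr "Uzyskaj aktualne napięcie."

#: tutorial/example.rst:12
#, fuzzy, code-block
msgid "print(voltage)"
msgstr "print(voltage)"
'''

for entry in entries_to_check(sample_po):
    print(entry)
```

Only the prose entry survives, so the spell checker would never see the code.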

m-aciek, May 29 '25 09:05

> Bigger issue, pospell crash on a codeblock, how do we exclude them?

In python-docs-pt-br, when I was getting tons of sphinx-lint errors because literal blocks were being extracted, my workaround was the following: 1) run make gettext with literal blocks disabled, to generate POT files without them; 2) run 'sphinx-intl update' to update the PO files from the newly generated POT files; 3) run pospell; 4) discard the changes to the PO files (or simply don't commit them).
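
The four steps can be written down as a dry run. This sketch only builds and prints the command lines; it does not execute anything. The exact invocations are assumptions rather than the python-docs-pt-br setup: overriding `gettext_additional_targets` with `-D` is one guess at how literal blocks get disabled, the source and output directories are placeholders, and `git restore .` would discard every uncommitted change, not only PO files.

```python
import shlex


def workaround_commands(
    language: str, pot_dir: str, po_files: list[str]
) -> list[list[str]]:
    """Return the four workaround steps as argv lists (nothing is run)."""
    return [
        # 1) Generate POT files without literal blocks (assumed override).
        ["sphinx-build", "-b", "gettext",
         "-D", "gettext_additional_targets=index", ".", pot_dir],
        # 2) Refresh the PO files from the new POT files.
        ["sphinx-intl", "update", "-p", pot_dir, "-l", language],
        # 3) Spell-check the refreshed PO files.
        ["pospell", "--language", language, *po_files],
        # 4) Throw the PO changes away again (discards ALL local changes).
        ["git", "restore", "."],
    ]


# Placeholder arguments for illustration only.
steps = workaround_commands("pl", "build/gettext", ["tutorial/venv.po"])
for argv in steps:
    print(shlex.join(argv))
```

Keeping the steps as data makes it easy to review them before wiring them into CI with something like `subprocess.run`.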

rffontenelle, Jun 12 '25 18:06

> I believe Sphinx should be adding code-block flag to msgids made from code blocks in gettext builder.

@m-aciek Do you know whether there is an issue filed for this in Sphinx?

rffontenelle, Jun 23 '25 14:06

> > I believe Sphinx should be adding code-block flag to msgids made from code blocks in gettext builder.
>
> @m-aciek Do you know whether there is an issue filed for this in Sphinx?

There isn't one yet, as far as I know.

m-aciek, Jun 23 '25 14:06

I could not find anything either. pospell has recently been improved to display multi-line msgs better, though there is still a lot to be done.

StanFromIreland, Jun 23 '25 14:06

Christmas came early this year: Gettext is implementing a custom sticky flag, which could be used to say "this is a code-block". See https://lists.gnu.org/archive/html/bug-gettext/2025-06/msg00018.html

I have already asked Transifex to support it. Should it be reported to pybabel, Sphinx, or both, so that they add support for this new feature?

rffontenelle, Jun 30 '25 17:06

Thanks for sharing. Hm, so the motivation for introducing a second prefix is to increase reliability in tooling. I think it's definitely worth sharing with the projects! Sphinx writes PO files directly and uses gettext, as far as I know, so it probably doesn't need changes to support it. Pybabel parses files to update them, so it probably needs to be changed.

By the way, current tooling should also handle our custom flag correctly, even without this new syntax. Edit: hm, or not?

m-aciek, Jun 30 '25 22:06

> Do you know whether there is an issue filed for this in Sphinx?

For reference: https://github.com/sphinx-doc/sphinx/issues/13722

m-aciek, Jul 10 '25 02:07