Stub file(s) contain null bytes
Describe the bug
At least one of the .pyi files in rdkit-stubs appears to have some null bytes, which is verboten in Python world.
To Reproduce
Create a minimal environment:
$ micromamba create --name rdkit-stub-bug "python=3.11" rdkit -c conda-forge
Run a type-checker on something that imports from rdkit.Chem.Draw:
$ cat script.py && mypy -v script.py
from rdkit.Chem.Draw import rdMolDraw2D
LOG: Could not load plugins snapshot: @plugins_snapshot.json
LOG: Mypy Version: 1.8.0
LOG: Config File: Default
LOG: Configured Executable: /Users/mattthompson/micromamba/envs/rdkit-stub-bug/bin/python
LOG: Current Executable: /Users/mattthompson/micromamba/envs/rdkit-stub-bug/bin/python
LOG: Cache Dir: .mypy_cache
LOG: Compiled: True
LOG: Exclude: []
LOG: Found source: BuildSource(path='script.py', module='script', has_text=False, base_dir='/Users/mattthompson/tmp', followed=False)
LOG: Could not load cache for script: script.meta.json
LOG: Metadata not found for script
LOG: Parsing script.py (script)
LOG: Could not load cache for rdkit.Chem.Draw.rdMolDraw2D: rdkit/Chem/Draw/rdMolDraw2D.meta.json
LOG: Metadata not found for rdkit.Chem.Draw.rdMolDraw2D
LOG: Parsing /Users/mattthompson/micromamba/envs/rdkit-stub-bug/lib/python3.12/site-packages/rdkit-stubs/Chem/Draw/rdMolDraw2D.pyi (rdkit.Chem.Draw.rdMolDraw2D)
LOG: Could not load cache for rdkit.Chem.Draw: rdkit/Chem/Draw/__init__.meta.json
LOG: Metadata not found for rdkit.Chem.Draw
LOG: Parsing /Users/mattthompson/micromamba/envs/rdkit-stub-bug/lib/python3.12/site-packages/rdkit-stubs/Chem/Draw/__init__.pyi (rdkit.Chem.Draw)
LOG: Bailing due to parse errors
LOG: Build finished in 0.020 seconds with 2 modules, and 1 errors
/Users/mattthompson/micromamba/envs/rdkit-stub-bug/lib/python3.12/site-packages/rdkit-stubs/Chem/Draw/__init__.pyi: error: source code string cannot contain null bytes [syntax]
Found 1 error in 1 file (errors prevented further checking)
I couldn't find this particular error in the issue tracker - and if the stubs are a recent addition to builds, maybe I'm the first person to observe this?
Doing a little bit of digging, it seems that mypy isn't really confused; rather, the stub file contains null bytes that Python itself refuses to parse. Following the lead of this super helpful comment, I can reproduce this just by attempting to parse this particular .pyi file (I didn't check any others):
>>> import ast
>>> ast.parse(open("/Users/mattthompson/micromamba/envs/rdkit-stub-bug/lib/python3.12/site-packages/rdkit-stubs/Chem/Draw/__init__.pyi").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mattthompson/micromamba/envs/rdkit-stub-bug/lib/python3.12/ast.py", line 52, in parse
    return compile(source, filename, mode, flags,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: source code string cannot contain null bytes
I have seen similar behavior on 3.9, 3.10, and 3.11.
I've uploaded that particular file here (or at least uploaded what got on my clipboard): https://gist.github.com/mattwthompson/525e19973603073704f8c09307c889ac
The offending bit seems to be in the MolsMatrixToGridImage docstring, which looks funny in VS Code:
This one-liner seems to find a similar number of null bytes (6 pieces after splitting on '\x00' means 5 of them):
>>> len(open("/Users/mattthompson/micromamba/envs/rdkit-stub-bug/lib/python3.12/site-packages/rdkit-stubs/Chem/Draw/__init__.pyi").read().split('\x00'))
6
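To see where the null bytes sit rather than only how many there are, a small sketch like the one below works on the raw bytes of the file and reports the 1-based line numbers that contain `\x00`. The helper name `null_byte_lines` is made up for illustration; it is not part of RDKit or mypy.

```python
def null_byte_lines(data: bytes) -> dict[int, int]:
    """Return {line_number: nul_count} for every line of `data` containing NUL bytes.

    Hypothetical helper for illustration. It takes bytes on purpose, so that
    no text decoding step sits between the file and the check.
    """
    hits = {}
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        count = line.count(b"\x00")
        if count:
            hits[lineno] = count
    return hits
```

It could be fed the stub from this report with `null_byte_lines(pathlib.Path(stub_path).read_bytes())`, where `stub_path` is the `Chem/Draw/__init__.pyi` path shown above.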
Here's where my digging ends for now - I see a lot of automation around building these stubs.
Expected behavior
Having stubs distributed upstream is super helpful for downstream developers, so thanks for that. (I think we're just blanket ignoring everything RDKit in type-checking.) The expected behavior here is more or less that mypy or a similar type-checker can parse the stub files, so I can run it on my codebase.
Screenshots
One embedded above
Configuration (please complete the following information):
- RDKit version: rdkit 2023.09.6 py312h59770eb_0 conda-forge
- OS: macOS something or other (with an ARM chip)
- Python version (if relevant): 3.12, but also seen with 3.9, 3.10, and 3.11
- Are you using conda? Yes (for some values of conda)
- If you are using conda, which channel did you install the rdkit from? conda-forge
Additional context
I didn't look through other stub files for null bytes; I'm just reporting the first one my type-checker choked on.
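Checking the remaining stubs could be done with a short script along these lines. This is only a sketch: the function name `stubs_with_null_bytes` is hypothetical, and the directory to scan is whatever `rdkit-stubs` location your environment uses.

```python
from pathlib import Path


def stubs_with_null_bytes(root):
    """Map each .pyi file under `root` that contains NUL bytes to its NUL count.

    Hypothetical helper for illustration. Files are read as bytes, so the
    result does not depend on any text encoding.
    """
    hits = {}
    for stub in sorted(Path(root).rglob("*.pyi")):
        count = stub.read_bytes().count(b"\x00")
        if count:
            hits[stub] = count
    return hits


# Example use, with the path left for the reader to fill in:
# for stub, count in stubs_with_null_bytes(stubs_dir).items():
#     print(f"{stub}: {count} null byte(s)")
```

An empty result would suggest that no stub under that directory has the problem; any entries point at files a type-checker is likely to reject in the same way.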
@mattwthompson Thanks for reporting this.
What happens is that the \x## tokens in the MolsMatrixToGridImage docstring within Chem/Draw/__init__.py
# Prints a binary string: b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x03\x84'
are converted by pybind11-stubgen into actual binary codes:
# Prints a binary string: b'<U+0089>PNG
^Z
^@^@^@
IHDR^@^@^C<U+0084>'
which clearly should not happen. I'll fix that in my pre-processing workflow so that the \x## tokens are escaped before being fed into pybind11-stubgen.
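For illustration only, the kind of escaping described here might look like the sketch below. It is not the actual pre-processing workflow, and `escape_hex_tokens` is a made-up name; the assumption is simply that doubling the backslash in front of each `\x##` token keeps it as literal text when the docstring is later interpreted as a Python string.

```python
import re

# A backslash that is not itself preceded by a backslash, followed by "x" and
# two hex digits, i.e. an unescaped \x## token in the docstring text.
_HEX_TOKEN = re.compile(r"(?<!\\)\\x[0-9A-Fa-f]{2}")


def escape_hex_tokens(text: str) -> str:
    """Double the backslash of every unescaped \\x## token in `text`.

    Hypothetical helper for illustration. Already-escaped tokens are left
    alone, so applying the function twice gives the same result as once.
    """
    return _HEX_TOKEN.sub(lambda m: "\\" + m.group(0), text)
```

With this, the text `b'\x89PNG\x00'` becomes `b'\\x89PNG\\x00'`, which a Python string literal evaluates back to the visible `\x89` and `\x00` tokens instead of raw control bytes.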