Replace the blib2to3 tokenizer with pytokens
Description
Replaces black's tokenizer with a from-scratch rewrite done by me. We could vendor the code into black itself, but my recommendation would be either pinning it or keeping it as a regular dependency; that way the tokenizer can be used by multiple tools for perfect compatibility.
Resolves #4520
Resolves #970
Resolves #3700
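For context, a tokenizer turns source text into a stream of (type, string, start, end, line) tokens that the parser then consumes. As a reference point only, here is what CPython's stdlib tokenize module produces for a small snippet; pytokens' own API may differ, so treat the entry point as an assumption rather than the PR's actual wiring:

    import io
    import tokenize

    # Reference illustration of a Python token stream using the stdlib
    # tokenizer; blib2to3 and pytokens produce an equivalent stream,
    # though their exact APIs differ from this one.
    src = "x = {1: 'a'}\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)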
Tests passing so far: 381/381 (!)
@JelleZijlstra with this, the test suite is fully passing. Primer is failing (mostly just because some file in hypothesis failed to parse); I'll be working on that, but the code should be good for a first review.
I don't think we currently have any tests for this, but I linked the above two issues here because they are the same bug in the parser where \r characters cause issues. The most minimal reproduction is {\r}, and since this is a parser rewrite, hopefully it can be solved. Note that this is currently only observable by directly calling internal methods, due to how black reads input.
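For reference, CPython itself accepts the bare carriage return in that reproduction, since its tokenizer translates a lone \r to a newline before parsing; a quick sanity check of the behaviour the replacement tokenizer has to match (this is stdlib-only, not black or pytokens code):

    import ast

    # CPython translates a bare \r to a newline before tokenizing, so "{\r}"
    # parses as an empty dict with a line break inside the braces. This is
    # the reference behaviour the replacement tokenizer needs to reproduce.
    tree = ast.parse("{\r}")
    print(ast.dump(tree))  # Module(body=[Expr(value=Dict(keys=[], values=[]))], ...)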
Thanks for linking this, I'll make sure these parse identically to how CPython does it.
Okay, primer is fixed, and all tests are green.
Might this fuzzer failure indicate a bug?
Falsifying example: test_idempotent_any_syntatically_valid_python(
    src_contents='\n#\r#',
    mode=Mode(target_versions=set(),
              line_length=88,
              string_normalization=False,
              is_pyi=False,
              is_ipynb=False,
              skip_source_first_line=False,
              magic_trailing_comma=False,
              python_cell_magics=set(),
              preview=False,
              unstable=False,
              enabled_features=set()),  # or any other generated value
)
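For anyone who wants to reproduce this outside Hypothesis, the property the fuzz test checks is roughly the one below; black.format_str and black.Mode are public APIs, and the Mode arguments only approximate the falsifying example's settings:

    import black

    # Rough standalone version of the idempotence property the fuzzer checks:
    # running the formatter a second time must not change the output.
    src = "\n#\r#"
    mode = black.Mode(string_normalization=False, magic_trailing_comma=False)
    once = black.format_str(src, mode=mode)
    twice = black.format_str(once, mode=mode)
    assert once == twice, "formatting is not idempotent for this input"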
Might this fuzzer failure indicate a bug?
Yeah, but I'm pretty sure it is a bug in CPython. For now we can work around it in the tokenizer, though. I'll add a flag in pytokens to fix it.
I found another issue that this should close: https://github.com/psf/black/issues/2318 Edit: or, if not close, at least one that should be mentioned here.
@MeGaGiGaGon no, that's about the parser, not the tokenizer. It would still be good long term to switch to a different parser, because the current parser only supports Python's recent grammar changes through a series of hacks.
This seems awesome! Could be worth doing a quick benchmark with a mypyc build, though (I guess we could potentially mypyc pytokens if there is a perf difference). I think to do that you have to set these env vars: https://github.com/psf/black/blob/main/pyproject.toml#L181 (but honestly I'm not too sure how hatch-mypyc is set up).
@JelleZijlstra edge cases have been taken care of.
@hauntsaninja I just tested mypyc on pytokens locally, and it got 2x faster at tokenizing the black/profiling folder.
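For the record, a crude harness along the lines below is enough for that kind of comparison. The directory path is whatever corpus you want to time, and the stdlib tokenizer is shown only as a stand-in; the pytokens (or mypyc-compiled pytokens) entry point would need to be substituted in its place:

    import io
    import time
    import tokenize
    from pathlib import Path

    # Crude throughput comparison: tokenize every .py file under a directory
    # and report the elapsed wall-clock time.
    def time_tokenize(path: str) -> float:
        start = time.perf_counter()
        for file in Path(path).rglob("*.py"):
            src = file.read_text(encoding="utf-8", errors="replace")
            try:
                for _ in tokenize.generate_tokens(io.StringIO(src).readline):
                    pass
            except (tokenize.TokenizeError, SyntaxError):
                pass  # skip files the tokenizer rejects
        return time.perf_counter() - start

    print(f"{time_tokenize('profiling'):.3f}s")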
Resolves #4588 too
@JelleZijlstra can you confirm if the PR is good to review, or are changes needed?
@JelleZijlstra I don't have merge rights just yet :P
Sorry, I had to resolve a merge conflict and then wait for CI.