black Replace the blib2to3 tokenizer with pytokens

Description

Replaces black's tokenizer with a from-scratch rewrite done by me. We could vendor the code into black itself, but either pinning it or keeping it as-is would be my recommendation, the tokenizer can be used by multiple tools for perfect compatibility.

Resolves #4520 Resolves #970 Resolves #3700

Tests passing so far: 381/381 (!)

Dec 22 '24 14:12 tusharsadhwani

@JelleZijlstra with this, the test suite is fully passing. Primer is failing (mostly just because some file in hypothesis failed to parse), I'll be working on that, but the code should be good for a first review.

Dec 23 '24 23:12 tusharsadhwani

I don't think we currently have any tests for this, but I just linked the above two issues here because they are the same bug in the parser where \rs cause issues. The most minimal reproduction is {\r}, and since this is a parser rewrite hopefully it can be solved. Note that this is currently only observable by directly calling internal methods due to how black reads input.

Dec 24 '24 06:12 MeGaGiGaGon

Thanks for linking this, I'll make sure these parse identically to how CPython does it.

Dec 24 '24 10:12 tusharsadhwani

Okay, primer is fixed, and all tests are green.

Dec 25 '24 07:12 tusharsadhwani

Might this fuzzer failure indicate a bug?

Falsifying example: test_idempotent_any_syntatically_valid_python(
    src_contents='\n#\r#',
    mode=Mode(target_versions=set(),
     line_length=88,
     string_normalization=False,
     is_pyi=False,
     is_ipynb=False,
     skip_source_first_line=False,
     magic_trailing_comma=False,
     python_cell_magics=set(),
     preview=False,
     unstable=False,
     enabled_features=set()),  # or any other generated value
)

Jan 11 '25 04:01 JelleZijlstra

Might this fuzzer failure indicate a bug?

Yeah, but I'm pretty sure it is a bug in CPython. For now we can work it out in the tokenizer though. I'll add a flag in pytokens to fix it.

Jan 11 '25 04:01 tusharsadhwani

I found another issue that this should close, https://github.com/psf/black/issues/2318 Edit: Or if not close, at least be mentioned in

Jan 25 '25 17:01 MeGaGiGaGon

@MeGaGiGaGon no, that's about the parser, not the tokenizer. It would still be good long term to switch to a different parser, because the current parser only supports Python's recent grammar changes through a series of hacks.

Jan 28 '25 01:01 JelleZijlstra

This seems awesome! Could be worth doing a quick benchmark though with a mypyc build (I guess we could potentially mypyc pytokens if there is a perf difference). I think to do that you have to set these env vars https://github.com/psf/black/blob/main/pyproject.toml#L181 (but honestly not too sure how hatch-mypyc is setup)

Jan 28 '25 02:01 hauntsaninja

@JelleZijlstra edge cases have been taken care of.

@hauntsaninja I just tested mypyc on pytokens locally, and it got 2x faster at tokenizing the black/profiling folder.

Feb 19 '25 16:02 tusharsadhwani

Resolves #4588 too

Feb 26 '25 13:02 cobaltt7

@JelleZijlstra can you confirm if the PR is good to review, or are changes needed?

Mar 15 '25 13:03 tusharsadhwani

@JelleZijlstra I don't have merge rights just yet :P

Mar 15 '25 19:03 tusharsadhwani

Sorry I had to resolve a merge conflict and then wait for CI

Mar 16 '25 00:03 JelleZijlstra

black black copied to clipboard

Replace the blib2to3 tokenizer with pytokens

Description

Tests passing so far: 381/381 (!)

black
black copied to clipboard