black icon indicating copy to clipboard operation
black copied to clipboard

Replace the blib2to3 tokenizer with pytokens

Open tusharsadhwani opened this issue 11 months ago • 10 comments

Description

Replaces black's tokenizer with a from-scratch rewrite done by me. We could vendor the code into black itself, but either pinning it or keeping it as-is would be my recommendation, the tokenizer can be used by multiple tools for perfect compatibility.

Resolves #4520 Resolves #970 Resolves #3700

Tests passing so far: 381/381 (!)

tusharsadhwani avatar Dec 22 '24 14:12 tusharsadhwani

@JelleZijlstra with this, the test suite is fully passing. Primer is failing (mostly just because some file in hypothesis failed to parse), I'll be working on that, but the code should be good for a first review.

tusharsadhwani avatar Dec 23 '24 23:12 tusharsadhwani

I don't think we currently have any tests for this, but I just linked the above two issues here because they are the same bug in the parser where \rs cause issues. The most minimal reproduction is {\r}, and since this is a parser rewrite hopefully it can be solved. Note that this is currently only observable by directly calling internal methods due to how black reads input.

MeGaGiGaGon avatar Dec 24 '24 06:12 MeGaGiGaGon

Thanks for linking this, I'll make sure these parse identically to how CPython does it.

tusharsadhwani avatar Dec 24 '24 10:12 tusharsadhwani

Okay, primer is fixed, and all tests are green.

tusharsadhwani avatar Dec 25 '24 07:12 tusharsadhwani

Might this fuzzer failure indicate a bug?

Falsifying example: test_idempotent_any_syntatically_valid_python(
    src_contents='\n#\r#',
    mode=Mode(target_versions=set(),
     line_length=88,
     string_normalization=False,
     is_pyi=False,
     is_ipynb=False,
     skip_source_first_line=False,
     magic_trailing_comma=False,
     python_cell_magics=set(),
     preview=False,
     unstable=False,
     enabled_features=set()),  # or any other generated value
)

JelleZijlstra avatar Jan 11 '25 04:01 JelleZijlstra

Might this fuzzer failure indicate a bug?

Yeah, but I'm pretty sure it is a bug in CPython. For now we can work it out in the tokenizer though. I'll add a flag in pytokens to fix it.

tusharsadhwani avatar Jan 11 '25 04:01 tusharsadhwani

I found another issue that this should close, https://github.com/psf/black/issues/2318 Edit: Or if not close, at least be mentioned in

MeGaGiGaGon avatar Jan 25 '25 17:01 MeGaGiGaGon

@MeGaGiGaGon no, that's about the parser, not the tokenizer. It would still be good long term to switch to a different parser, because the current parser only supports Python's recent grammar changes through a series of hacks.

JelleZijlstra avatar Jan 28 '25 01:01 JelleZijlstra

This seems awesome! Could be worth doing a quick benchmark though with a mypyc build (I guess we could potentially mypyc pytokens if there is a perf difference). I think to do that you have to set these env vars https://github.com/psf/black/blob/main/pyproject.toml#L181 (but honestly not too sure how hatch-mypyc is setup)

hauntsaninja avatar Jan 28 '25 02:01 hauntsaninja

@JelleZijlstra edge cases have been taken care of.

@hauntsaninja I just tested mypyc on pytokens locally, and it got 2x faster at tokenizing the black/profiling folder.

tusharsadhwani avatar Feb 19 '25 16:02 tusharsadhwani

Resolves #4588 too

cobaltt7 avatar Feb 26 '25 13:02 cobaltt7

@JelleZijlstra can you confirm if the PR is good to review, or are changes needed?

tusharsadhwani avatar Mar 15 '25 13:03 tusharsadhwani

@JelleZijlstra I don't have merge rights just yet :P

tusharsadhwani avatar Mar 15 '25 19:03 tusharsadhwani

Sorry I had to resolve a merge conflict and then wait for CI

JelleZijlstra avatar Mar 16 '25 00:03 JelleZijlstra