
increase batches per worker, ~15% faster

Open collinanderson opened this issue 3 years ago • 3 comments

Please describe how you installed Flake8

$ sudo apt install flake8  # Ubuntu 20.04.3. I might pip-install if this feature request gets released.

Please provide the exact, unmodified output of flake8 --bug-report

{
  "dependencies": [
    {
      "dependency": "entrypoints",
      "version": "0.3"
    }
  ],
  "platform": {
    "python_implementation": "CPython",
    "python_version": "3.8.10",
    "system": "Linux"
  },
  "plugins": [
    {
      "is_local": false,
      "plugin": "mccabe",
      "version": "0.6.1"
    },
    {
      "is_local": false,
      "plugin": "pycodestyle",
      "version": "2.5.0"
    },
    {
      "is_local": false,
      "plugin": "pyflakes",
      "version": "2.1.1"
    }
  ],
  "version": "3.7.9"
}

Please describe the problem or feature

Hi All. Thanks so much for maintaining flake8.

Edit: ignore the sorting/optimizing-chunks part. It seems to have a slightly negative effect on large code-bases; it may still be slightly helpful for smaller codebases, but it probably isn't worth the complexity.

I think flake8 should increase batches per worker and/or optimize chunks, possibly as settings.

I have 496 files in my project. Many of them are small Django migrations and __init__.py files, but there are a few really big ones (15482 tokens, ~1000 physical lines), which end up making one worker (out of 4) finish much later than the others, so the large chunks (2 per worker) aren't balanced.

A normal flake8 run on my project takes ~3.8 seconds (tested over 10 runs); if I edit the code to force the pool chunksize to 1 or 2, I get ~3.3 seconds, a ~13% improvement in my case.
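To see the effect in isolation, here's a small toy demo (not flake8 code) of how a large imap_unordered chunksize can strand the expensive items on one worker:

    # toy demonstration: a few expensive "files" plus a large chunksize
    # leave one worker busy long after the others have finished
    import multiprocessing
    import time

    def work(cost):
        time.sleep(cost)  # simulate checking one file
        return cost

    if __name__ == "__main__":
        costs = [0.2] * 4 + [0.01] * 96  # a few big files, many tiny ones
        with multiprocessing.Pool(4) as pool:
            for chunksize in (25, 1):  # ~1 batch per worker vs. 1 file per batch
                start = time.time()
                list(pool.imap_unordered(work, costs, chunksize=chunksize))
                print(f"chunksize={chunksize}: {time.time() - start:.2f}s")

With chunksize=25 the first chunk carries all four big files, so the run takes roughly as long as that one chunk; with chunksize=1 the work spreads out evenly.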

Another thing that could help would be to stat all the files and sort them by file size, biggest first, then deal them out to the workers to help avoid one worker getting too much work at the end. For me, when the files are cached in memory, it takes 2-4ms to stat and sort the files. That also makes a larger chunksize more worthwhile:

        # in run_parallel(self):
        import os

        chunksize = calculate_pool_chunksize(len(self.checkers), self.jobs)
        # sort files by file size, largest first, so the most expensive
        # files get handed out before the cheap ones
        self.checkers.sort(
            key=lambda checker: os.stat(checker.filename).st_size,
            reverse=True,
        )
        # deal the files out round-robin into one chunk per worker, flushing
        # a chunk into the final order whenever it reaches the pool chunksize
        new_sorted_checkers = []
        chunks = [[] for _ in range(self.jobs)]
        for n, checker in enumerate(self.checkers):
            worker = n % self.jobs
            chunks[worker].append(checker)
            if len(chunks[worker]) >= chunksize:
                new_sorted_checkers += chunks[worker]
                chunks[worker] = []
        # flush any partially filled chunks left at the end
        for chunk in chunks:
            new_sorted_checkers += chunk
        self.checkers = new_sorted_checkers

With the sorting and dealing alone, using flake8's default of 2 chunks per worker (a chunksize of 62 in my case), I get ~3.1 seconds (15% faster), which is a pretty good speedup. If I also force a chunksize of 5, I get down to about ~2.9 seconds (23% faster), so increasing batches-per-worker or having a chunksize option would still be nice.

Here's my results for the django 3.2 codebase (2696 checkers/files, 4 workers/jobs, default chunksize=337):

Default settings: 37 seconds - chunksize=337 (~2 batches per worker)

Increasing batches/chunks per worker/job, without stating/sorting/dealing:

- 32 seconds - chunksize=168 (~4 batches per worker)
- 30.5 seconds - chunksize=94 (~8 batches per worker)
- 30 seconds - chunksize=42 (~16 batches per worker)
- 30.6 seconds - chunksize=21 (~32 batches per worker)

With stating/sorting/dealing:

- 30.9 seconds - chunksize=337 (~2 batches per worker, the default)
- 30.1 seconds - chunksize=168 (~4 batches per worker)
- 29.9 seconds - chunksize=84 (~8 batches per worker)
- 29.74 seconds - chunksize=42 (~16 batches per worker)
- 29.9 seconds - chunksize=21 (~32 batches per worker)

(stating, sorting, and dealing alone takes ~14-19ms)

Update: results for the cpython 3.7 codebase (no sorting) (1856 checkers/files, 4 workers/jobs):

- 1m14.535s - 2 batches per worker (default)
- 1m5.341s - 4 batches per worker (12% faster)
- 1m2.735s - 8 batches per worker (~15.8% faster)
- 1m2.576s - 16 batches per worker (~16% faster)
- 1m2.541s - 32 batches per worker (~16% faster)
- 1m2.139s - 64 batches per worker (~17% faster)

At some point the benefits are within the margin of variance, so they get hard to measure, but somewhere around 16 batches per worker seems to be optimal in my case, and that alone, even without sorting by size, really seems to help.

Anyway I thought I'd post my findings in case anyone wants to experiment with these things themselves and possibly improve flake8.

If nothing else, changing the default "batches per worker" from 2 to 8 or 16 (in calculate_pool_chunksize) would be pretty simple, and I think it would really help keep work balanced (~15% faster).
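For reference, here is a minimal sketch of what a configurable version could look like, assuming calculate_pool_chunksize keeps its current shape (the batches_per_worker parameter is hypothetical, not existing flake8 API):

    # minimal sketch, assuming calculate_pool_chunksize keeps its current
    # shape; batches_per_worker is a hypothetical parameter, not flake8 API
    def calculate_pool_chunksize(num_checkers, num_jobs, batches_per_worker=2):
        # the current default aims to give each worker ~2 batches of work;
        # raising batches_per_worker to 8-16 yields smaller, better-balanced chunks
        return max(num_checkers // (num_jobs * batches_per_worker), 1)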

Thanks, Collin

P.S. I'm realizing sorting by most-recently modified could also be helpful for quickly showing errors on recently edited files.
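A minimal sketch of that variant, reusing the stat-and-sort line from my snippet above (untested):

    # hypothetical variant of the sort above: most recently modified first,
    # so errors in freshly edited files tend to be reported sooner
    self.checkers.sort(
        key=lambda checker: os.stat(checker.filename).st_mtime,
        reverse=True,
    )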

collinanderson avatar Oct 21 '21 18:10 collinanderson

interesting findings -- when I originally implemented the chunksize I tried a bunch of sizes and found 2 to be optimal

can you profile on different python versions and different varieties of files?

asottile avatar Oct 21 '21 19:10 asottile

oh, you're also using a really old version of flake8 so these measurements don't really help

asottile avatar Oct 21 '21 19:10 asottile

Here's flake8 4.0.1 on Python 3.9 checking the cpython main branch codebase:

- 64 batches per worker: 1m9.859s
- 32 batches per worker: 1m8.758s
- 16 batches per worker: 1m9.501s
- 8 batches per worker: 1m9.136s
- 4 batches per worker: 1m15.739s
- 2 batches per worker: 1m25.072s

(I only really did one or so run per test, but I do think that 2 and 4 seem to be slower than 8 or 16.)

Here's flake8 4.0.1 on Python 3.9 checking the https://github.com/edx/edx-platform main branch:

- 64 batches per worker: real 0m42.317s, user 2m36.586s, sys 0m0.647s (~20% faster)
- 32 batches per worker: real 0m41.620s, user 2m32.692s, sys 0m0.852s (~21% faster)
- 16 batches per worker: real 0m42.203s, user 2m32.071s, sys 0m0.839s (~20% faster)
- 8 batches per worker: real 0m45.624s, user 2m34.785s, sys 0m0.843s (~13.8% faster)
- 4 batches per worker: real 0m51.068s, user 2m31.258s, sys 0m0.891s (~3.5% faster)
- 2 batches per worker: real 0m52.958s, user 2m29.090s, sys 0m1.032s (baseline)

(I did at least two runs for edx, trying to really get the best time for 2 batches per worker as a baseline)

I think it's just hard to keep work balanced when there's only 2 batches per worker.

$ ../venv3.9/bin/flake8 --bug-report
{
  "dependencies": [],
  "platform": {
    "python_implementation": "CPython",
    "python_version": "3.9.5",
    "system": "Linux"
  },
  "plugins": [
    {
      "is_local": false,
      "plugin": "mccabe",
      "version": "0.6.1"
    },
    {
      "is_local": false,
      "plugin": "pycodestyle",
      "version": "2.8.0"
    },
    {
      "is_local": false,
      "plugin": "pyflakes",
      "version": "2.4.0"
    }
  ],
  "version": "4.0.1"
}

collinanderson avatar Oct 21 '21 20:10 collinanderson

I think this doesn't help any more after #1723 -- notably the original performance problem due to the chatty nature is mitigated since only small strings are being passed over the pipe now

I originally did profile your branch and found the results to be a little inconclusive, which is why I didn't merge it -- perhaps it depends on processor counts or manufacturers?

either way I think this is "fixed" now

asottile avatar Oct 27 '22 16:10 asottile

Yes, I see imap_unordered is now just using the default chunksize of 1 (one file per batch), which 100% fixes my issue of keeping workers balanced. As you say, as long as "only small strings are being passed over the pipe now", having such a tiny chunksize (one file per batch) shouldn't cause too much overhead.

https://github.com/PyCQA/flake8/blob/b89d81a919da660c1b324c307321fd18eec8fcae/src/flake8/checker.py#L201
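For context, Pool.imap_unordered defaults to chunksize=1 when no chunksize argument is passed, so the call at that line is roughly of this shape (paraphrased; the names here are placeholders, not the verbatim source):

    # paraphrased shape of the linked line (placeholder names, not verbatim):
    # omitting the chunksize argument means imap_unordered uses its default
    # of 1, i.e. one file per batch, which keeps the workers balanced
    pool.imap_unordered(run_one_file, filenames)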

Thanks!

collinanderson avatar Oct 28 '22 01:10 collinanderson