
Make tartufo multi-threaded

Open gforsythe-godaddy opened this issue 3 years ago • 1 comment

Feature Request

Add threading or concurrency to tartufo

Is your feature request related to a problem? Please describe.

Not a problem per se, but on a monorepo with a long history, running tartufo can take a LONG time. I happened to open htop and noticed that only a single CPU core was being used (essentially wasting the other 11 cores). I would guess that making tartufo multi-threaded would reduce not only scan time but also the amount of $$$ spent on CI/CD instances on AWS.

Describe the solution you'd like

Preferably, the thread count would either be discovered automatically (e.g. via /proc) or set via a command-line option. I have not done any Python coding in a long time, so I am not sure what is available or whether this is possible. I would assume there would be a parent thread which farms out chunks of commits to child threads, or something along those lines.
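For what it's worth, Python can discover the usable core count without reading /proc directly. A minimal sketch (the `available_cores` helper name is my own, not anything in tartufo):

```python
import os


def available_cores() -> int:
    """Return the number of CPU cores this process may actually use."""
    try:
        # On Linux, sched_getaffinity respects CPU affinity masks
        # (e.g. cgroups or taskset limits on CI runners), so it is more
        # accurate than a raw core count.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Not available on macOS/Windows; fall back to the logical count.
        return os.cpu_count() or 1


print(available_cores())
```

This value could serve as the default for a `--threads`/`--processes` CLI option, with the user able to override it.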

Describe alternatives you've considered

I suppose it might be possible to run separate instances of tartufo on different branches, but in my specific case I'm running on a bare mirror of a repo so branches are not available.

Teachability, Documentation, Adoption, Migration Strategy

Should be backwards compatible... it may be good to make multi-threading the default option so everyone benefits automatically.

gforsythe-godaddy avatar Feb 08 '22 02:02 gforsythe-godaddy

I think this would be a great feature. I ran tartufo on a local repo with 4k commits and it took about 112 seconds.

I don't necessarily think the threading module is the way to go: threads are still bound by the Python GIL, so we would not see any significant increase in the number of cores used for CPU-bound work. The multiprocessing module, however, seems to fit the bill quite well. I would like to see a CLI option to specify the number of cores to use, with the default being all of them, as @gforsythe-godaddy mentioned.

A global interpreter lock (GIL) is a mechanism used in computer-language interpreters to synchronize the execution of threads so that only one native thread (per process) can execute at a time.[1] An interpreter that uses GIL always allows exactly one thread to execute at a time, even if run on a multi-core processor.

I am interested in developing this feature. A multiprocessing.Queue could be used to handle IPC. It would probably be a pretty significant rewrite, but it could also be a significant performance increase.
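A rough sketch of what that Queue-based design could look like. Everything here is hypothetical: `scan_commit` is a stand-in for tartufo's real per-commit scan logic, and the commit ids are fake strings.

```python
import multiprocessing as mp


def scan_commit(commit_id: str) -> str:
    """Stand-in for the real per-commit scan; returns a fake finding."""
    return f"scanned {commit_id}"


def worker(tasks, results):
    """Pull commit ids off the task queue until a None sentinel arrives."""
    while True:
        commit_id = tasks.get()
        if commit_id is None:
            break
        results.put(scan_commit(commit_id))


if __name__ == "__main__":
    tasks = mp.Queue()
    results = mp.Queue()
    commits = [f"c{i:04d}" for i in range(100)]
    n_procs = 4

    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for c in commits:
        tasks.put(c)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker so each one exits

    # Drain the results queue before joining, so worker feeder threads
    # never block on a full pipe.
    findings = [results.get() for _ in commits]
    for p in procs:
        p.join()

    print(len(findings))  # 100
```

The parent process acts as the "farmer" from the original request; adding a chunk size (batching several commits per queue item) would cut down on IPC overhead.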

Testing

I wrote a small script to determine the legitimacy of these claims.

View the script:
import time
import tracemalloc
from multiprocessing import Process
from threading import Thread
from typing import Type


def do_work() -> None:
    """Simulates CPU intensive work."""
    x = [0]

    for i in range(10_000_000):
        j = x.pop()
        k = i + j - 100
        x.append(k)


def run_worker(worker: Type[Thread | Process]) -> None:
    """Creates 10 workers and runs the `do_work` function in each."""
    tracemalloc.start()
    start = time.perf_counter()
    workers: list[Thread | Process] = []

    for _ in range(10):
        w = worker(target=do_work)
        workers.append(w)
        w.start()

    for w in workers:
        w.join()

    end = time.perf_counter()
    duration = end - start
    current_mem, peak_mem = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"Method:            {worker.__name__}")
    print(f"Completion time:   {duration:.2f} seconds")
    print(f"Current mem usage: {current_mem} bytes")
    print(f"Peak mem usage:    {peak_mem} bytes")
    print()


if __name__ == "__main__":
    run_worker(Thread)
    run_worker(Process)
View the script output:
Method:            Thread
Completion time:   118.48 seconds
Current mem usage: 50535 bytes
Peak mem usage:    54471 bytes

Method:            Process
Completion time:   13.29 seconds
Current mem usage: 241608 bytes
Peak mem usage:    248048 bytes
Method:            Thread
Completion time:   115.34 seconds
Current mem usage: 50103 bytes
Peak mem usage:    54307 bytes

Method:            Process
Completion time:   12.63 seconds
Current mem usage: 242043 bytes
Peak mem usage:    248048 bytes
Method:            Thread
Completion time:   118.38 seconds
Current mem usage: 49847 bytes
Peak mem usage:    54307 bytes

Method:            Process
Completion time:   12.94 seconds
Current mem usage: 241555 bytes
Peak mem usage:    247560 bytes
Method:            Thread
Completion time:   118.65 seconds
Current mem usage: 49847 bytes
Peak mem usage:    54307 bytes

Method:            Process
Completion time:   12.35 seconds
Current mem usage: 242003 bytes
Peak mem usage:    248008 bytes
Method:            Thread
Completion time:   121.74 seconds
Current mem usage: 50535 bytes
Peak mem usage:    54471 bytes

Method:            Process
Completion time:   12.55 seconds
Current mem usage: 241555 bytes
Peak mem usage:    248048 bytes

As the output shows, the multiprocessing implementation is much faster: roughly 9x (about 13 seconds versus about 118 seconds per run). Threading, on the other hand, used roughly 5x less memory.

CPU usage with threading: [screenshot]

CPU usage with multiprocessing: [screenshot]
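As an aside, the stdlib also offers a higher-level alternative to managing Process objects by hand: concurrent.futures.ProcessPoolExecutor. A sketch using the same CPU-bound loop as the benchmark above, parameterized so it is picklable and reusable:

```python
import time
from concurrent.futures import ProcessPoolExecutor


def do_work(n: int) -> int:
    """Same CPU-bound loop as the benchmark above, parameterized."""
    x = 0
    for i in range(n):
        x = x + i - 100
    return x


if __name__ == "__main__":
    start = time.perf_counter()
    # max_workers defaults to os.cpu_count(); passing an explicit value
    # here mirrors the proposed CLI option.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(do_work, [10_000_000] * 10))
    print(f"{time.perf_counter() - start:.2f}s, {len(results)} results")
```

The executor handles worker lifecycle and result collection, which could shrink the rewrite compared with raw Process plus Queue plumbing; the trade-off is less control over the IPC pattern.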

I noticed the current implementation of ScannerBase uses a threading.Lock, but after searching the repo it doesn't look like any multithreading is actually in use, so the lock isn't really doing anything. A lock stops other threads from entering the block wrapped by the context manager, but here there are no other threads to compete for it.

$ find ./tartufo -type f -not -path "*.pyc" -exec grep -iHn "threading" {} \;
./tartufo/scanner.py:11:import threading
./tartufo/scanner.py:146:    _scan_lock: threading.Lock = threading.Lock()
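To illustrate the point: with only one thread, the lock is never contended, so acquiring it always succeeds immediately and the guarded block behaves exactly as it would without the lock. A tiny self-contained demonstration (not tartufo code):

```python
import threading

lock = threading.Lock()

# Single-threaded: acquire() returns immediately because no other
# thread can be holding the lock.
with lock:
    held = lock.locked()  # True while we hold it, but nobody is waiting

print(held)  # True
print(lock.locked())  # False: released on exiting the with-block
```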

jhall1-godaddy avatar Jun 02 '22 15:06 jhall1-godaddy