trufflehog
Use re-readable reader and common chunker
This PR has two parts: adding re-readable readers and adding a common chunking function.
Part 1: File handlers need to read a reader more than once. Some sources' readers are harder than others to re-read efficiently. For example, an HTTP response could be fully read and stored in memory, or two requests could be made, but both approaches have their pitfalls. Using disk-buffer-reader lets us use the same method everywhere: it allows a reader to be reset to the beginning by using a temporary file as a buffer.
Part 2: Each source that deals with files re-implements chunking. A common chunking function gives us consistency and is easier to use.
I'm curious about the implications this will have on chunking speed. I don't think it's a blocker, since the current bottleneck is in detection, but I'm curious.
Shouldn't we use the TeeReader instead of copying to disk? https://stackoverflow.com/a/39792097/11976023
@mcastorina In most cases nothing extra is being written to disk, so there is no performance hit.
@dustin-decker I was originally going to use TeeReader, but wanted something that could provide an interface that works like a standard reader. TeeReader writes to the writer as the first reader is read from. Since the first reader is not guaranteed to be fully read, the two readers need to be joined with something like io.MultiReader. That, combined with concerns about memory usage if a large file is tee'd, made me look for another option.