trufflehog Use re-readable reader and common chunker

Open bill-rich opened this issue 2 years ago • 0 comments

This PR has two parts: adding re-readable readers and adding a common chunking function.

Part 1: File handlers need to read a reader more than once. Some sources' readers are harder than others to re-read efficiently. For example, an HTTP response could be fully read and stored in memory, or the request could be made twice, but both approaches have their pitfalls. Using disk-buffer-reader lets us use the same method everywhere: it allows a reader to be reset to the beginning by using a tmp file as a buffer.
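
To make the idea in Part 1 concrete, here is a minimal sketch using only the Go standard library. It is not the disk-buffer-reader API: the `rereadable` type, `newRereadable`, and `readTwice` are hypothetical names, and this version spools the whole source into the tmp file up front, which may differ from how the actual library buffers.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// rereadable is a hypothetical type for illustration only. It copies a
// source reader into a temporary file so the data can be read repeatedly.
type rereadable struct {
	tmp *os.File
}

// newRereadable spools src into a tmp file and rewinds it to the start.
func newRereadable(src io.Reader) (*rereadable, error) {
	tmp, err := os.CreateTemp("", "rereadable-*")
	if err != nil {
		return nil, err
	}
	r := &rereadable{tmp: tmp}
	if _, err := io.Copy(tmp, src); err != nil {
		r.Close()
		return nil, err
	}
	if err := r.Reset(); err != nil {
		r.Close()
		return nil, err
	}
	return r, nil
}

// Read satisfies io.Reader by reading from the tmp file.
func (r *rereadable) Read(p []byte) (int, error) { return r.tmp.Read(p) }

// Reset rewinds to the beginning so the same bytes can be read again.
func (r *rereadable) Reset() error {
	_, err := r.tmp.Seek(0, io.SeekStart)
	return err
}

// Close closes the tmp file and removes it from disk.
func (r *rereadable) Close() error {
	closeErr := r.tmp.Close()
	if err := os.Remove(r.tmp.Name()); err != nil {
		return err
	}
	return closeErr
}

// readTwice reads the same content two times through one rereadable.
func readTwice(s string) (string, string, error) {
	r, err := newRereadable(strings.NewReader(s))
	if err != nil {
		return "", "", err
	}
	defer r.Close()
	first, err := io.ReadAll(r)
	if err != nil {
		return "", "", err
	}
	if err := r.Reset(); err != nil {
		return "", "", err
	}
	second, err := io.ReadAll(r)
	if err != nil {
		return "", "", err
	}
	return string(first), string(second), nil
}

func main() {
	first, second, err := readTwice("secret=abc123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(first == second)
}
```

Because the wrapper still satisfies `io.Reader`, a file handler could inspect the content once, call `Reset`, and hand the same value to the next consumer.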

Part 2: Each source that deals with files re-implements chunking. A common chunking function gives us consistency and is easier to use.
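
As a rough illustration of what a shared chunking helper could look like, the sketch below splits any `io.Reader` into fixed-size chunks. The names `forEachChunk` and `chunkStrings`, the callback-based signature, and the chunk size are assumptions made for this example, not the function added in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// forEachChunk is a hypothetical helper: it reads r in pieces of at most
// size bytes and passes each piece to fn. The slice passed to fn is reused
// between calls, so fn must copy it if it keeps the data.
func forEachChunk(r io.Reader, size int, fn func(chunk []byte) error) error {
	if size <= 0 {
		return errors.New("chunk size must be positive")
	}
	buf := make([]byte, size)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if cbErr := fn(buf[:n]); cbErr != nil {
				return cbErr
			}
		}
		switch {
		case err == nil:
			continue // full chunk read; keep going
		case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
			return nil // end of input, possibly after a short final chunk
		default:
			return err
		}
	}
}

// chunkStrings collects the chunks of s as strings (copying each chunk).
func chunkStrings(s string, size int) ([]string, error) {
	var out []string
	err := forEachChunk(strings.NewReader(s), size, func(c []byte) error {
		out = append(out, string(c))
		return nil
	})
	return out, err
}

func main() {
	chunks, err := chunkStrings("abcdefghij", 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With one helper like this, every file-based source would produce chunks with the same boundaries and the same error handling instead of re-implementing the read loop.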

Aug 09 '22 21:08 bill-rich

I'm curious about the implications this will have on the speed of chunking. I don't think it's a blocker, since the current bottleneck is in detection.

Aug 10 '22 16:08 mcastorina

Shouldn't we use the TeeReader instead of copying to disk? https://stackoverflow.com/a/39792097/11976023

Aug 10 '22 16:08 dustin-decker

@mcastorina In most cases nothing extra is being written to disk, so there is no performance hit.

@dustin-decker I was originally going to use TeeReader, but wanted something that could provide an interface that works like a standard reader. TeeReader writes to the writer as the first reader is read from. Since the first reader is not guaranteed to be fully read, the two readers need to be joined with something like io.MultiReader. That, combined with concerns about memory usage if a large file is tee'd, made me look for another option.
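
The TeeReader approach described above can be sketched with the standard library as follows. `peekThenReplay` and `peekString` are hypothetical names for this illustration. Note that the captured bytes live in an in-memory `bytes.Buffer`, which is where the memory concern for large files comes from.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// peekThenReplay reads up to n bytes from src through an io.TeeReader,
// then rebuilds the full stream by joining the captured bytes with the
// unread remainder of src using io.MultiReader.
func peekThenReplay(src io.Reader, n int) (string, string, error) {
	var captured bytes.Buffer
	tee := io.TeeReader(src, &captured)

	// Partial read: only the bytes read here are written to captured.
	head := make([]byte, n)
	k, err := io.ReadFull(tee, head)
	if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
		return "", "", err
	}

	// The tee was not fully read, so the captured prefix must be joined
	// with whatever is still left in src.
	replay := io.MultiReader(bytes.NewReader(captured.Bytes()), src)
	full, err := io.ReadAll(replay)
	if err != nil {
		return "", "", err
	}
	return string(head[:k]), string(full), nil
}

// peekString is a convenience wrapper for string input.
func peekString(s string, n int) (string, string, error) {
	return peekThenReplay(strings.NewReader(s), n)
}

func main() {
	head, full, err := peekString("hello world", 5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("peeked %q, replayed %q\n", head, full)
}
```

The joined reader can only be consumed once more, so each additional pass needs another tee-and-join step, whereas a resettable reader exposes a single `io.Reader` that can be rewound.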

Aug 10 '22 17:08 bill-rich
