
Create decoder for HTML entities

Open rgmz opened this issue 4 months ago • 2 comments

Description:

This creates a decoder to handle HTML entities. Tests pass, but the implementation may not be the most efficient.

This fixes #2231.

Checklist:

  • [ ] Tests passing (make test-community)?
  • [ ] Lint passing (make lint; this requires golangci-lint)?

rgmz avatar Mar 10 '24 15:03 rgmz

I think we've reached the point where we should consider adding a --enabled/disabled-decoders flag, similar to what we have for detectors. This one seems pretty impactful on performance in its current state, and many data sources might not benefit much from it.
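A minimal sketch of how such a flag pair could filter decoders, assuming comma-separated values where an empty include list means "everything enabled". The helper selectDecoders and the decoder names are hypothetical illustrations, not TruffleHog's actual API or flag handling.

```go
package main

import (
	"fmt"
	"strings"
)

// selectDecoders is a hypothetical helper, not part of TruffleHog.
// all lists every known decoder name; include and exclude are the
// comma-separated flag values. An empty include enables everything,
// and exclude is applied afterwards.
func selectDecoders(all []string, include, exclude string) []string {
	toSet := func(csv string) map[string]bool {
		set := map[string]bool{}
		for _, name := range strings.Split(csv, ",") {
			if name = strings.ToLower(strings.TrimSpace(name)); name != "" {
				set[name] = true
			}
		}
		return set
	}
	inc, exc := toSet(include), toSet(exclude)

	var selected []string
	for _, name := range all {
		key := strings.ToLower(name)
		if (len(inc) == 0 || inc[key]) && !exc[key] {
			selected = append(selected, name)
		}
	}
	return selected
}

func main() {
	// Decoder names here are assumptions chosen for illustration.
	all := []string{"plain", "base64", "utf16", "html"}
	fmt.Println(selectDecoders(all, "", "html"))
}
```

With this shape, a data source that rarely contains HTML-encoded text could skip the costly decoder without affecting the others.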

One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.

dustin-decker avatar Mar 20 '24 15:03 dustin-decker

I do worry about the impact of having too many decoders. At a minimum, having something like ahocorasick might be more efficient than checking regexp.Match() against each chunk.
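As a rough illustration of avoiding a regexp call on every chunk, the sketch below puts a cheap byte scan in front of the pattern match. This is a plain standard-library guard rather than Aho-Corasick, and both entityPattern and the helper names are assumptions, not the code in this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// entityPattern is an illustrative pattern for named and numeric entities;
// it is an assumption, not the pattern used in the PR.
var entityPattern = regexp.MustCompile(`&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);`)

// mayContainEntity is a hypothetical guard: an entity needs an '&' followed
// later by a ';', so chunks without both can skip the regexp entirely.
func mayContainEntity(chunk []byte) bool {
	i := bytes.IndexByte(chunk, '&')
	if i < 0 {
		return false
	}
	return bytes.IndexByte(chunk[i+1:], ';') >= 0
}

// hasEntity runs the cheap guard first and the regexp only if it passes.
func hasEntity(chunk []byte) bool {
	return mayContainEntity(chunk) && entityPattern.Match(chunk)
}

func main() {
	fmt.Println(hasEntity([]byte("key=value; other=1")))
	fmt.Println(hasEntity([]byte("a=1&amp;b=2")))
}
```

The guard only helps chunks that lack an ampersand or a later semicolon; whether it beats a multi-pattern matcher on real data would need benchmarking.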

> One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.

While I think identifying the mimetype of a file would be a great addition (and make way for other enhancements), I'm not sure how much it would help in this case. HTML, Markdown, and AsciiDoc files are obviously sources that would benefit, but HTML-encoded content can show up in weird places like config files, .txt files, or source code.

This decoder was actually inspired by #1550; I found several live connection strings that TruffleHog did not detect because they contained an encoded &amp;amp; instead of a literal &amp;.

mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true
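To make the failure mode concrete, here is a small sketch that decodes such a string with Go's standard html.UnescapeString, assuming the connection string was stored with &amp;amp; in place of each ampersand. The decodeEntities helper is hypothetical and is not the decoder added in this PR.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntities is a hypothetical helper, not TruffleHog's decoder.
// It returns the unescaped text and whether anything changed, so a caller
// can skip re-scanning chunks that contained no entities.
func decodeEntities(s string) (string, bool) {
	decoded := html.UnescapeString(s)
	return decoded, decoded != s
}

func main() {
	// Assumed encoded form of the example connection string.
	encoded := "mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true"
	decoded, changed := decodeEntities(encoded)
	fmt.Println(changed)
	fmt.Println(decoded)
}
```

Returning the changed flag lets the unmodified chunk pass through untouched, which matters when most input contains no entities at all.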

rgmz avatar Mar 21 '24 03:03 rgmz