trufflehog
Create decoder for HTML entities
Description:
This creates a decoder to handle HTML entities. Tests pass, but the implementation may not be the most efficient.
This fixes #2231.
Checklist:
- [ ] Tests passing (`make test-community`)?
- [ ] Lint passing (`make lint`; this requires golangci-lint)?
I think we've reached the point where we should consider adding a `--enabled/disabled-decoders` flag, similar to what we have for detectors. This one seems pretty impactful on performance in its current state, and many data sources might not benefit much from it.
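To make the flag idea concrete, here is a minimal sketch of how include/exclude lists could filter a set of decoders. Everything in it is hypothetical: `selectDecoders`, the comma-separated list format, and the decoder names are placeholders for illustration, not TruffleHog's actual flags or API.

```go
package main

import (
	"fmt"
	"strings"
)

// selectDecoders is a hypothetical helper, not part of TruffleHog.
// Given all known decoder names plus comma-separated include and exclude
// lists, it returns the names that remain enabled. An empty include list
// means "everything is included by default".
func selectDecoders(all []string, include, exclude string) []string {
	toSet := func(csv string) map[string]bool {
		set := map[string]bool{}
		for _, name := range strings.Split(csv, ",") {
			if name = strings.ToLower(strings.TrimSpace(name)); name != "" {
				set[name] = true
			}
		}
		return set
	}

	inc, exc := toSet(include), toSet(exclude)
	var enabled []string
	for _, name := range all {
		key := strings.ToLower(name)
		if len(inc) > 0 && !inc[key] {
			continue // an include list was given and this decoder is not on it
		}
		if exc[key] {
			continue // explicitly disabled
		}
		enabled = append(enabled, name)
	}
	return enabled
}

func main() {
	// Placeholder decoder names, assumed for this example only.
	all := []string{"utf8", "base64", "html-entity"}
	fmt.Println(selectDecoders(all, "", "html-entity"))
}
```

With a filter like this, a costly decoder could be switched off for data sources that are unlikely to benefit from it.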
One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.
I do worry about the impact of having too many decoders. At a minimum, having something like ahocorasick might be more efficient than checking `regexp.Match()` against each chunk.
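As a rough illustration of the cost concern, the sketch below puts a cheap byte-level check in front of the regular expression, so chunks without any `&` never reach the regex engine. This is a simpler stand-in for an Aho-Corasick prefilter, not an implementation of one. The pattern and the function names are assumptions for this example and are not taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// entityPattern is a hypothetical pattern covering numeric (&#38;, &#x26;)
// and named (&amp;) entities. It is not the regex used in the PR.
var entityPattern = regexp.MustCompile(`&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);`)

// mayContainEntity is a cheap prefilter: an entity needs an '&' followed
// somewhere later by a ';'. Chunks failing this test cannot match.
func mayContainEntity(chunk []byte) bool {
	amp := bytes.IndexByte(chunk, '&')
	if amp < 0 {
		return false
	}
	return bytes.IndexByte(chunk[amp:], ';') >= 0
}

// hasEntity runs the regex only on chunks that pass the prefilter.
func hasEntity(chunk []byte) bool {
	return mayContainEntity(chunk) && entityPattern.Match(chunk)
}

func main() {
	fmt.Println(hasEntity([]byte("authMechanism=DEFAULT&amp;ssl=true")))
	fmt.Println(hasEntity([]byte("plain text without entities")))
}
```

Whether this helps in practice depends on how often chunks contain `&` at all, which would need benchmarking against real data sources.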
> One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.
While I think identifying the mimetype of a file would be a great addition (and make way for other enhancements), I'm not sure how much it would help in this case. HTML, Markdown, and AsciiDoc files are obviously sources that would benefit, but HTML-encoded content can show up in weird places like config files, `.txt` files, or source code.
This decoder was actually inspired by #1550; I found several live connection strings that were not detected by TruffleHog because they contained an encoded `&amp;` instead of a literal `&`.
mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&authSource=db&ssl=true"
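For reference, here is a minimal sketch of the decoding step using Go's standard `html.UnescapeString`. It assumes the leaked string carried `&amp;` wherever the example above shows `&`. The `decodeEntities` wrapper is a name chosen for this example; the decoder in this PR may be implemented differently.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntities returns s with HTML entities such as &amp; or &#38;
// replaced by the characters they represent. It is a thin wrapper around
// the standard library and only illustrates the idea.
func decodeEntities(s string) string {
	return html.UnescapeString(s)
}

func main() {
	// Assumed encoded form of the connection string from the example above.
	encoded := "mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true"
	fmt.Println(decodeEntities(encoded))
}
```

After decoding, the separators are literal `&` characters again, which is the form of the string that was going undetected when encoded.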