Update lightgrep scanner for bulk_extractor 2.0

Open juliapaluch opened this issue 2 years ago • 8 comments

This PR has the following functionality changes:

  • scan_lightgrep searches for user-specified keywords supplied via the -f and -F options. By default, it searches for both the UTF-8 and UTF-16LE encodings of each keyword, and matching is case-sensitive.
  • The following scanners have been deleted:
    • scan_accts_lg
    • scan_base16_lg
    • scan_email_lg
    • scan_gps_lg
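
To make the default matching behavior in the first bullet concrete, here is a minimal Python sketch of a case-sensitive keyword search over both the UTF-8 and UTF-16LE encodings. It is only an illustration of the idea: it does not use lightgrep or the bulk_extractor scanner API, and the names `keyword_patterns` and `find_keyword` are hypothetical.

```python
def keyword_patterns(keyword: str) -> dict:
    """Return the byte patterns for a keyword in UTF-8 and UTF-16LE."""
    return {
        "UTF-8": keyword.encode("utf-8"),
        "UTF-16LE": keyword.encode("utf-16-le"),
    }


def find_keyword(buf: bytes, keyword: str) -> list:
    """Case-sensitive search; returns sorted (offset, encoding) hits.

    Hypothetical helper for illustration only. The real scanner hands
    its patterns to lightgrep instead of calling bytes.find.
    """
    hits = []
    for encoding, pattern in keyword_patterns(keyword).items():
        start = buf.find(pattern)
        while start != -1:
            hits.append((start, encoding))
            start = buf.find(pattern, start + 1)
    return sorted(hits)


# Sample buffer: the keyword once in UTF-8, once in UTF-16LE,
# and once in upper case (which a case-sensitive search must skip).
sample = (
    b"xx"
    + "secret".encode("utf-8")
    + b"\x00\x00"
    + "secret".encode("utf-16-le")
    + "SECRET".encode("utf-8")
)

for offset, encoding in find_keyword(sample, "secret"):
    print(offset, encoding)
```

The same keyword yields two different byte patterns, which is why a single user-supplied term can produce hits in both plain-text and wide-character data.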

With the deletion of other lightgrep-based scanners, we were able to delete a lot of scaffolding code.

This PR is not yet ready, but we're opening it for comment. The following remains to be done:

  • [ ] Write build documentation
  • [ ] Specify a lightgrep release
  • [ ] Write scan_lightgrep usage documentation
  • [ ] Test the Windows build

Please let us know if you have any questions or comments.

juliapaluch, May 30 '23 18:05

I'm going to close this and re-open it as a draft PR.

simsong, May 30 '23 23:05

Apparently that's not how you do it. I found instructions here. It's a draft now.

simsong, May 30 '23 23:05

Codecov Report

Merging #421 (16e8eeb) into main (7935c41) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #421   +/-   ##
=======================================
  Coverage   47.94%   47.94%           
=======================================
  Files         112      112           
  Lines       13224    13224           
=======================================
  Hits         6339     6339           
  Misses       6885     6885           

codecov[bot], May 31 '23 00:05

I didn't know about draft PRs, TIL.

jonstewart, May 31 '23 00:05

Is this PR ready to go?

simsong, Nov 09 '23 11:11

[jeez, terrible formatting for reply-by-email]

Good question: yes, and no.

We think this PR works, but it depends on the current main branch of lightgrep. To make for a good user experience, we need to release a new version of lightgrep and then update this PR with updated build scripts that can pull that release.

The current plan is to get the new release of lightgrep out before the end of the year. It has been under continual development for the past few months, as a ~25% time project. It has several minor improvements and bug fixes (per the spirit of the ACM paper). If you’ve got a specific date in mind for a new bulk_extractor release, that would be good to know and we may be able to adjust.

We are not entirely confident in our usage of the new sbuf/scanner API. We would love a code review of this PR from you. We could also push up the requisite lightgrep code for you to test, if you’d prefer. 

jonstewart, Nov 09 '23 12:11

Hi. What's the status on this?

simsong, Jan 07 '24 21:01

We're getting ready to make a new lightgrep release for this to target. Can you review scan_lightgrep.cpp?

jonstewart, Jan 07 '24 22:01