semgrep icon indicating copy to clipboard operation
semgrep copied to clipboard

AST-based autofix

Open nmote opened this issue 3 years ago • 4 comments

This implements end-to-end AST-based autofix in semgrep-core. There are still numerous unhandled cases. The cases that I did handle are enough to illustrate the design, and are enough to address the specific bug reported in https://github.com/returntocorp/semgrep/issues/4964. By design, unhandled cases do not lead to crashes or incorrect results: Semgrep simply falls back on the existing text-based autofix.

I have left detailed comments inline, so I won't repeat myself here. In particular, for an outline of the design, see Autofix.ml.

I made changes to Map_AST and Visitor_AST to support visiting more nodes that are needed to perform autofix. I expect that we will need to add support for more nodes to these if we don't switch to autogenerated visitors via ppx deriving visitors.

I also added a fix and languages field to rule_id. I'm not sure if this is the best way to plumb that information down to JSON_report, but there is precedent for that decision with the message field. I'm not clear on why this data structure exists, rather than passing the full rule around.

Incorrect ranges attached to AST nodes will cause issues with this approach. Hopefully, the validation pass at the end will catch most of them. We'll address these issues as they come up, and we may also start using the ranges provided by tree sitter directly rather than reconstructing them from tokens. Incorrect ranges also cause problems with the existing text-based autofix.

I'm comfortable with the existing tests for this change, but as we improve autofix I think we'll want some better testing infrastructure. The existing end-to-end tests are thorough, but they are a bit clunky. It would be nice to have semgrep-core tests that verify that AST-based autofix is working correctly.

The snapshot change is caused by an additional log line that gets printed when we parse the fix pattern. It uses a temp file, and Semgrep prints a log line when that temp file is removed.

Test plan:

  • Existing automated tests, including e2e tests that test autofix.
  • Manually ran against the example in https://github.com/returntocorp/semgrep/issues/4964 (used only the first function call in the target file due to an unrelated issue interfering with the second) and verified that the extraneous comma is not added.
  • Some ad-hoc performance testing against the youtube-dl repository, enough to convince myself that all the parsing and matching logic dwarfs any additional work we're doing here for autofix. I can put together something more rigorous if reviewers would like.

PR checklist:

  • [x] Purpose of the code is evident to future readers
  • [x] Tests included or PR comment includes a reproducible test plan
  • [x] Documentation is up-to-date
  • [x] A changelog entry was added to changelog.d for any user-facing change
  • [x] Change has no security implications (otherwise, ping security team)

If you're unsure on any of this, please see:

nmote avatar Sep 21 '22 00:09 nmote

Sure, will do. This isn't ready for review yet by the way, that's why I have it marked as a draft. I just wanted to get the full CI running.

nmote avatar Sep 21 '22 15:09 nmote

Looks great! I've requested changes, but it's mostly for adding more comments in the .mli so the whole thing is clearer by just looking at the types and the comments in the .mli

aryx avatar Sep 22 '22 09:09 aryx

Thanks for the prompt and thorough review! I've responded inline and I'll go through and make the changes as soon as I can.

nmote avatar Sep 22 '22 16:09 nmote

I don't feel strongly. I'm fine with your design choices, I just wanted to point out alternatives (and maybe ask you to explain more those alternatives in comments and how your design improves over those naive alternatives). Looks good otherwise.

aryx avatar Sep 22 '22 16:09 aryx

Thanks, both of you, for your thorough reviews! I believe that I have addressed your concerns by improving readability and explanations. Let me know what you think!

nmote avatar Sep 27 '22 15:09 nmote

probably need to merge develop for fixing the pytest regressions.

aryx avatar Sep 27 '22 15:09 aryx

Just had to handle the new log line for autofix failure in the e2e test

nmote avatar Sep 27 '22 15:09 nmote

I've responded to your comments and made some changes. Back to you for another round!

nmote avatar Sep 28 '22 16:09 nmote

Merged!

aryx avatar Sep 28 '22 18:09 aryx