semgrep icon indicating copy to clipboard operation
semgrep copied to clipboard

Treat whitespace as significant inside strings?

Open sourcefrog opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe.

I want to match the contents of strings, treating whitespace as significant.

For example, in Terraform, I want to look for strings that contain just a single variable interpretation: matching "${$THING}" but not " ${$THING}".

It looks like semgrep always ignores whitespace?

https://semgrep.dev/playground/s/Dbnlo

Describe the solution you'd like

Maybe whitespace should always be significant inside strings? Or, maybe there should be an option to match it?

At least, it would be nice if whitespace handling was better documented.

Describe alternatives you've considered

I tried some variations with matching using regexps, but it's not great, and seems to give up a lot of the power of semgrep to understand a language's quoting rules.

Use case

I wanted to autofix "${thing}" to just thing. But more generally, I feel like understanding whitespace in strings correctly can be important for some security or code style rules.

Additional context

sourcefrog avatar Jul 05 '24 19:07 sourcefrog

I think the alternative we'd recommend is to match a metavariable + regex: https://semgrep.dev/docs/writing-rules/rule-syntax#metavariable-regex

Would that work for you? then you can still get all the semgrep matching power + autofix.

ievans avatar Jul 09 '24 19:07 ievans

I did try that but I think it's worth another go.

In general, falling back to regexps has the same problem with matching using regexps rather than semgrep: unwanted sensitivity to different syntactic differences. For example, for quoted strings, in different languages there are single vs double quotes, heredocs, triple quotes, raw strings, multi-line continuations, indented strings, etc, and accounting for them all in a regexp seems hard to get right, especially if you're trying to autofix. But in the specific case I was trying to fix here, where I'm only looking for single-line double quoted strings perhaps it will work fine.

sourcefrog avatar Jul 10 '24 21:07 sourcefrog

What I ended up with is

---
rules:
  # This generates a deprecation warning in Terraform >0.11; this rule is here mostly so that
  # we can autofix it.
  #
  # Example:
  # │ Warning: Quoted references are deprecated
  # │
  # │   on base.tf line 2, in data "aws_kms_alias" "instance_store_kms_east_qa":
  # │    2:   provider = "aws.east"
  # │
  # │ In this context, references are expected literally rather than in quotes.
  # │ Terraform 0.11 and earlier required quotes, but quoted references are now
  # │ deprecated and will be removed in a future version of Terraform. Remove the
  # │ quotes surrounding this reference to silence this warning.
  # │
  # │ (and 12 more similar warnings elsewhere)
  - id: unnecessary-quoted-reference
    message: >-
      Avoid quoted strings that contain just a single variable reference: just reference
      the variable instead.
    languages:
      - terraform
    severity: INFO
    patterns:
      # Per https://semgrep.dev/docs/writing-rules/rule-syntax#patterns-operator-evaluation-strategy,
      # the _positive_ patterns including `pattern` and `pattern-regex` are all evaluated and then
      # intersected with each other. The goal is to find semantically quoted strings (not e.g. inside a heredoc,
      # but possibly inside an interpolation expression) that are syntactically a single variable expansion
      # with no whitespace.
      - pattern: |
          "${$VAR}"
      # Only where the interpolation inserts a single named (and potentially dotted) variable
      - metavariable-regex:
          metavariable: '$VAR'
          regex: '^[A-Za-z0-9_.-]+$'
      # And only where the whole pattern is a single expansion with no whitespace.
      - pattern-regex: '"\$\{[A-Za-z0-9_.-]+\}"'
      # And, you can't just replace it on the left hand side of an object value, because
      # these are interpreted like symbols not variables.
      - pattern-not-inside: |
          {
            "${$VAR}": ...
          }
      - pattern-not-inside: |
          {
            "${$VAR}" = ...
          }
    # If everything matched, replace it with a simple variable reference.
    fix: $VAR

It looks like this does work, and thanks for the encouragement to try more with the regexp, but it feels a bit complicated. Both the metavariable regex and the pattern regex are needed to correctly filter down to the patterns that can be safely replaced.

sourcefrog avatar Jul 10 '24 22:07 sourcefrog