schemachange icon indicating copy to clipboard operation
schemachange copied to clipboard

Failure to redact a multi-line secret

Open rwberendsen opened this issue 11 months ago • 8 comments

Describe the bug Given this config file

config-version: 1

vars:
  secrets:
    azure_ad_certificate: |-
      foobarfoobarfoobar
      foobarfoobarfoobar

schemachange does print out the secret unredacted to stdout.

To Reproduce Steps to reproduce the behavior:

  1. Use the above config file
  2. Perform a dry run using the above config line
  3. Observe the unredacted secret being printed out

Expected behavior The secret should be printed as '*' characters, as many characters as the multiline secret has.

Screenshots Screenshot 2024-03-18 at 15 20 19

Schemachange (please complete the following information):

  • Version (e.g. 3.5.3): 3.6.1

Additional context Root cause hypothesis: The problem may occur because here:

https://github.com/Snowflake-Labs/schemachange/blob/de2bca1d10e960bf6d9d6580b8ade664fda581aa/schemachange/cli.py#L887

the variable dictionary is serialised as a YAML string, before secrets are redacted. But the serialisation may be altering the multiline string with respect to its representation as a deserialised string when it is added to the secrets here:

https://github.com/Snowflake-Labs/schemachange/blob/de2bca1d10e960bf6d9d6580b8ade664fda581aa/schemachange/cli.py#L802 .

Further testing would be needed to confirm.

As a direction for a fix, perhaps one idea is to mask the secret values in the deserialised representation, in the dictionary. Then serialise the dictionary that already contains masked secrets as a YAML. If this option is chosen, then also some of the tests will need to be rewritten, the only thing needing testing for the secret manager class itself is whether given input dictionaries with secrets it can produce matching output dictionaries with correctly masked secrets.

rwberendsen avatar Mar 18 '24 14:03 rwberendsen

SecretManager.global_redact is also called to redact SQL queries. So reducing its functionality to just redacting a dictionary may not suffice, at least not without enabling the same redacting use cases the class is supporting now.

rwberendsen avatar Mar 18 '24 15:03 rwberendsen

The reason is that YAML dump inserts leading spaces after newllines. But the deserialised version does not have leading spaces.

rwberendsen avatar Mar 18 '24 16:03 rwberendsen

A possible fix that would leave the SecretManager class unaltered would be to use the global_redact function differently when printing the variables to stdout. Rather than dumping config['vars'] as a YAML string and then redacting it, we could iterate over the dictionary and redact all str (non-dict) values in it. Then dump the result as a YAML.

The end result would be that a multiline secret gets transformed into a single line, because each \n character would also be converted to *.

rwberendsen avatar Mar 18 '24 16:03 rwberendsen

Another option would be to construct the replacement string in such way that newlines are preserved. This would make redacted output prettier, ar the expense of revealing the location of newlines in a mutiline secret. For example, in the extreme case where the secret contains only newlines, you would reveal the entire secret. Please give a thumbs up to the proposal you like best.

rwberendsen avatar Mar 18 '24 16:03 rwberendsen

Guess my proposal would be to also match indented multiline secrets; and to preserve newlines in masked output

rwberendsen avatar Mar 18 '24 17:03 rwberendsen

It seems in this line the assumption is that all secrets are of data type str:

https://github.com/Snowflake-Labs/schemachange/blob/de2bca1d10e960bf6d9d6580b8ade664fda581aa/schemachange/cli.py#L792

Would it be an idea to log a warning if a value is of a different type? e.g., an int, float, list, or what not? Or even go one step further and change the behaviour for this case, and not add the secret to the extracted secrets dictionary unless it is a str?

This is a side note, but, if it would be safe to assume secrets are strings, it just becomes easier to reason about. We don't have to worry about things like numbers looking different in the original config file vs how they look after being YAML loaded and then serialised again either as YAML or rendered by Jinja2 in SQL.

rwberendsen avatar Mar 19 '24 08:03 rwberendsen

Actually, schemachange crashes if you put, say, an int, as the value for the secret in the schema-config.yml. So, refraining from adding non-str values to the extracted_secrets set should not leave many folks with unmasked secrets, after they upgrade to a version that would include that change in behaviour.

rwberendsen avatar Mar 19 '24 09:03 rwberendsen

Have created a pull request from my fork for this issue, please review if you feel this issue should be tackled: https://github.com/Snowflake-Labs/schemachange/pull/238/files

rwberendsen avatar Mar 20 '24 09:03 rwberendsen