sentry-java icon indicating copy to clipboard operation
sentry-java copied to clipboard

Correctly filter URL with an email address in it

Open krystofwoldrich opened this issue 1 year ago • 12 comments

Description

A related issue in Dart.

  • https://github.com/getsentry/sentry-dart/issues/1418

This came up as an issue in sentry-dart but as the logic is the same it's likely an issue in sentry-java too.

We should check it and fix it if needed.

krystofwoldrich avatar May 02 '23 19:05 krystofwoldrich

Java SDK is affected as well, here's a test:

    @Test
    fun `keeps email address`() {
        val urlDetails = UrlUtils.parseNullable(
            "https://staging.server.com/api/v4/auth/password/reset/[email protected]"
        )!!
        assertEquals("https://staging.server.com/api/v4/auth/password/reset/[email protected]", urlDetails.url)
        assertNull(urlDetails.query)
        assertNull(urlDetails.fragment)
    }

with result:

Expected :https://staging.server.com/api/v4/auth/password/reset/[email protected]
Actual   :https://[Filtered]@example.com

adinauer avatar May 03 '23 04:05 adinauer

@adinauer will you fix it right away? people that rely on transaction name and span desc from this parser, this will be broken, if you figure out how, I can just replicate on Dart.

marandaneto avatar May 05 '23 11:05 marandaneto

Wasn't planning to do this soon. We can bump prio if you want.

adinauer avatar May 08 '23 05:05 adinauer

Wasn't planning to do this soon. We can bump prio if you want.

I believe the priority is rather important, otherwise, every span and transaction name will be wrong and not identified and they all be grouped together because the filtered URL is always the same every time the parsing breaks.

marandaneto avatar May 08 '23 09:05 marandaneto

Yeah, erring on the other side would mean leaking authority of URLs tho, so we have to be careful.

adinauer avatar May 08 '23 09:05 adinauer

Well the logic to strip out the authority is there, it's just a matter of fixing the parsing to keep the domain and its sub/folders, it's a bug in the parsing, not a matter of PII anymore.

marandaneto avatar May 08 '23 09:05 marandaneto

Sure it's a bug but fixing it may break the filtering. It's not quite easy to determine whether something is part of an email or authority of an URL as there doesn't seem to be a clear set of characters we can use.

Take e.g. this URL https://[email protected] there's no clear way to tell whether staging.server.com?email=someone is the username of the authority section and example.com is the host or whether staging.server.com is the host and [email protected] is the query string. There's https://www.rfc-editor.org/rfc/rfc1738 pages 18-20:

user           = *[ uchar | ";" | "?" | "&" | "=" ]
password       = *[ uchar | ";" | "?" | "&" | "=" ]
uchar          = unreserved | escape
unreserved     = alpha | digit | safe | extra
alpha          = lowalpha | hialpha
lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                 "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                 "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                 "y" | "z"
hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
escape         = "%" hex hex

So we could use / and # to help the regex but that doesn't solve https://[email protected]. Best appraoch will be to check if we can rewrite the code to use java.net.URL as it seems to be able to figure this out somehow. Haven't looked at the code in detail but seems a bit complicated at first glance.

adinauer avatar May 08 '23 10:05 adinauer

@adinauer yes, I'd suggest using the builtin Uri or URI class depending on what we need, on Dart, the Uri works pretty well:

  test('test', () {
    final uri =
        Uri.parse('https://staging.server.com/[email protected]');
    uri.queryParameters;

    uri.queryParameters.forEach((key, value) {
      expect(key, 'email');
      expect(value, '[email protected]');
    });
  });

So to be honest, this is there already, maybe we don't even need our own regexes.

marandaneto avatar May 08 '23 10:05 marandaneto

@adinauer What places should we scrubble the email for? Do you have any more insight on this?

stefanosiano avatar Nov 08 '23 14:11 stefanosiano

Should be a change in UrlUtils by using e.g. java.net.URL and letting it take care of parsing the URL.

adinauer avatar Nov 09 '23 09:11 adinauer

@adinauer And is it only for OkHttp integration? Or other integrations like apollo need it?

stefanosiano avatar Nov 09 '23 09:11 stefanosiano

This should be the same for all integrations using UrlUtils

adinauer avatar Nov 09 '23 11:11 adinauer