commons-lang
commons-lang copied to clipboard
LANG-1400: Add StringUtils.mask() function
My alternative implementation for #332 See https://issues.apache.org/jira/browse/LANG-1400 for details
Coverage increased (+0.02%) to 95.247% when pulling c81ae1d9857fa84952ac763879beed4598d2b371 on stokito:LANG-1400 into 8e8b8e05e4eb9aa009444c2fea3552d28b57aa98 on apache:master.
Hello @stokito, as mentioned in #332, would you consider an extra parameter, to clearly specify the minimum number of masked characters? From your implementation I can see that it is enforced, but it is limited, and it is not documented clearly either in the javadoc or by the parameter names.
The extra parameter can also make the method not being implicit on the unmasking of start/end (// we prefer to unmask from start and mask end if str len is smaller
) and choose to equally subtract from unmaskedStart
and unmaskedEnd
to enforce the minimum amount of masked characters.
Hi @greenman18523
would you consider an extra parameter, to clearly specify the minimum number of masked characters? For those use cases which I mentioned (masking credit cards and passwords) this looks not needed for me. Maybe you know some cases when this may be needed?
As I understood you are telling about more safety and do not unmask any symbol if incoming string is too short while implementation which I proposed will try to show at least some symbols from start.
For example mask("123456", 4, 4) = "12****"
which makes hidden symbols more guessable.
But, to be honest, if someone uses so short password then it doesn't matter if it will be shown.
Another solution in this case we can mask everything when str len is 6 < unmaskendStart 4 + unmaskedEnd 4. I.e. mask("123456", 4, 4) = "******"
. This is easier to understood but in the same time it still may be useful to unmask at least something but I don't think it's so critical.
What do you think about this proposition? E.g.
mask("12345678", 4, 4) = "********"
mask("123456789", 4, 4) = "1*********"
mask("1234567890", 4, 4) = "1234**7890"
mask("12345678901234", 4, 4) = "1234******1234"
I hope that unmaskedStart
and unmaskedEnd
in real life will be always reasonable (1-6) and the incoming string will be always bigger. We can actually restrict passing strings less that some length and throw an exception.
But from possible use cases it looks that mask()
function should be failsafe because it may be used just for logging of external input which can be anything and we shouldn't break it's processing. I even think about returning an empty string if null was passed.
Also we have to think about performance because I expect that the function will be widely used for in logging filters for any incoming request.
Hello @stokito
As I understood you are telling about more safety and do not unmask any symbol if incoming string is too short while implementation which I proposed will try to show at least some symbols from start. For example mask("123456", 4, 4) = "12****" which makes hidden symbols more guessable. But, to be honest, if someone uses so short password then it doesn't matter if it will be shown.
Yes, safety is my main concern. But in cases of arbitrary length data (e.g. names, addresses, messages), it's harder to say that one approach (the one implicitly specified in the method) on how many chars are masked or not is the correct. Also as a programmer I would like to have the flexibility to specify it, since regulations can differ world-wide.
But since your approach is meaningful in cases of specific-length data, I guess we can have two method, sharing one implementation.
public static String mask(final String str, int unmaskedStart, int unmaskedEnd, final char mask)
and
public static String mask(final String str, int unmaskedStart, int unmaskedEnd, int minMasked, final char mask)
with one calling the other.
I agree that the methods should be failsafe, otherwise we will need to place boilerplate code before calling them and I think it should be a one-liner, since as you say, main usage will be in logs.
Performance wise I think it is ok, and above all better to not rush on this matter, since it might not be needed.
P.S. I wouldn't use this for passwords, as anything can tip a malicious 3rd party, even the length. Better to not print anything.
Sounds reasonable in the same time it will not always obviously which minMasked to use. For example already mentioned credit card numbers in fact can have varying length as I remember up to 24 digits. As I understood you created this function for a specific case which you had. Can you provide some examples from real life when minMasked was needed?
it will not always obviously which minMasked to use.
I am sorry, but I didn't understand this, can you please further elaborate? minMasked
will be one, and it will be the absolute minimum of characters that will be masked.
Examples are SMS messages, e.g. some messages are short and contain one time passwords, or links to re-assign password. Such messages are generally very short, and since they have sensitive data, nothing can be logged, thus we have a minMasked value of 30. Also addresses, depending on how the user inputs it, this can be very few characters (usually just the street and the street number) or many (city, prefecture, country, etc), so when we print part of the address we need at least the first 15 characters to be masked.
The use cases might be specific, but considering this is part of a generic API the method should be flexible, in order to cover a wide variety of usecases.
Thank you for your examples. Just to clarify: maskedStart
from this PR corresponds to minMasked
from #332 but the discussion is about adding a maxUnmasked
, right?
some messages are short and contain one time passwords
it doesn't need for a masking at all. Short term generated values, OTPs and tokens like OAuth access_token
(but not refresh_token
) are safe to write to logs. If hacker stole logs we will have nothing to do with the data.
have a
minMasked
value of 30
Why exactly 30? What if sensitive data will be still shown in unmasked part? For this kind of data it's better to mask whole string.
when we print part of the address we need at least the first 15 characters to be masked
If I understood correctly, it's enough maskedStart
of 15 and maxUnmasked
not needed in the case.
a generic API the method should be flexible, in order to cover a wide variety of usecases.
What I'm afraid is that making the API too general leads to spending a time to read the docs, or incorrect understanding and misuse. As a developer I can clearly know how much chars to mask because I'm expect some kind of string but how much chars will be in real string is unknown variable for me, so maxUnmasked
will be always a speculation. For example java's StringBuilder
has a constructor with initial capacity and from my experience programmers are never even try to use it while this is critical for performance. So adding a new parameter maxUnmasked
may be not so useful in real life but confusing.
Just to clarify: maskedStart from this PR corresponds to minMasked from #332 but the discussion is about adding a maxUnmasked, right?
I don't see a maskedStart
parameters, these are the parameters I see in this PR: final String str, int unmaskedStart, int unmaskedEnd, final char mask
If you are referring to unmaskedStart
, then this is a parameter with a dual, non-obvious role, which is not good. There should be different parameters for different roles. If I a missing something from this PR, please elaborate.
some messages are short and contain one time passwords
it doesn't need for a masking at all. Short term generated values, OTPs and tokens like OAuth access_token (but not refresh_token) are safe to write to logs. If hacker stole logs we will have nothing to do with the data.
No, you must not write any sensitive data, even if they are valid for half a nanosecond. Every tiny info a 3rd party gains gives them more leverage on how to further proceed. Hackers don't always steal old logs, they can live monitor your logs also! For OTPs, there should be different endpoints where logging of messages is disabled. But, just in case something like that goes through the main channels. 30 is a best-effort estimation, based on the currently seen patterns for OTPs.
a new parameter maxUnmasked may be not so useful in real life but confusing.
I am referring to a minMasked
parameter, which by it's name states it's purpose. Which is the absolute minimum number of characters that must be masked.
Sounds reasonable in the same time it will not always obviously which minMasked to use. For example already mentioned credit card numbers in fact can have varying length as I remember up to 24 digits. As I understood I created this function for specific case which you had. Can you provide some examples from real life when minMasked was needed?
FYI, card numbers can vary from 13 to 19 digits. PCI DSS allows to log first 6 digits (BIN) and last 4. Here is proper maskifying Scala code:
val START_SYMBOLS = 6
val END_SYMBOLS = 4
val TOTAL_SYMBOLS: Int = START_SYMBOLS + END_SYMBOLS
private def stars(count: Int): String = "*" * count
def maskify(number): String =
if (number.length <= 6) {
number
} else {
val maskLength = (number.length - TOTAL_SYMBOLS).max(0)
val numberFromEnd = maskLength + START_SYMBOLS
number.substring(0, 6) + stars(maskLength) + number.substring(numberFromEnd)
}
As a result:
4242424242424242 ~> 424242******4242
424242424242424242 ~> 424242********4242
Any plans on this?
Needs discussion, see mailing list WRT to what belongs in Commons Lang vs. Commons Text.