big-list-of-naughty-strings

JSON cannot represent some naughty strings

Open ekimekim opened this issue 9 years ago • 8 comments

One of my favourite naughty strings is invalid UTF-8, for example a bare \xff. It's quite common to get 500s etc. on these, as no one ever bothers to check for Unicode decoding errors. However, because JSON requires all strings to be valid UTF-8, this example can only be included in the txt file.
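To make the problem concrete, here is a minimal Python sketch. Python and the helper names `is_valid_utf8` and `can_dump_as_json` are chosen purely for illustration and are not part of the blns scripts. It shows that a bare 0xFF byte fails UTF-8 decoding and that the standard `json` module refuses raw bytes.

```python
import json

RAW = b"\xff"  # a lone 0xFF byte can never occur in well-formed UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def can_dump_as_json(value) -> bool:
    """Return True if the standard json module can serialise the value."""
    try:
        json.dumps(value)
    except TypeError:
        # json.dumps rejects bytes objects outright
        return False
    return True
```

A server that calls `.decode("utf-8")` on request data without catching `UnicodeDecodeError` is the kind of code that tends to produce the 500s mentioned above.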

Would this be something worth including, with a special case in the script to omit it from the JSON? Or is it too naughty for blns?

EDIT: This HN comment (https://news.ycombinator.com/item?id=10035738) suggested having the JSON file be of b64 encoded strings. This is a good suggestion, and allows arbitrary naughty bytes to be used, at the cost of readability.
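A rough sketch of what that suggestion could look like, assuming the JSON file is simply a flat array of base64 strings (the function names and the sample list are made up for illustration):

```python
import base64
import json


def encode_samples(samples):
    """Serialise arbitrary byte strings as a JSON array of base64 text."""
    return json.dumps([base64.b64encode(s).decode("ascii") for s in samples])


def decode_samples(text):
    """Recover the original byte strings from the JSON array."""
    return [base64.b64decode(item) for item in json.loads(text)]


# Hypothetical samples: invalid UTF-8, a NUL byte, and ordinary UTF-8 text.
samples = [b"\xff", b"\x00", "Ω".encode("utf-8")]
encoded = encode_samples(samples)
```

The readability cost is visible straight away: the bare `\xff` shows up in the file as `/w==`, which tells a human reader nothing without decoding it.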

ekimekim avatar Aug 11 '15 08:08 ekimekim

I am open to a separate .json file for b64 strings. I'll look into it today.

minimaxir avatar Aug 11 '15 14:08 minimaxir

Bear in mind that base 64 represents bytes, not strings, and that strings always have an encoding.

"It does not make sense to have a string without knowing what encoding it uses."

There is no such thing as an invalid byte sequence; however, a given sequence may or may not conform to a particular encoding. If you go the route of encoding the strings as base 64, you may need to consider encoding each string in a few different ways: say, ASCII if applicable, UTF-8, and UTF-16 (and maybe ISO 8859-1 (Latin-1) or Windows-1252 (Western European)?).
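As an illustration of that point, this small Python sketch decodes one byte under several encodings. The helper `try_decode` is hypothetical, and the codec names are Python's own labels for the encodings listed above.

```python
from typing import Optional

RAW = b"\xe9"  # a single byte with no inherent meaning until an encoding is chosen


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes do not conform."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {
    enc: try_decode(RAW, enc)
    for enc in ("ascii", "utf-8", "utf-16", "latin-1", "cp1252")
}
```

The same byte is "é" under Latin-1 and Windows-1252 but fails to decode as ASCII, UTF-8, or UTF-16, so whether it counts as "invalid" depends entirely on which encoding the consumer assumes.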

floyd-may avatar Aug 18 '15 17:08 floyd-may

Yes, that's my point. Many systems assume all input is in a particular encoding (commonly utf-8) and may break if that is not the case. However, JSON cannot represent arbitrary byte content without some other form of encoding such as base64. I'll leave the bytes/strings distinction aside as it is an entirely semantic argument.

ekimekim avatar Aug 18 '15 17:08 ekimekim

I disagree that the bytes/strings distinction is entirely semantic. Invalidly encoded bytes and pathological (but valid) cases that should be handled properly ought to behave differently in most systems. My vote would be to have each ~~string~~ sample decorated in some way with its encoding (or null if it isn't valid). That way, invalid data (versus unusual data) can easily be identified. For example:

[
    { data: "<base 64 encoded stuff>", encoding: "ASCII" },
    { data: "<base 64 encoded stuff>", encoding: "UTF-16" },
    { data: "<base 64 encoded stuff>", encoding: "UTF-8" },
    { data: "<base 64 encoded stuff>", encoding: null },
    // naturally, lots and lots more
]
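A consumer of such a list might look roughly like the following Python sketch. The `data` and `encoding` keys come from the example above; `make_entry` and `load_entry` are hypothetical helpers, and the sketch assumes the encoding labels are names Python's codec registry recognises.

```python
import base64
import json


def make_entry(data: bytes, encoding):
    """Build one decorated sample; encoding=None marks deliberately invalid data."""
    if encoding is not None:
        data.decode(encoding)  # raises UnicodeDecodeError if the label is wrong
    return {"data": base64.b64encode(data).decode("ascii"), "encoding": encoding}


def load_entry(entry):
    """Return (raw_bytes, text); text is None when no encoding is declared."""
    raw = base64.b64decode(entry["data"])
    enc = entry["encoding"]
    text = raw.decode(enc) if enc is not None else None
    return raw, text


entries = [
    make_entry("héllo".encode("utf-16"), "UTF-16"),
    make_entry(b"\xff", None),
]
document = json.dumps(entries)
loaded = [load_entry(e) for e in json.loads(document)]
```

A test harness can then send `raw` to the system under test in both cases, and use `text is None` to decide whether a rejection is the expected outcome rather than a bug.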

floyd-may avatar Aug 18 '15 19:08 floyd-may

I don't see the problem with JSON being limited to only valid text, when the text file is already only valid text... I mean, a byte equal to 0 is already forbidden, let alone nice combinations of broken UTF-8 or UTF-16 that would be interesting to have.

Of course, that would require introducing some sort of escaping, which in turn would require adjusting the file to the new format. I guess that if Max didn't introduce the list with that format, there are other considerations I didn't realize, right?

suy avatar Jan 02 '16 16:01 suy

It's always possible to have two lists, with the legacy-compatible subset being merged into the machine-readable version of the full list by the build process.
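A build step along those lines could be sketched as follows. This is only an illustration: it assumes the legacy list is a text file with one string per line, where blank lines and lines starting with `#` are skipped, and that the byte-only extras are supplied separately. The function names are invented.

```python
import base64
import json


def parse_text_list(text: str) -> list:
    """Collect non-empty, non-comment lines (assumed layout of the legacy txt list)."""
    # Split on "\n" only, so samples containing other line-break characters survive.
    return [line for line in text.split("\n") if line and not line.startswith("#")]


def build_full_list(text: str, binary_only: list) -> str:
    """Merge the text-safe subset with byte-only samples into base64 JSON."""
    samples = [s.encode("utf-8") for s in parse_text_list(text)]
    samples.extend(binary_only)
    return json.dumps(
        [base64.b64encode(s).decode("ascii") for s in samples], indent=2
    )
```

The text list stays hand-editable and remains the source of truth for the legacy subset, while only the generated file needs to carry samples that plain text cannot hold.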

ssokolow avatar Jan 02 '16 17:01 ssokolow

Just to be clear, JSON no longer requires UTF-8, only valid Unicode. The UTF-8 requirement is from an older version of JSON.

Doesn't change much about this topic, but it's worth noting.

jfinkhaeuser avatar Jan 02 '16 17:01 jfinkhaeuser

Given that some software environments conflate bytes and strings, or naively assume ASCII (or well-formed UTF-8, or whatever), I think it makes perfect sense to include byte sequences in here that cause decoding issues.

I don't think that specifying an encoding is necessary, but having a comment explaining what the point of the byte sequence is would be useful, and that comment could mention the specific encoding that is being targeted.

timmc avatar May 24 '20 22:05 timmc