Certain language tags, such as en-US or zh_cn, are not matched by the regular expression EXT_VALUE_REGEXP
Hello, and very sorry about that! I tried to make the regular expression match the grammar as defined in RFC 5987. Looking at the spec, I don't believe either of those would be valid. Can you please point out where in the spec those would be valid values?
I'm so sorry, they seem to be outside this spec. I didn't know about the language tag spec before. But why can the Chrome browser parse the attachment's name correctly when I open the download URL from the address bar, even though the Content-Disposition field contains invalid language information?
I read the spec (https://datatracker.ietf.org/doc/html/rfc5646#section-2.1). It defines the region subtag as 2ALPHA or 3DIGIT. In other words, en-US meets the language tag specification. But in EXT_VALUE_REGEXP, the fragment (?:-[A-Za-z]{3}){0,3} requires every '-' to be followed by exactly three letters.
I believe these are the relevant specifications:
Content-Disposition: https://httpwg.org/specs/rfc6266.html#header.field.definition
Links here for the value: https://www.rfc-editor.org/rfc/rfc5987.html#section-3.2
Links here for the language portion: https://www.rfc-editor.org/rfc/rfc5646#section-2.1
Which has some handy examples we can use as reference: https://www.rfc-editor.org/rfc/rfc5646#page-80
I am pretty sure this means that 'attachment; filename*=UTF-8\'en-US\'%E2%82%AC%20rates.pdf' (based on our one language test) should be valid. I am going to pull in some of the other folks from the express team to validate this, but if folks agree I would like to land this fix before the next release if we can. I think it would be considered a breaking change though, since many inputs that currently throw would suddenly parse successfully.
It seems like we throw out the language portion anyway, so I wonder if there is actually any reason we parse it? It is "technically correct" but maybe functionally useless which would then also be slower for no reason. Am I missing something else here?
Yes, 'attachment; filename*=UTF-8\'en-US\'%E2%82%AC%20rates.pdf' should be valid based on the spec.
To be clear, the bug is in the parse method: it throws on a language tag that contains any subtag that is not exactly three letters long.
We can fix the regex to accept {2,3} for subtag, but that still doesn't cover all valid values such as what is in the examples given in rfc-editor.org/rfc/rfc5646#page-80.
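To make the difference concrete, here is a rough sketch comparing the two patterns. Both regexes cover only the language portion and are reconstructed from this discussion, so treat the names CURRENT_LANG and PATCHED_LANG as hypothetical rather than as the library's actual source.

```javascript
// Language portion only, reconstructed from this thread (an assumption,
// not copied from the library). The trailing empty alternative allows
// the language to be omitted, as RFC 5987 permits.
const CURRENT_LANG = /^(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4,8}|)$/

// Same pattern, but each following subtag may be 2 or 3 letters.
const PATCHED_LANG = /^(?:[A-Za-z]{2,3}(?:-[A-Za-z]{2,3}){0,3}|[A-Za-z]{4,8}|)$/

const tags = ['en', 'zh-cmn', 'en-US', 'zh-Hans', 'de-CH-1901']

// Record whether each tag is accepted before and after the small fix.
const results = tags.map((tag) => ({
  tag,
  current: CURRENT_LANG.test(tag),
  patched: PATCHED_LANG.test(tag)
}))

console.table(results)
```

With these patterns, en-US goes from rejected to accepted, while zh-Hans (4-letter script) and de-CH-1901 (digit variant) are rejected either way.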
It is true that we, and most clients, don't use the language tag in these headers. It is intended to convey localization information as metadata, but nobody does that.
I see the following options:
1. Fix for the Most Common Case: Update the regex so that the subtag in the language portion accepts 2 to 3 letters (i.e. {2,3}) instead of exactly 3. This would cover common tags like en-US, but not more complex examples from RFC 5646.
2. Full RFC 5646 Compliance: Replace the current language portion of the regex with a more comprehensive pattern that supports the full variety of valid language tags (e.g. allowing 2-letter regions, 3-digit codes, 4-letter scripts, and longer variants as defined in RFC 5646). This ensures that any regular valid language tag will be accepted. I'd stop short of the grandfathered tags, which by definition are irregular with respect to the structure defined for language tags. This means we still won't be able to 100% validate language tags when parsing and will have some false negatives.
3. Ignore the Language Tag: Since the language tag is not used by the library (nor by most clients), modify the parser to simply discard or ignore the language tag regardless of its content. This would mean accepting any extended value that follows the correct charset and value-chars, even if the language portion doesn't fully conform to RFC 5646.
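For a sense of what option 2 would involve, here is a minimal sketch of a pattern following the langtag and privateuse productions in RFC 5646 section 2.1, with grandfathered tags deliberately left out. The names LANGTAG and isLangtag are hypothetical, and this has not been checked against the full registry of examples.

```javascript
// Sketch of RFC 5646 langtag / privateuse (no grandfathered tags).
// LANGTAG and isLangtag are illustrative names, not library API.
const LANGTAG = new RegExp(
  '^(?:' +
    // language: 2-3 letters plus up to three extlang subtags, or 4-8 letters
    '(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4,8})' +
    // script: 4 letters
    '(?:-[A-Za-z]{4})?' +
    // region: 2 letters or 3 digits
    '(?:-(?:[A-Za-z]{2}|[0-9]{3}))?' +
    // variants: 5-8 alphanumerics, or a digit followed by 3 alphanumerics
    '(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*' +
    // extensions: a singleton (anything but x) followed by 2-8 char subtags
    '(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*' +
    // trailing private use: x followed by 1-8 char subtags
    '(?:-[xX](?:-[A-Za-z0-9]{1,8})+)?' +
    // or a tag that consists only of private use subtags
    '|[xX](?:-[A-Za-z0-9]{1,8})+' +
  ')$'
)

function isLangtag (tag) {
  return LANGTAG.test(tag)
}

console.log(isLangtag('en-US'))          // region subtag
console.log(isLangtag('zh-cmn-Hans-CN')) // extlang + script + region
console.log(isLangtag('zh_cn'))          // underscore is not a valid separator
```

Even as a sketch this is noticeably longer than the current language portion, which is part of the trade-off between options 2 and 3.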
Awesome, yes I am glad we agree then. So, for next steps I personally prefer number 3 because it would solve for the issues in number 2 (by ignoring it all) and also should enable a future version of this implementation to move away from slow regex to a faster parser which would also skip it (thus being more simple and a bit faster). Are there any downsides to option 3?
The only downside to option 3 is that we lose the ability to strictly "validate" the language tag. But:
- it's such an esoteric use case that I don't think anyone relies on it
- it was already broken, so anyone relying on it was relying on a tiny subset of the spec-valid inputs
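As a rough illustration of option 3, the language portion can be reduced to "anything between the two single quotes". This is a sketch under assumptions: EXT_VALUE_RELAXED and decodeExtValue are hypothetical names, the charset and value-chars classes follow the RFC 5987 grammar, and only UTF-8 is decoded here.

```javascript
// Relaxed ext-value pattern: charset ' <anything but a quote> ' value-chars.
// The language portion is matched but neither validated nor captured.
const EXT_VALUE_RELAXED = /^([A-Za-z0-9!#$%&+\-^_`{}~]+)'[^']*'((?:%[0-9A-Fa-f]{2}|[A-Za-z0-9!#$&+.^_`|~-])+)$/

function decodeExtValue (str) {
  const match = EXT_VALUE_RELAXED.exec(str)

  if (!match) {
    throw new TypeError('invalid extended field value')
  }

  const charset = match[1].toLowerCase()
  const encoded = match[2]

  // Assumption for this sketch: only UTF-8 is handled.
  if (charset !== 'utf-8') {
    throw new TypeError('unsupported charset in extended field')
  }

  // Percent-decoding as UTF-8 is exactly what decodeURIComponent does.
  return decodeURIComponent(encoded)
}

console.log(decodeExtValue("UTF-8'en-US'%E2%82%AC%20rates.pdf"))
```

Because a single quote is not a legal character in either the charset or the value-chars, the two quotes still delimit the three parts unambiguously even when the language is not inspected.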
Here's my final writeup of how I think this came to be, for my own satisfaction
Bug Analysis: Content-Disposition Extended Filename Parsing
The Content-Disposition header supports an extended parameter (filename*) as defined by RFC 5987. This extended value consists of:
- A required charset
- An optional language tag
- A percent-encoded value
The extended parameter (filename*) is used when a filename contains non-ASCII characters, ensuring proper encoding and internationalization.
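To show the three parts in practice, here is a small sketch that builds an extended value for a non-ASCII filename. The helper name encodeExtValue is hypothetical, and the sketch assumes UTF-8 with an empty language tag.

```javascript
// Build an RFC 5987 ext-value: charset ' [language] ' percent-encoded value.
// encodeExtValue is an illustrative helper, not part of any library API.
function encodeExtValue (filename) {
  // encodeURIComponent leaves ' ( ) * unescaped, but none of them are
  // allowed unencoded in value-chars, so escape them explicitly.
  const encoded = encodeURIComponent(filename)
    .replace(/['()*]/g, (c) => '%' + c.charCodeAt(0).toString(16).toUpperCase())

  // Charset is UTF-8; the language between the quotes is left empty.
  return "UTF-8''" + encoded
}

console.log('attachment; filename*=' + encodeExtValue('\u20AC rates.pdf'))
```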
RFC 5987 defers to RFC 5646 for the syntax of language tags. RFC 5646 defines a full language tag (langtag), which may include:
- A primary language subtag (2–3 letters)
- An optional extended language (extlang, up to two additional 3-letter subtags)
- Optional subtags for:
  - Script (e.g., "Hans")
  - Region (e.g., "US")
  - Variant, extension, and private-use
Root Cause of the Bug
The bug in our implementation originates from the overly restrictive regex used in the parse method. The regex was constructed by copying only a subset of RFC 5646's language tag definition:
language = 2*3ALPHA [ extlang ] / 4ALPHA / 5*8ALPHA
extlang = *3("-" 3ALPHA)
This approach omits the broader optional components of a full language tag such as script, region, variant, and private-use subtags.
Impact of the Bug
As a result, common and valid language tags like:
- "en-US" (where "US" is a region subtag)
- "zh-Hans" (where "Hans" is a script subtag)
are incorrectly rejected because the regex expects any subtag following a 2–3 letter primary language to be exactly 3 letters (matching the extlang format). This narrow interpretation fails for many valid RFC 5646 language tags.
Summary
The core issue is that the implementation copied only a narrow portion of RFC 5646's language tag definition, ignoring the optional subtags. This resulted in an overly strict regex that rejects valid header values during parsing, despite them being fully compliant with the specification.