MIME: Introduce MIME type parser

Open dmsnell opened this issue 1 month ago • 6 comments

Trac ticket: Core-64427

Introduces the WP_Mime_Sniffer class for parsing MIME types from sources such as HTTP Content-Type headers, unknown binary files, and more.

WP_Mime_Sniffer::from_declaration( $supplied_type ) for decoding HTTP Content-Type headers, HTML <meta http-equiv> and <script type> tags, RFC 822 headers, and more, where the string is an affirmation of the type of content that should be contained within some associated resource.
WP_Mime_Sniffer::from_file( $file_path ) for inferring MIME type from the “resource header” of a file at the given path where harmonizing server and browser behaviors is warranted, largely to eliminate security vulnerabilities.
WP_Mime_Sniffer::from_binary_file_contents( $file_contents ) for the same, but when the file data has already been loaded, e.g. on media file upload or via HTTP GET.
$mime_type->serialize() to produce a normalized version of a potentially-malformed input.
$mime_type->minimize() to produce a privacy-sensitive stripped-down version of the MIME type suitable for use in APIs like PerformanceResourceTiming.
$mime_type->get_indicated_charset() to return a canonical character encoding referenced by the MIME type, if included and recognized.
A family of methods to indicate whether a MIME type belongs to a common class, such as $mime_type->is_json() and $mime_type->is_javascript().
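To illustrate what the classification methods might check, here is a minimal standalone sketch based on the "JSON MIME type" and "JavaScript MIME type" definitions in the WHATWG MIME Sniffing Standard. The function names sketch_mime_essence(), sketch_is_json(), and sketch_is_javascript() are hypothetical and not part of the patch. The essence extraction is deliberately simplified and ignores parameters, so the real is_json() and is_javascript() may behave differently.

```php
<?php
// Hypothetical helper: returns the lowercased "type/subtype" essence, or null if malformed.
function sketch_mime_essence( string $declaration ): ?string {
	// Drop parameters, strip HTTP whitespace, and lowercase what remains.
	$essence = strtolower( trim( explode( ';', $declaration, 2 )[0], " \t\r\n" ) );

	// Type and subtype must each be a non-empty run of HTTP token characters.
	$token = '[!#$%&\'*+\-.^_`|~0-9a-z]+';
	if ( 1 !== preg_match( '@^' . $token . '/' . $token . '$@D', $essence ) ) {
		return null;
	}

	return $essence;
}

// Hypothetical helper: essence is application/json or text/json, or subtype ends in "+json".
function sketch_is_json( string $declaration ): bool {
	$essence = sketch_mime_essence( $declaration );
	if ( null === $essence ) {
		return false;
	}

	return 'application/json' === $essence
		|| 'text/json' === $essence
		|| '+json' === substr( $essence, -5 );
}

// Hypothetical helper: essence is one of the JavaScript MIME type essence strings.
function sketch_is_javascript( string $declaration ): bool {
	$javascript_essences = array(
		'application/ecmascript', 'application/javascript',
		'application/x-ecmascript', 'application/x-javascript',
		'text/ecmascript', 'text/javascript',
		'text/javascript1.0', 'text/javascript1.1', 'text/javascript1.2',
		'text/javascript1.3', 'text/javascript1.4', 'text/javascript1.5',
		'text/jscript', 'text/livescript',
		'text/x-ecmascript', 'text/x-javascript',
	);

	return in_array( sketch_mime_essence( $declaration ), $javascript_essences, true );
}
```

For example, sketch_is_json( 'application/ld+json; charset=utf-8' ) is true because the subtype ends in "+json", while sketch_is_json( 'text/html' ) is false.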

The ::declaring_javascript() and ::declaring_json() methods are interesting and might be worth emphasizing over from_declaration() if they stay in the patch. They only return a parsed MIME type if given something that matches those classes.

if ( WP_Mime_Sniffer::declaring_json( $content_type ) ) {
	$response = json_decode( $response );
}

[ ] Add ::from_http_headers_string( string $headers ) ?
[ ] Add ::from_http_headers_array( array<string> $headers ) ?

These two methods could simplify code that infers a content type, without it needing to know the details of Content-Type parsing: in download_url(), in SimplePie, in discover_pingback_server_uri(), even in wp_staticize_emoji_for_email()! Adopting them would also update WP_REST_Request::get_content_type() and wp_finalize_template_enhancement_output_buffer().
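As a rough idea of what the array variant would spare callers from writing, here is a minimal sketch that pulls the raw Content-Type value out of a list of header lines. The name sketch_content_type_from_headers() is hypothetical. Returning the last matching header is an assumption made for illustration, and the proposed method might resolve repeated headers differently.

```php
<?php
// Hypothetical helper: returns the raw value of the last Content-Type header line, or null.
function sketch_content_type_from_headers( array $headers ): ?string {
	$found = null;

	foreach ( $headers as $header ) {
		// Split "Name: value" at the first colon; skip lines without one (e.g. the status line).
		$parts = explode( ':', $header, 2 );
		if ( 2 !== count( $parts ) ) {
			continue;
		}

		// Header names are case-insensitive.
		if ( 0 === strcasecmp( $parts[0], 'Content-Type' ) ) {
			// Assumption: a later header overrides an earlier one.
			$found = trim( $parts[1], " \t\r\n" );
		}
	}

	return $found;
}
```

The returned string is still an unparsed declaration; in the patch it would then be handed to something like WP_Mime_Sniffer::from_declaration().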

The encoding part unlocks non-UTF-8 inputs in the HTML API, which currently gives up with $this->bail( 'Cannot yet process META tags with http-equiv Content-Type to determine encoding.' );
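For context, here is a minimal sketch of pulling a charset label out of the content attribute of a META http-equiv Content-Type tag. It is loosely modeled on the HTML Standard's algorithm for extracting a character encoding from a meta element. The name sketch_charset_label_from_meta_content() is hypothetical, and the function returns only the lowercased label without mapping it to a canonical encoding name, so it is not a stand-in for get_indicated_charset().

```php
<?php
// Hypothetical helper: returns the lowercased charset label found in the string, or null.
function sketch_charset_label_from_meta_content( string $content ): ?string {
	// "charset", optional whitespace, "=", optional whitespace, then a quoted or unquoted value.
	// An unquoted value may not start with a quote, so an unclosed quote yields no match.
	$ws      = '[\t\n\f\r ]*';
	$pattern = '/charset' . $ws . '=' . $ws
		. '(?:"([^"]*)"|\'([^\']*)\'|([^\t\n\f\r ;"\'][^\t\n\f\r ;]*))/i';

	if ( 1 !== preg_match( $pattern, $content, $matches ) ) {
		return null;
	}

	// Exactly one of the three capture groups participated; pick the non-empty one.
	foreach ( array( 3, 2, 1 ) as $group ) {
		if ( isset( $matches[ $group ] ) && '' !== $matches[ $group ] ) {
			return strtolower( $matches[ $group ] );
		}
	}

	return null;
}
```

Given 'text/html; charset=windows-1252' this returns 'windows-1252'. Given 'text/html' alone it returns null.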

Most of the labeled encodings are supported by the version of PHP running on my computer with the mbstring and iconv extensions. The unsupported ones:

ISO-8859-8-I is a variant of ISO-8859-8 which might be textually identical and possibly only specifies meta sequences based on the C0/C1 controls.
replacement groups security-risky encoding labels into a decoder that always fails. When decoded, the output is always '' (empty string).
x-user-defined is a mapping of non-US-ASCII bytes (0x80–0xFF) into the private-use area at U+F780–U+F7FF, i.e. code point 0xF780 + byte − 0x80.
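Since x-user-defined is a simple arithmetic mapping, it can be decoded without mbstring or iconv support. The sketch below follows the x-user-defined decoder in the WHATWG Encoding Standard, which leaves ASCII bytes unchanged and maps each byte from 0x80 to 0xFF to the code point 0xF780 + byte − 0x80. The function name sketch_decode_x_user_defined() is hypothetical and not part of the patch.

```php
<?php
// Hypothetical helper: decodes x-user-defined bytes into a UTF-8 string.
function sketch_decode_x_user_defined( string $bytes ): string {
	$output = '';
	$length = strlen( $bytes );

	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );

		// ASCII bytes pass through unchanged.
		if ( $byte < 0x80 ) {
			$output .= $bytes[ $i ];
			continue;
		}

		// Bytes 0x80-0xFF map to the private-use range U+F780-U+F7FF.
		$code_point = 0xF780 + $byte - 0x80;

		// Every code point in that range encodes as three UTF-8 bytes.
		$output .= chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	return $output;
}
```

For example, the input "A\x80\xFF" decodes to "A" followed by U+F780 and U+F7FF, which is "A\xEF\x9E\x80\xEF\x9F\xBF" in UTF-8.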

Dec 16 '25 21:12 dmsnell