amp-toolbox-php
amp-toolbox-php copied to clipboard
mb_detect_encoding is very slow
Profiling the current test suite showed that the use of mb_detect_encoding() made up more than 40% of the execution time of the entire test suite.
We should:
- [ ] ensure we only use it if really necessary to decrease the amount of times it runs
- [x] document that it is slow and that documents should therefore use a proper charset meta tag to begin with
- [ ] make sure a charset is set in integration like the WP plugin to avoid its use altogether in those cases
It seems like this will become even more of a problem with PHP 8.1+, as the detection mechanism has changed to be more precise.
Re:
ensure we only use it if really necessary to decrease the amount of times it runs
@ediamin Can you please document the exact scenarios here where mb_detect_encoding is being used, and then see whether its usage can be reduced in some way?
Currently we are using mb_detect_encoding only in DocumentEncoding document filter. It'll use to detect the encoding of the document if:
- There is no charset meta tag present in head and we do not set any charset when we use the
Documentclass. For example,
$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = '';
$document = Document::fromHtml($source, $charset);
- We set
autoas the charset either in source or Document param,
$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = 'auto';
// or
$source = '<!DOCTYPE html><html><head><meta charset="auto"></head><body>hello world</body></html>';
$charset = '';
Additionally if we do not set charset as utf-8 which is required by the AMP, the source will convert to utf-8 from the provided encoding. So for the optimized performance, we should always use the UTF-8 charset in the source or Document param, or at least provide a valid charset other than the auto.
This is something we should keep in mind and actually flag as an issue in the PXE analysis.