amp-toolbox-php icon indicating copy to clipboard operation
amp-toolbox-php copied to clipboard

mb_detect_encoding is very slow

Open schlessera opened this issue 4 years ago • 3 comments

Profiling the current test suite showed that the use of mb_detect_encoding() made up more than 40% of the execution time of the entire test suite.

We should:

  • [ ] ensure we only use it if really necessary to decrease the amount of times it runs
  • [x] document that it is slow and that documents should therefore use a proper charset meta tag to begin with
  • [ ] make sure a charset is set in integration like the WP plugin to avoid its use altogether in those cases

schlessera avatar Mar 04 '21 18:03 schlessera

It seems like this will become even more of a problem with PHP 8.1+, as the detection mechanism has changed to be more precise.

Re:

ensure we only use it if really necessary to decrease the amount of times it runs

@ediamin Can you please document the exact scenarios here where mb_detect_encoding is being used, and then see whether its usage can be reduced in some way?

schlessera avatar Oct 19 '21 15:10 schlessera

Currently we are using mb_detect_encoding only in DocumentEncoding document filter. It'll use to detect the encoding of the document if:

  1. There is no charset meta tag present in head and we do not set any charset when we use the Document class. For example,
$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = '';
$document = Document::fromHtml($source, $charset);
  1. We set auto as the charset either in source or Document param,
$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = 'auto';
// or
$source = '<!DOCTYPE html><html><head><meta charset="auto"></head><body>hello world</body></html>';
$charset = '';

Additionally if we do not set charset as utf-8 which is required by the AMP, the source will convert to utf-8 from the provided encoding. So for the optimized performance, we should always use the UTF-8 charset in the source or Document param, or at least provide a valid charset other than the auto.

ediamin avatar Nov 02 '21 10:11 ediamin

This is something we should keep in mind and actually flag as an issue in the PXE analysis.

schlessera avatar Nov 02 '21 14:11 schlessera