html5-php icon indicating copy to clipboard operation
html5-php copied to clipboard

Get rid of ctype dependency

Open alecpl opened this issue 6 years ago • 9 comments

This library uses ctype_alpha() function in two places.

  1. If I'm not wrong this function is locale dependent and I don't see any setlocale() calls. So, would be better to have something that does not depend on locale.
  2. If we got rid of this function use the library would not need Ctype extension which is not always installed by default.

I think it should be easy to replace this function with something based on ord() or str[c]spn().

alecpl avatar Nov 20 '19 18:11 alecpl

I see that the function is used with single-character input. For this use-case the best I could find is:

function is_alpha($input)
{
    $code = ord($input);
    return ($code >= 97 && $code <= 122) || ($code >= 65 && $code <= 90);
}

It is 2x slower than ctype_alpha() however for 1 million calls it's 0.08 sec. vs 0.04 sec. So, I'd go for it.

alecpl avatar Nov 22 '19 12:11 alecpl

Will be happy to get rid of the dependency.

Can we make this an optional dependency, using something as function_exists, or use a static closure with the dependency already resolved to a polyfill function or to the extension?

goetas avatar Nov 22 '19 15:11 goetas

Polyfill exists - https://github.com/symfony/polyfill-ctype/blob/master/Ctype.php. But it is slower than my version which is optimized for use with single byte string.

As for function_exists... because of the potential locale-related issue I would rather get rid of ctype_alpha. Using my method will be an easy change. Maybe it would be possible to get rid of that function by changing the code that uses it, e.g. using some Scanner methods instead, but at a first look it does not seem to be that simple.

alecpl avatar Nov 25 '19 07:11 alecpl

I have two solutions, both have 2% impact on performance:

  1. Use a simple method in place of ctype_alpha:
--- a/src/HTML5/Parser/Tokenizer.php
+++ b/src/HTML5/Parser/Tokenizer.php
@@ -137,7 +137,7 @@ class Tokenizer
                 $this->endTag();
             } elseif ('?' === $tok) {
                 $this->processingInstruction();
-            } elseif (ctype_alpha($tok)) {
+            } elseif ($this->is_alpha($tok)) {
                 $this->tagName();
             } else {
                 $this->parseError('Illegal tag opening');
@@ -347,7 +347,7 @@ class Tokenizer
         // > -> parse error
         // EOF -> parse error
         // -> parse error
-        if (!ctype_alpha($tok)) {
+        if (!$this->is_alpha($tok)) {
             $this->parseError("Expected tag name, got '%s'", $tok);
             if ("\0" == $tok || false === $tok) {
                 return false;
@@ -1186,4 +1186,10 @@ class Tokenizer
 
         return '&' . $entity;
     }
+
+    protected function is_alpha($input)
+    {
+        $code = ord($input);
+        return ($code >= 97 && $code <= 122) || ($code >= 65 && $code <= 90);
+    }
 }

  1. Rewrite the code to not check if current char is alpha:
--- a/src/HTML5/Parser/Tokenizer.php
+++ b/src/HTML5/Parser/Tokenizer.php
@@ -137,9 +137,7 @@ class Tokenizer
                 $this->endTag();
             } elseif ('?' === $tok) {
                 $this->processingInstruction();
-            } elseif (ctype_alpha($tok)) {
-                $this->tagName();
-            } else {
+            } elseif (false === $this->tagName()) {
                 $this->parseError('Illegal tag opening');
                 // TODO is this necessary ?
                 $this->characterData();
@@ -343,20 +341,24 @@ class Tokenizer
         }
         $tok = $this->scanner->next();
.
-        // a-zA-Z -> tagname
-        // > -> parse error
         // EOF -> parse error
         // -> parse error
-        if (!ctype_alpha($tok)) {
+        if ("\0" == $tok || false === $tok) {
+            $this->parseError("Expected tag name, got '%s'", $tok);
+
+            return false;
+        }
+
+        // Get tag name (or just it's start)
+        $name = $this->scanner->getAsciiAlpha();
+
+        if (false === $name || '' === $name) {
             $this->parseError("Expected tag name, got '%s'", $tok);
-            if ("\0" == $tok || false === $tok) {
-                return false;
-            }
.
             return $this->bogusComment('</');
         }
.
-        $name = $this->scanner->charsUntil("\n\f \t>");
+        $name .= $this->scanner->charsUntil("\n\f \t>");
         $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
         // Trash whitespace.
         $this->scanner->whitespace();
@@ -379,8 +381,15 @@ class Tokenizer
      */
     protected function tagName()
     {
-        // We know this is at least one char.
-        $name = $this->scanner->charsWhile(':_-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz');
+        // Get tag name (or just it's start)
+        $name = $this->scanner->getAsciiAlpha();
+
+        if (false === $name || '' === $name) {
+            return false;
+        }
+
+        // Get the rest of tag name
+        $name .= $this->scanner->charsWhile(':_-' . Scanner::CHARS_ALNUM);
         $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
         $attributes = array();
         $selfClose = false;

The second patch might require some more love. For example I see that Scanner methods return false on EOF. I think this is not really needed (could return an empty string) and not properly documented.

alecpl avatar Nov 25 '19 09:11 alecpl

Extending the first patch with static function_exists check for ctype_alpha does not really have an impact on performance. It's still 1-2 % slower.

Performance test done using run.php script on PHP 7.3.

alecpl avatar Nov 25 '19 09:11 alecpl

To have any performance benefits, you'd probably have to do that in the constructor and assign a callable to a property to avoid additional function calls. Also, the polyfill would have to support ctype_alpha's signature to avoid an additional wrapper function.

mundschenk-at avatar Nov 25 '19 09:11 mundschenk-at

We can also use https://packagist.org/packages/symfony/polyfill-ctype instead.

IonBazan avatar Feb 03 '20 06:02 IonBazan

Hm, that seems nice

goetas avatar Feb 03 '20 08:02 goetas

@alecpl ~~can you try by using \ord() in your first solution instead ? The ord() function is one of the compiler-optimized functions that gets a special opcode when its usage does not rely on the fallback to the global function.~~ should not change anything here. The optimized case is when the argument is a static string, not a variable.

stof avatar Jun 09 '21 17:06 stof