`HashSet<LocalName>` for classes is slow
`LocalName::new`, when called with a name that isn't in https://github.com/servo/html5ever/blob/master/markup5ever/local_names.txt, locks a global Mutex. Since `LocalName::new` is called for every class in `Element::new`, and most classes are unlikely to be in that list, multithreading is much less of a win for HTML parsing than it should be. Although it's a breaking change and probably not the best approach, I've switched to using `String`s locally, which gives me about a 10% performance improvement. My program's runtime is still dominated by `HashSet` construction, though. It might be faster to intern entire `HashSet<String>`s per document.
It would probably also be faster to use a different hasher for the `HashSet`. Here are some benchmarks: https://github.com/shepmaster/twox-hash#benchmarks
Or getting O(1) lookup through perfect hashing with https://crates.io/crates/phf
Or using a prefix tree (also called a "trie"; lookup cost is proportional to the key length rather than the number of stored keys). There seem to be many crates for this, e.g.:
https://crates.io/crates/trie-rs
https://crates.io/crates/tst
https://crates.io/crates/prefix-tree
https://crates.io/crates/qp-trie
https://crates.io/crates/radix_trie
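For illustration, here is a minimal std-only trie sketch (not one of the crates above, and simplified to byte-keyed lookup) showing why lookup cost depends only on the key length:

```rust
use std::collections::HashMap;

/// Minimal prefix-tree (trie) node over the bytes of a key.
#[derive(Default)]
struct Trie {
    children: HashMap<u8, Trie>,
    is_end: bool,
}

impl Trie {
    /// Insert a key, creating one node per byte along the path.
    fn insert(&mut self, key: &str) {
        let mut node = self;
        for b in key.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.is_end = true;
    }

    /// Lookup walks one node per byte: the cost is proportional to the
    /// key length, independent of how many keys are stored.
    fn contains(&self, key: &str) -> bool {
        let mut node = self;
        for b in key.bytes() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.is_end
    }
}
```

Whether this beats a hash set in practice depends on key lengths and cache behavior, which is why benchmarking the crates above would still be needed.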
These (pretty hacky) changes shave about 20% off my program's runtime:
diff --git a/Cargo.toml b/Cargo.toml
index 56112ea..221e445 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -23,6 +23,7 @@ matches = "0.1.6"
selectors = "0.22"
smallvec = "1"
tendril = "0.4"
+fnv = "1.0.7"
[dependencies.getopts]
version = "0.2.18"
diff --git a/src/node.rs b/src/node.rs
index 103fba8..bd255a9 100644
--- a/src/node.rs
+++ b/src/node.rs
@@ -1,7 +1,6 @@
//! HTML nodes.
use std::collections::{hash_map, hash_set};
-use std::collections::{HashMap, HashSet};
use std::fmt;
use std::ops::Deref;
@@ -10,6 +9,8 @@ use html5ever::{Attribute, LocalName, QualName};
use selectors::attr::CaseSensitivity;
+use fnv::{FnvHashSet, FnvHashMap};
+
/// An HTML node.
#[derive(Clone, PartialEq, Eq)]
pub enum Node {
@@ -231,13 +232,13 @@ pub struct Element {
pub name: QualName,
/// The element ID.
- pub id: Option<LocalName>,
+ pub id: Option<String>,
/// The element classes.
- pub classes: HashSet<LocalName>,
+ pub classes: FnvHashSet<String>,
/// The element attributes.
- pub attrs: HashMap<QualName, StrTendril>,
+ pub attrs: FnvHashMap<QualName, StrTendril>,
}
impl Element {
@@ -246,16 +247,17 @@ impl Element {
let id = attrs
.iter()
.find(|a| a.name.local.deref() == "id")
- .map(|a| LocalName::from(a.value.deref()));
+ .map(|a| String::from(a.value.deref()));
- let classes: HashSet<LocalName> = attrs
+ let classes: FnvHashSet<String> = attrs
.iter()
.find(|a| a.name.local.deref() == "class")
- .map_or(HashSet::new(), |a| {
+ .map_or(FnvHashSet::default(), |a| {
a.value
.deref()
- .split_whitespace()
- .map(LocalName::from)
+ .split(' ')
+ .filter(|x| !x.is_empty())
+ .map(String::from)
.collect()
});
@@ -308,7 +310,7 @@ impl Element {
#[allow(missing_debug_implementations)]
#[derive(Clone)]
pub struct Classes<'a> {
- inner: hash_set::Iter<'a, LocalName>,
+ inner: hash_set::Iter<'a, String>,
}
impl<'a> Iterator for Classes<'a> {
Most of the remaining time is taken by `HashSet` internals. A larger refactor would probably be necessary to get rid of that.
Now that I think about it, most programs bottlenecked by parsing will likely be parsing many HTML documents. I wonder if having a thread-local `RefCell<HashMap<String, HashSet<String>>>`, where the `HashSet`s are cloned while parsing, would be a good idea? This would still incur an allocation, but it would mean that every other part of `HashSet` construction would only happen a few times per execution.
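A std-only sketch of that idea (the names `CLASS_CACHE` and `parse_classes` are hypothetical, not part of the crate): cache the parsed set keyed by the raw `class` attribute string, and clone it on a hit.

```rust
use std::cell::RefCell;
use std::collections::{HashMap, HashSet};

thread_local! {
    // Hypothetical cache: raw `class` attribute string -> parsed class set.
    static CLASS_CACHE: RefCell<HashMap<String, HashSet<String>>> =
        RefCell::new(HashMap::new());
}

fn parse_classes(class_attr: &str) -> HashSet<String> {
    CLASS_CACHE.with(|cache| {
        cache
            .borrow_mut()
            .entry(class_attr.to_owned())
            .or_insert_with(|| {
                class_attr.split_whitespace().map(String::from).collect()
            })
            // A hit still clones the whole set, so each element pays
            // for one allocation per class string in the set.
            .clone()
    })
}
```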
Note: you can just write `use fnv::{FnvHashMap as HashMap, FnvHashSet as HashSet};` as a drop-in replacement, which keeps the diff minimal.
> Now that I think about it, most programs bottlenecked by parsing will likely be parsing many HTML documents. I wonder if having a thread-local `RefCell<HashMap<String, HashSet<String>>>` where the `HashSet`s are cloned while parsing would be a good idea? This would still incur allocation, but it would mean that every other part of `HashSet` construction would only occur a few times per execution.
I implemented this locally and it was only a very small improvement. Making a non-trivial allocation for every element is just slow...
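For what it's worth, one way to avoid the per-element clone (a sketch under the assumption that elements only need single-threaded access, since `Rc` is not `Send`; names again hypothetical) would be to store the cached set behind an `Rc`, so a cache hit is just a reference-count bump:

```rust
use std::cell::RefCell;
use std::collections::{HashMap, HashSet};
use std::rc::Rc;

thread_local! {
    // Hypothetical cache: raw `class` attribute -> shared, parsed class set.
    static CLASS_CACHE: RefCell<HashMap<String, Rc<HashSet<String>>>> =
        RefCell::new(HashMap::new());
}

fn parse_classes_shared(class_attr: &str) -> Rc<HashSet<String>> {
    CLASS_CACHE.with(|cache| {
        Rc::clone(
            cache
                .borrow_mut()
                .entry(class_attr.to_owned())
                .or_insert_with(|| {
                    // Only the first element with this exact attribute
                    // value pays for parsing and allocation.
                    Rc::new(class_attr.split_whitespace().map(String::from).collect())
                }),
        )
    })
}
```

The trade-off is that `Element` would then hold an `Rc`, which is a breaking change to the public field type and rules out sending elements across threads.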