nickel icon indicating copy to clipboard operation
nickel copied to clipboard

String interning

Open fuzzypixelz opened this issue 3 years ago • 2 comments

Why?

Nickel terms contain two kinds of strings. The first one corresponds to string literals in source code (and even interpolated strings, through StrChunk). The second kind is Ident, which is used for both user-defined variables and symbols generated through closurization.

Interning strings would provide the following benefits:

  • Reduce allocations and memory usage through de-duplication.
  • Make comparisons cheap by using a Symbol type such as u32.

Integration in the current pipeline

Ideally, LALRPOP would be supplied an Interner data structure, which it will during parsing to generate an AST where Ident and String are replaced by a Symbol type. Subsequent consumers of the AST would have access to the same Interner to resolve Symbol -> &str as needed.

This implies at least two methods on Interner: intern and resolve. Another operation might be static_intern, which would be useful for symbols present in the standard library.

Solution space

Simple

The simplest possible interner would look something like this:

pub struct Interner {
    dedup: HashMap<String, Symbol>,
    backend: Vec<String>,
}

The dedup HashMap would be checked at every intern call to check if we already encountered this string before. If so, we return the corresponding Symbol directly. Otherwise, we need to store the string two times. Here the cost of intern is the cost of hashing String at best. At worst, we add to that two memory allocations.

Resolving a symbol would index through backend and return a &str. The cost is a single memory access. Easy enough.

It seems that lalrpop-intern actually goes this route. Though I'm not entirely sure if it's actually used in LALRPOP itself at the moment.

Don't Move

This design by matklad avoids redundant allocations by storing a &'static str references. How could they be static? This data structure has the additional variant that the contents of buffer are never moved on the heap, which guarantees that the references in dedup and backend will be valid for the lifetime of Interner. This means that when buffer is almost full, it is moved into buffers_full and replaced with a new String of twice the capacity.

This technique is called "buckets" and it's present in the Bucket backend of the string-interner crate, string-cache as part of the Servo project, boa-interner used in the Boa language, and the lasso single-threaded backend. Something similar is also present in artichoke, a Ruby implementation.

pub struct Interner {
    dedup: HashMap<&'static str, Symbol>,
    backend: Vec<&'static str>,
    buffer: String,
    buffers_full: Vec<String>,
}

Not only does this solve our issues with the redundant allocation, but it also reduces overall allocations because each new bucket's initial capacity is twice that of the previous one. So we enjoy periods where intern's cost is hashing plus cloning, separated by moments of (potentially) sizable allocations. Overall, the downside of this approach seems to be the long lifetime of the strings. Nickel however shouldn't suffer from this.

Fancy Raw Entries

This technique does a bit of (nightly) magic with HashMap, and looks like this:

pub struct Interner<H = RandomState> {
    dedup: HashMap<Symbol, (), ()>,
    hasher: H,
   // .. cropped
}

What happens here is that the HashMap itself has no hashing algorithm, instead, we insert Symbol with the hash of its corresponding string so that a subsequent de-duplication lookup would use the Raw Entry API in std to retrieve the Symbol using the hash of it given string. Almost like a HashSet where we get to control the hash of all entries, this experimental API however is nightly-only not available on HashSet, though it is available in the hashbrown crate.

I found that this commonly used, for example in string-interner, strena, and lasso.

One Giant Buffer

In this setup, everything is stored in one String. So since our buffer can, and will, move several times, we can't use references like in the buckets approach. Moreover, since we have to use Spans in this case, we add an extra level of indirection when resolving symbols, since we now make the extra step of looking up the end of each string. Resolving now is two memory accesses instead of one. And without some unsafe, we would have to check that our span is valid UTF-8.

pub struct Interner<H = RandomState> {
    dedup: HashMap<Symbol, (), ()>,
    hasher: H,
    ends: Vec<usize>,
    buffer: String,
}

This can be seen in the string backend of string-interner and the strena crates.

Honorable Mentions

  • ustr is a global, immutable, static (everything is leaked) string cache.
  • smol-str is a string type with O(1) cloning and is stack-allocated when less than 22 bytes. It's part of rust-analyzer.

Conclusion

The solutions outlined above come with different trade-offs in terms of allocations and speed of both resolving and interning. What's left now is to carve out a special solution for Nickel based on our specific requirements.

fuzzypixelz avatar Jul 25 '22 15:07 fuzzypixelz

There seem to be an option missing in this story:

When we point to a string, we point to a Rc<String>, so that we don't have to deal with the state to dereference. Since we are only intent on sharing at parse time, we can simply have a table String ↦ Rc<String>.

aspiwack avatar Jul 25 '22 15:07 aspiwack

When we point to a string, we point to a Rc<String>, so that we don't have to deal with the state to dereference. Since we are only intent on sharing at parse time, we can simply have a table String ↦ Rc<String>.

Simple and elegant.

We could probably replace Rc<String> with Rc<str>. This would remove the extra indirection when de-referencing, but also bump up the pointer size, as the size is stored inline. Not sure which is more desirable.

fuzzypixelz avatar Jul 26 '22 13:07 fuzzypixelz