prolog icon indicating copy to clipboard operation
prolog copied to clipboard

Idea for an efficient string representation

Open guregu opened this issue 2 years ago • 11 comments

I had a crazy idea recently. Not sure if it would work, but it could maybe be a relatively low-effort way to get faster strings.

The gist is:

  • Represent strings as []rune
  • Abuse negative rune values for special values
    • e.g. rune(-1) could mean [], rune(-5) could mean variable Var("_5")

This would let us take advantage of Go's slices for a lot of common list operations.

Potential issues I could see:

  • variable IDs are uint64 so there's not enough space to represent them with int32
  • lots of special casing in all the builtin functions probably required (maybe easier now with the nice iterators?)
  • when and where should this be applied? I think it could be used for double-quoted literals at least

Maybe a slightly less crazy representation would be:

type String struct {
   data []rune
   vars []Var
}

and store the variable ID as a local index instead of the global ID, so rune(-5) would refer to vars[5] or something like that.

BTW, this project got a shoutout in @triska's latest video! https://youtu.be/CvLsVfq6cks?t=4625 Congrats 🎉

guregu avatar Jul 23 '22 14:07 guregu

An efficient string representation will be a huge improvement! Currently, we have one Term representation for each value for simplicity. By introducing specialized representations, we have plenty of room for optimization.

We might also be able to achieve an efficient string representation by introducing 2 specialized Term types, namely Char and List.

type Char rune
var _ Term = Char(0) // Behaves as if it's a single character Atom.
type List []Term
var _ Term = List(nil) // Behaves as if it's a `./2` or `Atom("[]")`.

Let's see how other implementations implement strings.

Other than strings, BigInteger for unbounded integers is along the lines of specialized representations. https://github.com/ichiban/prolog/issues/191#issuecomment-1133813293

ichiban avatar Jul 24 '22 02:07 ichiban

I've been thinking about this. It could be better if we generalize compound terms, not lists.

type Compound interface {
	Term
	Arg(n int) Term
}

type Term interface {
	...
	PrincipalFunctor() Compound // e.g. '.'/2
}

That way, we should be able to make any structures, even string, a Compound.

type CharacterList string
var _ Compound = CharacterList("")

ichiban avatar Aug 02 '22 00:08 ichiban

Did you consider partial strings? Here is an explanation.

UWN avatar Aug 02 '22 16:08 UWN

As of v0.11.0, any Go values that implement engine.Compound are compound terms. This allows us to introduce efficient internal representations for lists and strings.

type Compound interface {
	Functor() Atom
	Arity() int
	Arg(n int) Term
}

v0.11.0 is shipped with four efficient internal representations, namely list, charList, codeList, and partial but not limited to them. You can implement one and return it from your custom predicate.

list

list is an efficient representation for a list of arbitrary terms. It's just a Go slice.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- X = [a, b, c], go_string(X, S).
X = [a,b,c],
S = 'engine.list{"a", "b", "c"}'.
type list []Term

charList

charList is an efficient representation for a list of one-char atoms. It's just a Go string.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- X = "abc", go_string(X, S).
X = [a,b,c],
S = '"abc"'.
type charList string

codeList

codeList is an efficient representation for a list of character codes. It's just a Go string.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- set_prolog_flag(double_quotes, codes).
true.
?- X = "abc", go_string(X, S).
X = [97,98,99],
S = '"abc"'.
type codeList string

partial

partial is an efficient representation for a partial list. It's a Go struct with a base Compound and a tail Term. The base Compound must be a list without any variables in its spine.

$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- append("abc", X, L), go_string(L, S).
X = X,
L = [a,b,c|X],
S = 'engine.partial{Compound:"abc", tail:X}'.
type partial struct {
	Compound
	tail Term
}

ichiban avatar Aug 28 '22 07:08 ichiban

Wow, this sounds amazing, thank you a lot for working on this!

Two small points I noticed:

  1. Ideally, when writing X = [a, b, c], the system should detect that this can be efficiently represented as a charList, exactly as if it were written as X = "abc" (with double_quotes set to chars), and use a string instead of a slice in this case.

  2. Personally, I do not think that a dedicated specialized representation for codeList is worth supporting: Application programmers can simply use lists of characters, which yields more readable programs and answers.

triska avatar Aug 28 '22 08:08 triska

How expensive is

?- Given_long_chars = [C|Rest].

You are supporting the '\0\'-character so I wonder: Is this operation constant? Does this operation produce new auxiliary heapstructures? It seems this is necessary for you.

In Scryer, partial strings are constant for above operation and do not involve any extra space.

UWN avatar Aug 28 '22 13:08 UWN

@triska Thank you!

  1. I wonder if a Prolog programmer would write [a, b, c] to mean a string in practice. They might be emphasizing its list aspect- expecting O(1) access to the elements which we haven't optimized yet. If that's the case, charList is not an efficient representation. What do you think?

  2. The cost of supporting codeList is quite cheap and it simplifies atom_codes/2 and the parser. So, I think it's still worth supporting them even if it's not beneficial to the real users.

ichiban avatar Aug 29 '22 01:08 ichiban

@UWN It's constant. A Go string is a slice of bytes and the slice points to a section of the underlying array of bytes.

In Go, a string is in effect a read-only slice of bytes. https://go.dev/blog/strings

// https://github.com/golang/go/blob/846c378b8c0cebd2d8522a5693b45ca95b018a78/src/runtime/slice.go#L15

type slice struct {
	array unsafe.Pointer
	len   int
	cap   int
}

So, Rest doesn't require a new array in the heap but just a new slice that points to the section of the same underlying array that is one character narrower than Given_long_chars.

ichiban avatar Aug 29 '22 01:08 ichiban

OK, but where does that slice struct get stored? Does it fit into a single word?

UWN avatar Aug 29 '22 06:08 UWN

It's stored in the heap in this case but it's not because it's a slice. In Go, the compiler decides whether an object gets stored in the heap or the stack and it always stores interface{} objects in the heap, which we used to abstract Compound.

The compiler will store multi-word objects like slice (3 words + padding) in the stack if the conditions are met.

ichiban avatar Aug 29 '22 12:08 ichiban

Looks good. So these details are just in the compiler and not in your code.

UWN avatar Aug 30 '22 08:08 UWN