prolog
prolog copied to clipboard
Idea for an efficient string representation
I had a crazy idea recently. Not sure if it would work, but it could maybe be a relatively low-effort way to get faster strings.
The gist is:
- Represent strings as
[]rune
- Abuse negative rune values for special values
- e.g.
rune(-1)
could mean[]
,rune(-5)
could mean variableVar("_5")
- e.g.
This would let us take advantage of Go's slices for a lot of common list operations.
Potential issues I could see:
- variable IDs are uint64 so there's not enough space to represent them with int32
- lots of special casing in all the builtin functions probably required (maybe easier now with the nice iterators?)
- when and where should this be applied? I think it could be used for double-quoted literals at least
Maybe a slightly less crazy representation would be:
type String struct {
data []rune
vars []Var
}
and store the variable ID as a local index instead of the global ID, so rune(-5)
would refer to vars[5]
or something like that.
BTW, this project got a shoutout in @triska's latest video! https://youtu.be/CvLsVfq6cks?t=4625 Congrats 🎉
An efficient string representation will be a huge improvement! Currently, we have one Term
representation for each value for simplicity. By introducing specialized representations, we have plenty of room for optimization.
We might also be able to achieve an efficient string representation by introducing 2 specialized Term
types, namely Char
and List
.
type Char rune
var _ Term = Char(0) // Behaves as if it's a single character Atom.
type List []Term
var _ Term = List(nil) // Behaves as if it's a `./2` or `Atom("[]")`.
Let's see how other implementations implement strings.
Other than strings, BigInteger
for unbounded integers is along the lines of specialized representations. https://github.com/ichiban/prolog/issues/191#issuecomment-1133813293
I've been thinking about this. It could be better if we generalize compound terms, not lists.
type Compound interface {
Term
Arg(n int) Term
}
type Term interface {
...
PrincipalFunctor() Compound // e.g. '.'/2
}
That way, we should be able to make any structures, even string
, a Compound
.
type CharacterList string
var _ Compound = CharacterList("")
Did you consider partial strings? Here is an explanation.
As of v0.11.0
, any Go values that implement engine.Compound
are compound terms.
This allows us to introduce efficient internal representations for lists and strings.
type Compound interface {
Functor() Atom
Arity() int
Arg(n int) Term
}
v0.11.0
is shipped with four efficient internal representations, namely list
, charList
, codeList
, and partial
but not limited to them. You can implement one and return it from your custom predicate.
list
list
is an efficient representation for a list of arbitrary terms. It's just a Go slice.
$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- X = [a, b, c], go_string(X, S).
X = [a,b,c],
S = 'engine.list{"a", "b", "c"}'.
type list []Term
charList
charList
is an efficient representation for a list of one-char atoms. It's just a Go string.
$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- X = "abc", go_string(X, S).
X = [a,b,c],
S = '"abc"'.
type charList string
codeList
codeList
is an efficient representation for a list of character codes. It's just a Go string.
$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- set_prolog_flag(double_quotes, codes).
true.
?- X = "abc", go_string(X, S).
X = [97,98,99],
S = '"abc"'.
type codeList string
partial
partial
is an efficient representation for a partial list. It's a Go struct with a base Compound
and a tail Term
. The base Compound
must be a list without any variables in its spine.
$ $(go env GOPATH)/bin/1pl
Top level for ichiban/prolog v0.11.0
This is for testing purposes only!
See https://github.com/ichiban/prolog for more details.
Type Ctrl-C or 'halt.' to exit.
?- append("abc", X, L), go_string(L, S).
X = X,
L = [a,b,c|X],
S = 'engine.partial{Compound:"abc", tail:X}'.
type partial struct {
Compound
tail Term
}
Wow, this sounds amazing, thank you a lot for working on this!
Two small points I noticed:
-
Ideally, when writing
X = [a, b, c]
, the system should detect that this can be efficiently represented as acharList
, exactly as if it were written asX = "abc"
(withdouble_quotes
set tochars
), and use a string instead of a slice in this case. -
Personally, I do not think that a dedicated specialized representation for
codeList
is worth supporting: Application programmers can simply use lists of characters, which yields more readable programs and answers.
How expensive is
?- Given_long_chars = [C|Rest].
You are supporting the '\0\'
-character so I wonder: Is this operation constant? Does this operation produce new auxiliary heapstructures? It seems this is necessary for you.
In Scryer, partial strings are constant for above operation and do not involve any extra space.
@triska Thank you!
-
I wonder if a Prolog programmer would write
[a, b, c]
to mean a string in practice. They might be emphasizing itslist
aspect- expecting O(1) access to the elements which we haven't optimized yet. If that's the case,charList
is not an efficient representation. What do you think? -
The cost of supporting
codeList
is quite cheap and it simplifiesatom_codes/2
and the parser. So, I think it's still worth supporting them even if it's not beneficial to the real users.
@UWN It's constant. A Go string is a slice of bytes and the slice points to a section of the underlying array of bytes.
In Go, a string is in effect a read-only slice of bytes. https://go.dev/blog/strings
// https://github.com/golang/go/blob/846c378b8c0cebd2d8522a5693b45ca95b018a78/src/runtime/slice.go#L15
type slice struct {
array unsafe.Pointer
len int
cap int
}
So, Rest
doesn't require a new array in the heap but just a new slice that points to the section of the same underlying array that is one character narrower than Given_long_chars
.
OK, but where does that slice struct
get stored? Does it fit into a single word?
It's stored in the heap in this case but it's not because it's a slice. In Go, the compiler decides whether an object gets stored in the heap or the stack and it always stores interface{}
objects in the heap, which we used to abstract Compound
.
The compiler will store multi-word objects like slice (3 words + padding) in the stack if the conditions are met.
Looks good. So these details are just in the compiler and not in your code.