[SR-9323] KeyPaths are quite slow
| Previous ID | SR-9323 |
| Radar | rdar://problem/52529589 |
| Original Reporter | @weissi |
| Type | Bug |
Attachment: Download
Additional Detail from JIRA
| Votes | 14 |
| Component/s | Compiler |
| Labels | Bug, Performance |
| Assignee | None |
| Priority | Medium |
md5: 8a7995ec006b266bd138d8a19ef4ebed
is duplicated by:
- SR-11983 KeyPath performance below expectation compared to alternative (see test)
Issue Description:
Description
Naively, I always assumed key paths are quite fast. In my mental model they were basically a tuple of two function pointers (a getter and a setter, both non-capturing, so effectively @convention(thin)) that just get handed around and applied. At the very least, I assumed there would be some sort of fast path when the compiler has enough information at compile time.
To see how fast or slow key paths actually are, I wrote a quick benchmark which just increments a struct member (an Int) 100 million times. To establish the theoretical maximum, I compared that to a version that doesn't use key paths at all and just does `thing.x += 1` in a loop. (I checked the assembly: the compiler does emit every single increment, for overflow checking, but it unrolls the loop five times.) Anyway, the result is:
time taken for 100M direct increments (in s)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.02154 0.02355 0.02358 0.02613 0.02450 0.03828
Now I benchmarked that against key paths:
var thing = SomeStruct()
for _ in 0..<100_000_000 {
    thing[keyPath: \SomeStruct.x] += 1
}
and the result is:
time taken for 100M keypath increments (in s)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.691 4.698 4.722 4.738 4.778 4.821
which is 200x the runtime of the original one. I used Apple Swift version 5.0-dev (LLVM cbe8d5e28f, Clang 3452631569, Swift 201dcba300), before that it was even slower.
Then, to understand why key paths are so slow, I created yet another benchmark that goes through a pretty naive approximation of them:
public struct SomeStruct {
    public var x: Int = 0
    public init() {}
}

public struct FakeWritableKeyPath<Thing, Element> {
    public let writeIt: (inout Thing, Element) -> Void
    public let readIt: (Thing) -> Element
}

extension SomeStruct {
    public static let fakeKeyPathsForX: FakeWritableKeyPath<SomeStruct, Int> =
        FakeWritableKeyPath(writeIt: { thing, newValue in thing.x = newValue },
                            readIt: { thing in return thing.x })
}
and the loop was:
for _ in 0..<100_000_000 {
    let read = SomeStruct.fakeKeyPathsForX.readIt
    let write = SomeStruct.fakeKeyPathsForX.writeIt
    write(&thing, read(thing) + 1)
}
To my absolute surprise, that yielded better performance ("only" 47x slower):
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.073 1.091 1.103 1.116 1.131 1.217
To finish off, I benchmarked against what I thought would roughly approximate the implementation, at least in a fast path (just handing two function pointers around):
public struct FakeCheatedWritableKeyPath {
    public let writeIt: @convention(thin) (inout SomeStruct, Int) -> Void
    public let readIt: @convention(thin) (SomeStruct) -> Int
}

extension SomeStruct {
    public static let fakeCheatedKeyPathsForX: FakeCheatedWritableKeyPath =
        FakeCheatedWritableKeyPath(writeIt: { thing, newValue in thing.x = newValue },
                                   readIt: { thing in return thing.x })
}
with the loop just like above. That started to yield reasonable performance:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2298 0.2329 0.2351 0.2362 0.2401 0.2440
which is only about 10x slower than the direct additions. I think that's reasonable, because INC is a single processor instruction, which is naturally a bit faster than 'function call to read the value, increment, function call to write the value'; the direct version also benefits from loop unrolling and so on.
Notes
Compiler
Apple Swift version 5.0-dev (LLVM cbe8d5e28f, Clang 3452631569, Swift 201dcba300)
Target: x86_64-apple-darwin18.2.0
OS
macOS 10.14 on
Model Identifier: MacBookPro15,1
Processor Name: Intel Core i9
Processor Speed: 2.9 GHz
Observations
I found `%5 = keypath $WritableKeyPath<SomeStruct, Int>, (root $SomeStruct; stored_property #SomeStruct.x : $Int) // users: %20, %8` in the SIL, which suggests the compiler actually has quite some understanding of key paths, so maybe there's hope they will soon be faster? 😉
Code
The structs/fake key paths were defined in a module Foo, and all the calls were made from another module TestApp in order to avoid any inlining effects. But even with everything in one module, the slow versions didn't get any faster.
The whole code is attached (the .tar.gz can be run with just `swift run -c release`), but it is also included here (note that everything below `// MODULE: Foo` is in another module):
import Foundation
import Foo

public func measure(_ fn: () throws -> Int) rethrows -> [TimeInterval] {
    func measureOne(_ fn: () throws -> Int) rethrows -> (TimeInterval, Int) {
        let start = Date()
        let v = try fn()
        let end = Date()
        return (end.timeIntervalSince(start), v)
    }

    let firstRes = try measureOne(fn).1 /* pre-heat and throw away */
    var measurements = Array(repeating: 0.0, count: 10)
    for i in 0..<10 {
        let timeAndRes = try measureOne(fn)
        measurements[i] = timeAndRes.0
        precondition(firstRes == timeAndRes.1)
    }
    print(firstRes)
    return measurements
}

public func measureAndPrint(desc: String, fn: () throws -> Int) rethrows -> Void {
    print("measuring: \(desc): ", terminator: "")
    let measurements = try measure(fn)
    print(measurements.reduce("") { $0 + "\($1), " })
}
measureAndPrint(desc: "direct") {
    var thing = SomeStruct()
    for _ in 0..<100_000_000 {
        thing.x += 1
    }
    return thing.x
}

measureAndPrint(desc: "fake key paths") {
    var thing = SomeStruct()
    for _ in 0..<100_000_000 {
        let read = SomeStruct.fakeKeyPathsForX.readIt
        let write = SomeStruct.fakeKeyPathsForX.writeIt
        write(&thing, read(thing) + 1)
    }
    return thing.x
}

measureAndPrint(desc: "totally cheated fake key paths") {
    var thing = SomeStruct()
    for _ in 0..<100_000_000 {
        let read = SomeStruct.fakeCheatedKeyPathsForX.readIt
        let write = SomeStruct.fakeCheatedKeyPathsForX.writeIt
        write(&thing, read(thing) + 1)
    }
    return thing.x
}

measureAndPrint(desc: "real key paths") {
    var thing = SomeStruct()
    for _ in 0..<100_000_000 {
        thing[keyPath: \SomeStruct.x] += 1
    }
    return thing.x
}
// MODULE: Foo
public struct SomeStruct {
    public var x: Int = 0
    public init() {}
}

// compiler generated

// fake
public struct FakeWritableKeyPath<Thing, Element> {
    public let writeIt: (inout Thing, Element) -> Void
    public let readIt: (Thing) -> Element
}

extension SomeStruct {
    public static let fakeKeyPathsForX: FakeWritableKeyPath<SomeStruct, Int> =
        FakeWritableKeyPath(writeIt: { thing, newValue in thing.x = newValue },
                            readIt: { thing in return thing.x })
}

// cheat
public struct FakeCheatedWritableKeyPath {
    public let writeIt: @convention(thin) (inout SomeStruct, Int) -> Void
    public let readIt: @convention(thin) (SomeStruct) -> Int
}

extension SomeStruct {
    public static let fakeCheatedKeyPathsForX: FakeCheatedWritableKeyPath =
        FakeCheatedWritableKeyPath(writeIt: { thing, newValue in thing.x = newValue },
                                   readIt: { thing in return thing.x })
}
cc @jckarter
Yeah, the infrastructure is there to teach the compiler about key paths and optimize them, but nothing's been done yet to take advantage of it. The generic runtime implementation also has not been investigated for performance at all. "Make it correct, then make it fast," as they say. One thing you can do is hoist the key path reference out of the loop; the compiler isn't smart enough to do that itself yet, and while the runtime implementation does cache the result of instantiating a key path object, it still does a runtime call to get the cached value. This cuts the overhead down by about 15% compared to forming the key path inside the loop, for me:
func foo() -> SomeStruct {
    var thing = SomeStruct()
    let kp = \SomeStruct.x
    for _ in 0..<100_000_000 {
        thing[keyPath: kp] += 1
    }
    return thing
}
(If a loop that does `x += 1` 100,000,000 times isn't getting folded into a single `x += 100_000_000`, that's also a bug; it would be valid to do so even in the face of overflow checks.)
@jckarter I filed SR-10166 for the optimizer issue and was able to reproduce it independently of this KeyPath issue.
Thanks!
@swift-ci create
We are also experiencing delays. Is there a way to raise the priority of this?
The compiler already optimizes constant key paths such as `thing[keyPath: \SomeStruct.x] += 1`.
@doozMen Can you provide a test case which shows the problem with a recent compiler?
Hi @eeckstein, we typically see the problem surface when key paths are passed into generic contexts. This little benchmark shows that key paths are ~7x slower than closures (on my computer):
import Foundation

func get<Root, Value>(root: Root, keyPath: KeyPath<Root, Value>) -> Value {
    root[keyPath: keyPath]
}

func get<Root, Value>(root: Root, getter: (Root) -> Value) -> Value {
    getter(root)
}

struct User {
    var id = 0
    var name = "Blob"
    var isAdmin = true
}

let user = User()

var start = Date()
let keyPath = \User.name
for _ in 1...1_000_000 {
    precondition(get(root: user, keyPath: keyPath) == "Blob")
}
let keyPathTime = Date().timeIntervalSince(start)
print("Key path time", keyPathTime)

start = Date()
let closure: (User) -> String = { $0.name }
for _ in 1...1_000_000 {
    precondition(get(root: user, getter: closure) == "Blob")
}
let closureTime = Date().timeIntervalSince(start)
print("Closure time", closureTime)

print("Key path / closure factor", keyPathTime / closureTime)
I get the following output:
Key path time 0.038166046142578125
Closure time 0.005627989768981934
Key path / closure factor 6.781470420029231
We have noticed this slowdown very explicitly in many of our libraries that leverage key paths in generic algorithms.
There's an open PR (https://github.com/swiftlang/swift/pull/70451) from @Azoy that may fix some of these problems, but I'm not sure of its status.
The problem here is that the key path is a global variable. If you make it a local variable or a literal, or a global computed variable (`var keyPath: KeyPath<User, String> { \User.name }`), then the optimizer can "constant fold" it.
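To make the distinction concrete, here is a minimal sketch (my own illustration, not taken from the benchmark above) of the three spellings: a stored global, which currently defeats folding, versus a computed global or a local literal, which the optimizer can see through:

```swift
struct User {
    var name = "Blob"
}

// Stored global: the key path object is instantiated lazily at first use
// and read back through a global access, which currently blocks folding.
let storedKeyPath = \User.name

// Computed global: the accessor body is visible at each use site, so the
// optimizer sees the key path literal and can fold it to a direct access.
var computedKeyPath: KeyPath<User, String> { \User.name }

func read(_ user: User) -> String {
    // A local literal can likewise be folded.
    let localKeyPath = \User.name
    return user[keyPath: localKeyPath]
}
```

Whether folding actually fires depends on the optimization mode; this only illustrates which spellings give the optimizer a chance.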
Hi @eeckstein, fair enough! Here's another benchmark that shows the problem without involving globals. In this one I pass a key path into a generic context, then escape it to perform some work with it:
import Foundation

func benchmark<Root, Value>(root: Root, keyPath: KeyPath<Root, Value>) async -> TimeInterval {
    await Task {
        let start = Date()
        for _ in 1...1_000_000 {
            _ = root[keyPath: keyPath]
        }
        return Date().timeIntervalSince(start)
    }.value
}

func benchmark<Root, Value>(root: Root, getter: @escaping (Root) -> Value) async -> TimeInterval {
    await Task {
        let start = Date()
        for _ in 1...1_000_000 {
            _ = getter(root)
        }
        return Date().timeIntervalSince(start)
    }.value
}

struct User {
    var id = 0
    var name = "Blob"
    var isAdmin = true
}

let user = User()

let keyPathTime = await benchmark(root: user, keyPath: \.name)
print("Key path time", keyPathTime)

let closureTime = await benchmark(root: user, getter: { $0.name })
print("Closure time", closureTime)

print("Key path / closure factor", keyPathTime / closureTime)
When I run this I get worse results than my previous benchmark:
Key path time 0.03435099124908447
Closure time 0.0030590295791625977
Key path / closure factor 11.229375316628346
And if the data type being traversed by the key path is more complex (like a nested array+dictionary+etc…), then the benchmarks get much, much worse:
let complexDataType = [[true: [user]]]
let keyPathTime = await benchmark(root: complexDataType, keyPath: \.[0][true]![0].name)
print("Key path time", keyPathTime)
let closureTime = await benchmark(root: complexDataType, getter: { $0[0][true]![0].name })
print("Closure time", closureTime)
print("Key path / closure factor", keyPathTime / closureTime)
In this situation key paths are 100x slower:
Key path time 1.4945909976959229
Closure time 0.013473033905029297
Key path / closure factor 110.93202972925145
FWIW, there have been a number of Swift forum posts over the years bringing up the performance problems of key paths (1, 2, 3, 4, 5, and more), and while it does seem that there have been some improvements to the performance, there is still a big gap. I thought this was a well known problem and was hopeful that #70451 might bring some much needed relief!
There are two aspects to this problem:
1. The compiler can try to constant fold keypaths. In this case there shouldn't be any overhead. We can improve the compiler to e.g. constant fold global keypaths (like in your first benchmark), or specialize the Task-init closure with a constant keypath (in your second benchmark). But there will always be cases where constant folding will fail for some reason.
2. If a keypath cannot be constant folded, the keypath runtime machinery is used. Maybe we can improve the performance of this as well. @jckarter knows more details on this than I do.
Hi @eeckstein, thank you very much for the responses and info.
1. We can improve the compiler to e.g. constant fold global keypaths (like in your first benchmark), or specialize the Task-init closure with a constant keypath (in your second benchmark). But there will always be cases where constant folding will fail for some reason.
I think one of the places where this folding failure can become particularly pernicious is @dynamicMemberLookup. This is a situation where one thinks they are doing simple dot-chaining syntax, but secretly they are invoking the key path machinery and incurring the performance cost. And DML is quite pervasive in Swift frameworks, such as SwiftUI.
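For context on why DML is affected: a @dynamicMemberLookup type with a key-path-based subscript makes the compiler rewrite ordinary dot syntax into key path applications, so every member access pays the key path cost unless it gets folded. A minimal sketch (the `Box` type is hypothetical, purely for illustration):

```swift
@dynamicMemberLookup
struct Box<Wrapped> {
    var wrapped: Wrapped

    // Dot syntax on Box is rewritten by the compiler into calls to this
    // subscript, passing a key path for the accessed member (SE-0252).
    subscript<Value>(dynamicMember keyPath: KeyPath<Wrapped, Value>) -> Value {
        wrapped[keyPath: keyPath]
    }
}

struct User {
    var name = "Blob"
}

let box = Box(wrapped: User())
// Reads like a plain member access, but compiles to
// box[dynamicMember: \User.name] — a key path application.
let name = box.name
precondition(name == "Blob")
```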
While trying to cook up a benchmark for the performance problem with DML, I came across a much bigger performance problem that may be a folding problem. It seems that subscript key paths have a pretty serious performance issue:
import Foundation

extension Int {
    subscript(add value: Int) -> Int {
        self + value
    }
}

func directAccessBenchmark() -> TimeInterval {
    let number = 1
    let start = Date()
    for _ in 1...1_000_000 {
        precondition(number + 1 == 2)
    }
    return Date().timeIntervalSince(start)
}

func keyPathLookupBenchmark() -> TimeInterval {
    let number = 1
    let start = Date()
    for _ in 1...1_000_000 {
        precondition(number[keyPath: \.[add: 1]] == 2)
    }
    return Date().timeIntervalSince(start)
}

let addOne: (Int) -> Int = { $0 + 1 }
func closureBenchmark() -> TimeInterval {
    let number = 1
    let start = Date()
    for _ in 1...1_000_000 {
        precondition(addOne(number) == 2)
    }
    return Date().timeIntervalSince(start)
}

let directAccessTime = directAccessBenchmark()
print("Direct access", directAccessTime)

let keyPathLookupTime = keyPathLookupBenchmark()
print("Key path access", keyPathLookupTime)

let closureTime = closureBenchmark()
print("Closure", closureTime)
On my computer this prints out:
Direct access 0.0
Key path access 0.6253789663314819
Closure 0.0010019540786743164
And if you were to run this benchmark with an array of integers and the `\.[0]` key path, it gets worse. And it starts to get really bad with nested arrays.
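For completeness, a sketch of that array variant (my own approximation, mirroring the subscript benchmark above; timings will of course vary by machine):

```swift
import Foundation

func arrayKeyPathBenchmark() -> TimeInterval {
    let numbers = [1, 2, 3]
    // Subscript key path into the first element (subscript key path
    // components were part of SE-0161).
    let keyPath = \[Int].[0]
    let start = Date()
    for _ in 1...1_000_000 {
        precondition(numbers[keyPath: keyPath] == 1)
    }
    return Date().timeIntervalSince(start)
}

print("Array key path access", arrayKeyPathBenchmark())
```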
And beyond this subscript key path problem, there are more benchmarks in this forum post that show how some simple generic algorithms perform 6-7x slower when using key paths over closures.
2. If a keypath cannot be constant folded, the keypath runtime machinery is used. Maybe we can improve the performance of this as well. @jckarter knows more details on this than I do.
Yes, my impression is that this is what #70451 aims to improve.