
Speed of creating a formula

matthieugomez opened this issue on Nov 10 '19 • 14 comments

On my computer, it takes 0.3s to create a formula containing function calls in the REPL:

using StatsModels
@time @formula(y ~ x+y)
#  0.000022 seconds (16 allocations: 912 bytes)
@time @formula(y ~ log(x) + log(y))
#  0.355020 seconds (1.13 M allocations: 57.160 MiB, 6.92% gc time)

This sounds excessive. Is there a way to make this faster?

matthieugomez · Nov 10 '19

I strongly recommend timing with BenchmarkTools' @btime, which properly ignores compile time* and is resistant to noise. On Julia 1.3-rc4 I get:

julia> @btime @formula(y ~ x+y);
  452.624 ns (12 allocations: 752 bytes)

julia> @btime @formula(y ~ log(x) + log(y));
  10.219 μs (72 allocations: 4.88 KiB)

*(If compile time is the issue, then that should be highlighted, since the solutions are very different.)

oxinabox · Nov 11 '19

I know that @btime reports a smaller time, but that's not the point. My point is that it takes way too much time to create a formula in the REPL / global scope / interactive use.

Btw, I don't think the difference between @time and @btime is about compile time, because @formula(y ~ log(x) + log(y)) is slow even the second time. I think it is more about global vs. local scope.

matthieugomez · Nov 11 '19

Any ideas why global scope would be slower?

kleinschmidt · Nov 11 '19

It does seem to only happen in global/REPL scope:

julia> g() = @formula(y ~ log(x) + log(y) + log(z))
g (generic function with 1 method)

julia> @time g()
  0.366315 seconds (1.05 M allocations: 51.787 MiB, 1.78% gc time)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (x)->log(x)
  (y)->log(y)
  (z)->log(z)

julia> @time g()
  0.000323 seconds (98 allocations: 6.109 KiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (x)->log(x)
  (y)->log(y)
  (z)->log(z)

julia> @time @formula(y ~ log(x) + log(y) + log(z))
  0.342751 seconds (958.20 k allocations: 47.005 MiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (x)->log(x)
  (y)->log(y)
  (z)->log(z)

julia> @time @formula(y ~ log(x) + log(y) + log(z))
  0.354228 seconds (958.19 k allocations: 47.031 MiB, 2.29% gc time)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (x)->log(x)
  (y)->log(y)
  (z)->log(z)

kleinschmidt · Nov 11 '19

That's what I was going to answer. But I'm not sure the function g actually re-expands the macro the second time it's called?

matthieugomez · Nov 11 '19

Yeah, so maybe that (and @btime) is just measuring the time to actually run the constructors that the macro expansion returns?
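
(Side note on the question above: macros in Julia expand when the enclosing code is lowered, so the expansion inside g happens once, at definition time, and calling g() only runs the already-expanded code. A minimal sketch, using a made-up @noisy macro for illustration:)

julia> macro noisy()           # hypothetical macro: prints when it is expanded
           println("expanding!")
           :(1 + 1)
       end
@noisy (macro with 1 method)

julia> h() = @noisy()          # "expanding!" prints here, at definition time
expanding!
h (generic function with 1 method)

julia> h()                     # no re-expansion on later calls
2

julia> h()
2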

kleinschmidt · Nov 11 '19

function g()
	@time @formula(y ~ log(x) + log(y) + log(z))
	@time @formula(y ~ log(x) + log(y) + log(z))
end

still gives ~0.3s for both, so actually I don't think it's just local vs. global scope.

matthieugomez · Nov 11 '19

It's also not the macro itself:

julia> @time @macroexpand @formula(y ~ log(x) + log(y))
  0.000954 seconds (301 allocations: 17.516 KiB)
:(StatsModels.Term(:y) ~ StatsModels.capture_call(log, ((x,)->log(x)), (:x,), $(Expr(:copyast, :($(QuoteNode(:(log(x))))))), [StatsModels.Term(:x)]) + StatsModels.capture_call(log, ((y,)->log(y)), (:y,), $(Expr(:copyast, :($(QuoteNode(:(log(y))))))), [StatsModels.Term(:y)]))

kleinschmidt · Nov 11 '19

And it's not (entirely) the capture_call:

julia> @time StatsModels.capture_call(log, x->log(x), (:x, ), :(log(x)), [StatsModels.Term(:x)])
  0.011115 seconds (7.08 k allocations: 380.571 KiB)
(x)->log(x)

julia> @time StatsModels.capture_call(log, x->log(x), (:x, ), :(log(x)), [StatsModels.Term(:x)])
  0.010636 seconds (7.08 k allocations: 380.040 KiB)
(x)->log(x)

I wonder if the culprit is the overloading of + and ~ to create the formula term, plus all the type parameters on the ~~FormulaTerm~~ FunctionTerm structs? That doesn't seem likely either, since that would just be compile time, right? Unless what's happening is that every time the anonymous function is generated it's a new anonymous function, so the types don't match... which does seem to be the case in global but not in function scope:

julia> t1 = StatsModels.capture_call(log, x->log(x), (:x, ), :(log(x)), [StatsModels.Term(:x)])
(x)->log(x)

julia> t2 = StatsModels.capture_call(log, x->log(x), (:x, ), :(log(x)), [StatsModels.Term(:x)])
(x)->log(x)

julia> typeof(t1).parameters
svec(typeof(log), var"#87#88", (:x,))

julia> typeof(t2).parameters
svec(typeof(log), var"#89#90", (:x,))

julia> g()
FormulaTerm
Response:
  y(unknown)
Predictors:
  (x)->log(x)
  (y)->log(y)
  (z)->log(z)

julia> g().rhs[1]
(x)->log(x)

julia> typeof(g().rhs[1]).parameters
svec(typeof(log), var"#45#48", (:x,))

julia> typeof(g().rhs[1]).parameters
svec(typeof(log), var"#45#48", (:x,))

kleinschmidt · Nov 11 '19

So this COULD all be compiler time, because running the same macro twice in global scope gives different types for the anonymous functions, whereas calling a function that wraps the macro twice doesn't.
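
A minimal sketch of that type-identity issue (nothing StatsModels-specific, just how anonymous functions work):

julia> f1 = x -> log(x); f2 = x -> log(x);

julia> typeof(f1) == typeof(f2)   # two textual closures, two distinct types
false

Every fresh occurrence (including re-running the same line in global scope) mints a new closure type, so anything parametrized on that type gets compiled again.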

kleinschmidt · Nov 11 '19

This is making me more and more convinced that @oxinabox's proposal (in #117, I think) to do away with the anonymous-function version of the FunctionTerm and just use the typeof of the original function could be the right approach, for a number of reasons (in this case, because you'd pay the compilation price ONCE for each function, like log, instead of once EACH time any function is used).
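
Roughly, the idea would look something like this (a hypothetical sketch with made-up names, not the actual #117/#183 code):

# store the called function itself, parametrizing on its (stable) type
struct CallTerm{F,Args} <: StatsModels.AbstractTerm
    f::F          # e.g. log; typeof(log) is the same in every formula
    args::Args    # the wrapped argument terms, e.g. (Term(:x),)
end

Every formula that calls log would then produce a CallTerm{typeof(log),...}, so the compilation price for log is paid once rather than per formula.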

kleinschmidt · Nov 11 '19

I've now implemented that suggestion (#183), which doesn't create a new anonymous function but just stores the head function and the args. With that PR we get the expected behavior, where the second invocation is much faster than the first:

julia> @time @formula(y ~ a+b)
  0.006679 seconds (2.17 k allocations: 135.530 KiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  a(unknown)
  b(unknown)

julia> @time @formula(y ~ a+b)
  0.000013 seconds (11 allocations: 768 bytes)
FormulaTerm
Response:
  y(unknown)
Predictors:
  a(unknown)
  b(unknown)

julia> @time @formula(y ~ log(a)+log(b))
  0.268374 seconds (373.86 k allocations: 19.502 MiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (a)->log(a)
  (b)->log(b)

julia> @time @formula(y ~ log(a)+log(b))
  0.000051 seconds (27 allocations: 1.703 KiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (a)->log(a)
  (b)->log(b)

There's still some (likely unnecessary) compilation going on, though: invoking structurally different formulae, even ones that only call log, still incurs compilation time:

julia> @time @formula(y ~ log(a)+log(b)+log(c)+log(d))
  0.093329 seconds (126.30 k allocations: 6.969 MiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (a)->log(a)
  (b)->log(b)
  (c)->log(c)
  (d)->log(d)

julia> @time @formula(y ~ log(a)+log(b)+log(c)+log(d))
  0.000045 seconds (49 allocations: 3.297 KiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (a)->log(a)
  (b)->log(b)
  (c)->log(c)
  (d)->log(d)

julia> @time @formula(y ~ log(a)+log(b)+log(c)+log(e))
  0.000036 seconds (49 allocations: 3.297 KiB)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (a)->log(a)
  (b)->log(b)
  (c)->log(c)
  (e)->log(e)

julia> @time @formula(y ~ log(a)+log(b)+log(c)+log(d)+log(e))
  0.083267 seconds (118.19 k allocations: 6.459 MiB, 8.50% gc time)
FormulaTerm
Response:
  y(unknown)
Predictors:
  (a)->log(a)
  (b)->log(b)
  (c)->log(c)
  (d)->log(d)
  (e)->log(e)

So this may be a step in the right direction, but it's not quite there yet. I think the issue is that + combines terms into a tuple, and the methods aren't marked with @nospecialize on those tuples, so every time you get a tuple with a new number of elements or different element types, you hit the compiler again. I'll try sprinkling in some @nospecialize, but I'm not a compiler wizard, so I could use some guidance :)
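
For reference, @nospecialize is just a per-argument hint; an illustrative (non-StatsModels) example:

function describe(@nospecialize(terms::Tuple))
    # without the hint, Julia compiles a fresh specialization for every
    # concrete tuple type; with it, one generic method body is reused
    join(map(string, terms), " + ")
end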

kleinschmidt · Jul 11 '20

Sprinkling @nospecialize doesn't really seem to help, so I also got rid of the tuple-based representation of concatenated terms in favor of Vector{AbstractTerm} (whose type is the same no matter the specific types or number of the contents), but this still doesn't get rid of the first-run penalty. It helps a bit, but the penalty is still there (at least in global scope). If you want to test it yourself, it's in the dfk/vector branch (which builds on #183).
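
For illustration, the type-level motivation for that switch (Term and AbstractTerm are exported by StatsModels):

using StatsModels

typeof((Term(:a), Term(:b)))             # Tuple{Term,Term}
typeof((Term(:a), Term(:b), Term(:c)))   # a different type: new length, new specialization
typeof(AbstractTerm[Term(:a), Term(:b)]) # Vector{AbstractTerm}, one type at any length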

kleinschmidt · Jul 12 '20

In order to be a bit more systematic about this, I created a very dumb benchmarking script that just runs a bunch of formula expressions twice and prints the timings. I'm intentionally not using BenchmarkTools because the compilation cost is exactly what I'm trying to measure.
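
Something along these lines (a simplified sketch; the actual script is in the gist below):

using StatsModels

# eval each formula expression twice at top level: the first @time
# includes compilation, the second shows the steady-state cost
for ex in [:(y ~ a + b), :(y ~ log(a) + log(b)), :(y ~ exp(a) + exp(b))]
    @eval begin
        print("first:  "); @time @formula($ex)
        print("second: "); @time @formula($ex)
    end
end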

Here's the script and the results from running it on master, dfk/syntax (#183), and dfk/vector (replaces tuples-of-terms with vectors-of-terms): https://gist.github.com/kleinschmidt/f51d305d56a590030c4f8688cbf18929

The bottom line is that moving to the alternative FunctionTerm representation proposed in #183 cuts the initial compilation time by a factor of ~4 and does not require additional compilation when the same formula is created again. However, it still hits the compiler when another function is used (e.g., exp instead of log), as well as when the number of terms changes (e.g., log(a) vs. log(a) + log(b) + log(x)). Using a vector representation of the +-combined terms doesn't solve that problem completely, unfortunately, but it does help in a few seemingly random cases.

kleinschmidt · Jul 12 '20