
Hessian performance

Open Jhsmit opened this issue 4 years ago • 9 comments

I'm having some performance issues with a modestly sized model (37 parameters). Fitting takes a rather long time, and profiling revealed that 80% of the time is spent in calls to covariance_matrix / eval_hessian.

symfit_calltree.pdf

I use the standard LBFGSB minimizer, and when I replace Model with CallableModel the total fitting time drops from 200 seconds to 17 seconds. The fit result is the same.

Jhsmit avatar Feb 05 '20 13:02 Jhsmit

Hmmn, that's not great. I'm not sure I'd call 37 parameters modestly sized though ;)

  1. Could you try with GradientModel? That will only use an analytical Jacobian.
  2. The fit result should be the same for Model vs CallableModel with LBFGSB, since LBFGSB doesn't use the Hessian. What could differ is the covariance matrix; could you check that? Even there, there's probably only a difference for non-linear models with multiple components, but I'm not up to speed on the maths/stats.
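A back-of-envelope count of symbolic derivatives makes clear why GradientModel should help: an analytic Jacobian needs one first derivative per component per parameter, while an analytic Hessian needs a second derivative for every parameter pair per component. A rough sketch (the 9-component, 27-parameter figures are illustrative, taken from numbers reported later in this thread):

```python
# Count of symbolic derivatives for an analytic Jacobian vs Hessian.
n_components = 9
n_params = 27

jacobian_entries = n_components * n_params                        # 243 first derivatives
# The Hessian is symmetric, so only n(n+1)/2 second derivatives
# per component are unique.
hessian_entries = n_components * n_params * (n_params + 1) // 2   # 3402 second derivatives

print(jacobian_entries, hessian_entries)
```

The quadratic growth in the parameter count is why the symbolic Hessian dominates the runtime for larger models.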

pckroon avatar Feb 05 '20 14:02 pckroon

GradientModel takes 32 seconds. The covariance matrix is None for CallableModel as well as for GradientModel.

Actually, on closer inspection, the model has 27 parameters. Math is hard. But compared to the ~300 parameter model I would like to fit, it's still modest.

Jhsmit avatar Feb 05 '20 15:02 Jhsmit

GradientModel takes 32 seconds.

Sounds much better. Still long-ish, but I'd say acceptable. Even 200 s could be considered acceptable, depending on the application.

The covariance matrix is None for CallableModel as well as for GradientModel.

That's somewhere between strange and worrying. All Models should result in a covariance matrix: either from directly inverting the Hessian (if it's available), or from the approximation J.T.dot(J). It could be that it's not invertible for your model because you have e.g. codependent parameters.
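The codependent-parameters failure mode is easy to demonstrate with numpy: if two parameters only ever enter the model together, the corresponding Jacobian columns are proportional and J.T.dot(J) is singular, so no covariance matrix can be obtained by inversion. A minimal sketch with a made-up Jacobian:

```python
import numpy as np

# Toy Jacobian for 3 parameters at 5 data points. Columns 0 and 1 are
# identical, as happens when the model only ever uses the parameters' sum.
J = np.array([
    [1.0, 1.0, 0.5],
    [2.0, 2.0, 1.0],
    [3.0, 3.0, 0.1],
    [4.0, 4.0, 2.0],
    [5.0, 5.0, 0.3],
])

JtJ = J.T @ J                      # the J^T J approximation to the Hessian
rank = np.linalg.matrix_rank(JtJ)  # 2 < 3: singular

# Plain inversion fails, so no covariance matrix can be reported.
try:
    np.linalg.inv(JtJ)
except np.linalg.LinAlgError:
    print("singular, rank", rank)
```

Spotting a rank deficiency like this is a quick way to check whether a None covariance matrix is caused by the model parametrization rather than by the fitting code.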

PS. Do you really need all 300 parameters?

pckroon avatar Feb 06 '20 08:02 pckroon

Yes, I agree, 32 seconds is perfectly acceptable for a 27-parameter model. However, whether 200 s is acceptable depends, in my opinion, on how much time symfit adds to the fit compared to a scipy-only fit, weighed against the gain in fitting accuracy, or of course the time spent programming.

For example, in my profiling I find 4,314,861 calls to sympy.core.basic._aresame, with a total time of 100 s (33.8%; the percentages don't really add up to the 200 seconds). Similarly, there are 7308 calls to sympy.core.basic.subs (76%), which originate from 3 calls to jacobian_from_model. Why do we need 3 calls to get the Jacobian model? This seems like a one-time thing, so perhaps a cached_property of sorts could reduce the total time by half. And why does subs need to be called roughly 2500 times to generate the Jacobian model? Is this because of all permutations of derivatives for 27 parameters?
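The cached_property idea can be sketched with the standard library: the expensive computation runs once on first access and the result is reused afterwards. The class and attribute names below are hypothetical, purely to illustrate the mechanism:

```python
from functools import cached_property


class ToyModel:
    """Hypothetical sketch of caching an expensive derived object the way a
    jacobian model could be cached; names are made up for illustration."""
    build_count = 0

    @cached_property
    def jacobian_model(self):
        ToyModel.build_count += 1   # track how often the expensive build runs
        return "expensive symbolic Jacobian"


m = ToyModel()
m.jacobian_model
m.jacobian_model
m.jacobian_model
print(ToyModel.build_count)  # 1: built once, then reused
```

Note that functools.cached_property requires Python 3.8+; on older versions the same effect needs a manual instance-attribute cache.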

PS. My physical system has a total of ~800 observables, and we have a total of ~1400 'measurements'. However, not all measurements are independent, so the system is underdetermined (aside from being a total pain to fit). Also, not all parameters affect all measurements: say parameters 0-4 affect measurement 1, parameters 1-6 affect measurement 2, etc. So in a way there are many quasi-independent fitting problems with some coupling between them.

Jhsmit avatar Feb 06 '20 10:02 Jhsmit

My model's function_dict has 9 entries, and the number of parameters is 27. However, as a consequence of what I described in the PS above, not all parameters appear in each function. I suspect that because of that, the nested for loop in jacobian_from_model could be sped up quite a bit by using this information and setting those derivatives to zero without going through the effort of actually calculating them.
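The proposed shortcut can be sketched directly in sympy (this is an illustrative standalone loop, not symfit's actual jacobian_from_model code): before differentiating, check whether the parameter occurs in the component's free symbols, and emit a literal zero if it doesn't.

```python
import sympy as sp

a, b, c, x = sp.symbols('a b c x')
# Two model components, each depending on a subset of the parameters,
# mimicking a function_dict where not every parameter appears everywhere.
components = [a * x + b, sp.exp(c * x)]
params = [a, b, c]

diff_calls = 0
jacobian = []
for expr in components:
    row = []
    for p in params:
        if p not in expr.free_symbols:
            row.append(sp.Integer(0))   # known zero: skip the sympy.diff call
        else:
            diff_calls += 1
            row.append(sp.diff(expr, p))
    jacobian.append(row)

print(diff_calls)  # 3 instead of the naive 6
```

For a model where each component touches only a handful of the 27 parameters, this kind of sparsity check would skip the large majority of the symbolic differentiations.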

Jhsmit avatar Feb 06 '20 11:02 Jhsmit

Symfit was never written for performance :) I think the bigger concern is how the calculation of e.g. Hessians scales. For 5 parameters it'll still be near-instant, so no problem no matter how much slower. The Jacobian of GradientModels is cached (IIRC), so nothing to gain there. If you really want to dig down to the bottom, run a line profiler (kernprof [1]). I don't have the call graph clearly enough in mind to comment on why some methods/functions are called disproportionately often.

You can always create your own case-specific subclass of Model or GradientModel where you just hardcode Python expressions appropriate for your case. However, I'm not sure it'll be worth the effort: I don't know exactly where the bottleneck is. Of course, you can also just decide you don't care about gradients of any kind, make a CallableModel, and fit using something gradient-free such as Powell.
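For reference, the gradient-free route is available in plain scipy as well; here is a minimal sketch fitting a toy two-parameter least-squares objective with Powell (the data and objective are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data for y = a*x + b with a=2, b=1.
xdata = np.linspace(0.0, 1.0, 20)
ydata = 2.0 * xdata + 1.0

def chi2(p):
    """Sum of squared residuals; no gradient information needed."""
    a, b = p
    return np.sum((ydata - (a * xdata + b)) ** 2)

# Powell builds its search directions from function values only,
# so no Jacobian or Hessian is ever evaluated.
res = minimize(chi2, x0=[0.0, 0.0], method='Powell')
print(res.x)  # close to [2, 1]
```

This sidesteps the symbolic-derivative cost entirely, at the price of typically needing more function evaluations and giving no covariance estimate for free.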

[1] https://github.com/rkern/line_profiler

pckroon avatar Feb 06 '20 11:02 pckroon

There are some interesting points here. Indeed, for GradientModel we still evaluate the Hessian, but using the Jacobian approximation (J^T . J). This is significantly faster than calculating it fully analytically, but still involves calculating the Jacobian analytically.
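The quality of the J^T . J approximation is easy to check numerically in a toy case: for a least-squares objective whose model is linear in the parameters, the Gauss-Newton term is not an approximation at all but the exact Hessian. A sketch with numpy (the matrices are made up for illustration):

```python
import numpy as np

# chi2(p) = sum_i (y_i - f(x_i, p))^2 with f linear in p: f = X @ p.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # design matrix
y = rng.normal(size=50)        # arbitrary data

J = -X                         # d r_i / d p_j, with residuals r = y - X @ p
gauss_newton = 2 * J.T @ J     # the J^T J-based Hessian estimate
exact = 2 * X.T @ X            # analytic Hessian of chi2

print(np.allclose(gauss_newton, exact))  # True: exact for linear models
```

For non-linear models the two differ by a term proportional to the residuals times the second derivatives, which is why the approximation is good near a well-fitting optimum but can degrade for poor fits.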

I think the 3 calls to jacobian_from_model originate from 1 call to make the Jacobian, and two to make the Hessian since that is the gradient of the Jacobian. I don't think there is a way to shorten this to 2, because the first call is done with some special flags to make sure the result can be used by the second call.

It could be worth trying to optimize that code a bit, but this feature was never intended for big systems of equations like this. Computing the Hessians symbolically for such a big system of equations is always going to be expensive. If you need the covariance matrix, I think it is worth looking into how exactly the results from the optimizer are treated. Sometimes the optimizer will return a numerical estimate of the Hessian, Jacobian or the covariance matrix, which we could use. But currently I think we recalculate it because we like to be precise here at symfit. However, if the user uses a CallableModel, then I think we should recycle whatever numerical values we get from scipy as much as possible. And I don't think we are currently doing that; we return None instead. (Please correct me on this if I'm wrong.)
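As an illustration of what there is to recycle: scipy's BFGS minimizer already maintains an inverse-Hessian estimate during the fit and returns it on the result object, so a covariance estimate comes essentially for free instead of requiring a fresh Hessian evaluation. A minimal sketch on a toy quadratic objective:

```python
from scipy.optimize import minimize

# Toy quadratic chi2 with known curvature: Hessian = diag(2, 8).
def chi2(p):
    return (p[0] - 1.0) ** 2 + 4.0 * (p[1] + 2.0) ** 2

res = minimize(chi2, x0=[0.0, 0.0], method='BFGS')

# res.hess_inv is BFGS's running estimate of the inverse Hessian at the
# optimum; no extra (symbolic or numerical) Hessian evaluation is needed.
print(res.hess_inv.shape)  # (2, 2)
```

The estimate is only as good as the BFGS updates that produced it, which is presumably the precision trade-off being discussed here.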

As a separate question, is this problem not better solved using the highly experimental matrix_bounds branch?

tBuLi avatar Mar 19 '20 14:03 tBuLi

I indeed considered this a matrix_bounds jobby, but my problem also has a temporal component, with basically a matrix equation per time point. I wasn't sure how best to implement this, so I went the old-fashioned way.

Jhsmit avatar Mar 19 '20 15:03 Jhsmit

Hmmm, well, maybe you can turn it into subproblems using CallableNumericalModel? Or perhaps using IndexedVariables? Those already work, but there are no IndexedParameters, and I guess you'd need those. A lot of this kind of advanced work still remains to be done!

tBuLi avatar Mar 20 '20 11:03 tBuLi