torchquad

Elementwise numerical integration

Open sebastienwood opened this issue 2 years ago • 3 comments

Issue (how-to)

Problem Description

I need to perform numerical integration on each element of a tuple of tensors. The tuple holds the parameters of a normal distribution, and the integration domain can be determined from these as well. From my understanding, torchquad performs one integration at a time. I could repeatedly launch an integration in a for-loop fashion for each element of the tensor; however, that seems at odds with how tensor libraries are meant to be used. Is there a way for me to do this properly?
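For illustration, here is roughly the kind of per-element loop I would like to avoid (a minimal sketch with made-up parameter values, not my actual code):

import math
import torch
from torchquad import Simpson, set_up_backend

set_up_backend("torch", data_type="float32")

# Made-up normal-distribution parameters; in my case these come from a model
mu = torch.randn(100)
sigma = torch.rand(100) + 0.1
simp = Simpson()

# One integration per (mu, sigma) pair, each with its own domain -- this is the
# Python loop I would like to replace with a single vectorized call
densities = torch.stack([
    simp.integrate(
        lambda x, m=m, s=s: torch.exp(-0.5 * ((x[:, 0] - m) / s) ** 2) / (s * math.sqrt(2 * math.pi)),
        dim=1,
        N=101,
        integration_domain=[[(m - 3 * s).item(), (m + 3 * s).item()]],
    )
    for m, s in zip(mu, sigma)
])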

Here is a gist with the code for the problem described: https://gist.github.com/sebastienwood/a8aafd4b4480e12c7dae3923d89d7a4c

Let me know if I can provide more information !

sebastienwood avatar Jun 21 '22 17:06 sebastienwood

@sebastienwood could this maybe help? https://torchquad.readthedocs.io/en/main/tutorial.html#speedups-for-repeated-quadrature

gomezzz avatar Jun 23 '22 12:06 gomezzz

It is a compelling solution to achieve some speed-ups; however, I do not see how it lets me skip the Python for loop.

sebastienwood avatar Jun 23 '22 13:06 sebastienwood

Ah ok, I think I am starting to see what you mean. So, the problem is a bit outside the classic setting where you would look for speedups. From the integration method's point of view, at the moment, I could see easy speedups if:

  • The same integrand is evaluated on different domains (could reuse function evaluations)
  • Different integrands on the same domain (can reuse the integration grid)

If neither condition holds, these are in the end different integrations without overlap, so the optimization would require vectorizing: basically having one big tensor over which we compute the different integration grids, function evaluations and integrals. One can probably save some memory by rescaling a single integration grid onto the different domains, but the function evaluations still need to be computed and stored, and, finally, the integration rule applied.

Practically, the easiest way would probably be to have a version of https://github.com/esa/torchquad/blob/main/torchquad/integration/newton_cotes.py , say 'newton_cotes_multiple_integrands' or such, that handles scaling the grid to different domains and calls 'apply_composite_rule' on the relevant parts of the big tensor for all integrands. I am not 100% sure how much speedup this will give, but it should at least be some in case the overhead of creating the grid dominates the for loop.
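To sketch just the grid-rescaling part of that idea (plain torch, not actual torchquad code; the names here are made up):

import torch

def rescale_reference_grid(ref_grid, domains):
    # ref_grid: (N,) equidistant points in [0, 1]; domains: (B, 2) rows of [a, b]
    a, b = domains[:, 0:1], domains[:, 1:2]
    return a + (b - a) * ref_grid[None, :]          # (B, N): one scaled grid per domain

N = 101
ref_grid = torch.linspace(0.0, 1.0, N)
domains = torch.tensor([[0.0, 1.0], [0.0, 2.0], [-1.0, 3.0]])
grids = rescale_reference_grid(ref_grid, domains)

# Evaluate one integrand on all grids at once and apply the trapezoid rule row-wise
fx = torch.exp(-grids)                              # (B, N) function evaluations
hs = (domains[:, 1] - domains[:, 0]) / (N - 1)      # per-domain spacing
integrals = hs / 2.0 * (fx[:, :-1] + fx[:, 1:]).sum(dim=-1)   # (B,) integrals

The same pattern would carry over to higher-order composite rules; the point is only that all B integrations share one set of tensor operations instead of a Python loop.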

Currently, however, this is not implemented. Would you, @sebastienwood , be interested in implementing this by any chance?

gomezzz avatar Jun 24 '22 08:06 gomezzz

Hello, I would also be interested. Concretely, I want to be able to replicate what one can do with scipy's fixed_quad: evaluating the integrand over the same domain repeatedly, where the integrand is a numpy array, like a grid of sorts (I assume you can do it with quad as well, but getting this demo working was simpler for me because I was already familiar with fixed_quad).

import torch
import numpy as np
from torchquad import set_up_backend, Simpson
import torchquad
torchquad._deployment_test()
set_up_backend("torch", data_type="float32")  # data_type assumed
torchquad.set_log_level('ERROR')

grid_np = np.array([np.arange(15) for i in range(10)])
grid_torch = torch.from_numpy(grid_np).cuda()
flattened_grid_torch = grid_torch.flatten()

def f_torch(t, p):
    return torch.exp(-t) * p

def f_np(t):
    return np.einsum("i,jk->ijk", np.exp(-t), grid_np).transpose(1, 2, 0)

def torchquad_integral(try_reuse_sample_points):
    integrator = Simpson()
    dim = 1
    N = 37
    domain = torch.Tensor([[0, 10]])
    if try_reuse_sample_points:
        grid_points, hs, n_per_dim = integrator.calculate_grid(N, domain)
        integrands = [integrator.evaluate_integrand(lambda s: f_torch(s, p), grid_points)[0] for p in flattened_grid_torch]
        integral_value = torch.tensor([integrator.calculate_result(int_val, dim, n_per_dim, hs) for int_val in integrands])
        return integral_value.reshape(*grid_torch.shape)
    integrate_jit_compiled_parts = integrator.get_jit_compiled_integrate(
        dim, N, backend="torch"
    )
    integral_value = torch.tensor([integrate_jit_compiled_parts(lambda s: f_torch(s, p), domain) for p in flattened_grid_torch])
    return integral_value.reshape(*grid_torch.shape)

My results were as follows (timings shown in the attached screenshot).

I would be willing to make a PR if I could have some guidance and you think that it's possible to beat scipy's time for this use-case. Also would obviously be interested if I am doing something wrong here or can get a better performance somehow in some other way. Thanks!

ilan-gold avatar Dec 21 '22 09:12 ilan-gold

Apologies, accidentally threw a different integrator in there. Updated now

ilan-gold avatar Dec 21 '22 09:12 ilan-gold

@ilan-gold First, thanks for the feedback and willingness to help!

you think that it's possible to beat scipy's time for this use-case.

In general, yes, but it depends a bit on the scale of the different parameters, I think. The GPU will only benefit you if there is enough parallelism, so for small N or small grids I don't think there is much hope, but for larger ones certainly.

Also would obviously be interested if I am doing something wrong here or can get a better performance somehow in some other way.

Looks good to me! Performance-wise, the only way I could see more speedup with the current version of torchquad is if you reformulated your integration problem as a higher-dimensional one and thus treated the grid of integrands as one big integrand (the points you are evaluating would basically become the integration grid in that dimension?). But this may not be possible in all applications.

I would be willing to make a PR if I could have some guidance and you think that it's possible

That's great! So, the question is a bit how we set it up, I think. We could do something similar to what I described above

Practically, the easiest way would probably be to have a version of https://github.com/esa/torchquad/blob/main/torchquad/integration/newton_cotes.py , say 'newton_cotes_multiple_integrands' or such, that handles scaling the grid to different domains and calls 'apply_composite_rule' on the relevant parts of the big tensor for all integrands. I am not 100% sure how much speedup this will give, but it should at least be some in case the overhead of creating the grid dominates the for loop.

This would be pretty generic but speedup may be limited.

Alternatively, maybe we could have some solver for composite integrals, allowing one to integrate g(f(x), x) or something like that (maybe rather g(f(x), y) for simplicity)?

Btw., seeing that you are at TUM: part of torchquad was already developed as a master's thesis at TUM, supervised by Prof. Bungartz. He has a dual affiliation with math, so if you are interested in this, it could e.g. be an interdisciplinary project, guided research or a master's thesis. ( @FG-TUM may be able to give more details if of interest :) )

gomezzz avatar Dec 21 '22 09:12 gomezzz

@ilan-gold First, thanks for the feedback and willingness to help!

I always like me some open source :)

In general, yes, but it depends a bit on the scale of the different parameters, I think. The GPU will only benefit you if there is enough parallelism, so for small N or small grids I don't think there is much hope, but for larger ones certainly.

Just so I can make sure I understand, N in this case is a sort of "hyperparameter" for the integral to control accuracy by selecting sampling points, right? Maybe I'm not following. Then what is the "grid" for you - N in higher dimensions?

Looks good to me! Performance-wise, the only way I could see more speedup with the current version of torchquad is if you reformulated your integration problem as a higher-dimensional one and thus treated the grid of integrands as one big integrand (the points you are evaluating would basically become the integration grid in that dimension?). But this may not be possible in all applications.

I don't know here. My use-case is evaluating this integral over a pair of matrices to produce a single complex matrix, which I can then 2D inverse Fourier transform (to give you a sense of why I am interested in repeated integration over a "grid" in the integrand). So the integral needs to be computed once for each pair of (i,j) entries from the two matrices. I would need to think about whether I could reformulate this. It seems like the "bottleneck" is the fact that torchquad's integration methods, including in higher dimensions, only have a one-dimensional codomain (as opposed to scipy, which can output a matrix by applying the integration method element-wise to a matrix).

Btw., seeing that you are at TUM: part of torchquad was already developed as a master's thesis at TUM, supervised by Prof. Bungartz. He has a dual affiliation with math, so if you are interested in this, it could e.g. be an interdisciplinary project, guided research or a master's thesis. ( @FG-TUM may be able to give more details if of interest :) )

Unfortunately this is for my thesis haha. But I am happy to help.

I am curious if you know why the scipy implementation is so much faster. It seems like torchquad should be faster per integral, but the sum total of the iterations ends up slower. The reason I ask is that there seems to be some uncertainty about whether this will even be faster should we go down the route you suggest.

ilan-gold avatar Dec 21 '22 09:12 ilan-gold

Just so I can make sure I understand, N in this case is a sort of "hyperparameter" for the integral to control accuracy by selecting sampling points, right? Maybe I'm not following. Then what is the "grid" for you - N in higher dimensions?

To dig a bit deeper: torchquad makes heavy use of vectorization to attain speedups (see e.g. here for some more info). For example, in the Simpson integrator you are using, the main computation happens in this line: https://github.com/esa/torchquad/blob/bb51366fab08d566f708d552d0120bdd8ef78527/torchquad/integration/simpson.py#L37 . Notice how we operate on arbitrary-dimensional tensors; this directly uses torch's vectorized operations.

The larger the tensors in these vectorized operations are, the better your scaling will be, as you will become increasingly compute-bound.
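As a toy illustration of that pattern (not torchquad's actual code), here is the composite Simpson rule applied along the last axis of an arbitrarily shaped tensor in one vectorized expression:

import torch

def composite_simpson_last_axis(fx, h):
    # fx: function values with an odd number of points along the last axis;
    # all leading axes are processed at once by the vectorized slicing and sum.
    return (h / 3.0) * (fx[..., 0:-2:2] + 4 * fx[..., 1:-1:2] + fx[..., 2::2]).sum(dim=-1)

x = torch.linspace(0, 10, 37)
fx = torch.exp(-x) * torch.arange(1.0, 7.0).reshape(3, 2, 1)   # (3, 2, 37): six integrands at once
print(composite_simpson_last_axis(fx, x[1] - x[0]))            # ≈ (1 - e^-10) * [[1, 2], [3, 4], [5, 6]]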

I don't know here. My use-case is evaluating this integral over a pair of matrices to produce a single complex matrix, which I can then 2D inverse Fourier transform (to give you a sense of why I am interested in repeated integration over a "grid" in the integrand). So the integral needs to be computed once for each pair of (i,j) entries from the two matrices. I would need to think about whether I could reformulate this. It seems like the "bottleneck" is the fact that torchquad's integration methods, including in higher dimensions, only have a one-dimensional codomain (as opposed to scipy, which can output a matrix by applying the integration method element-wise to a matrix).

Uh, fun :D What are the dimensionalities of your matrices? So, the key point is here:

So the integral needs to be computed once for each pair of (i,j) entries from the two matrices.

This is similar to problems faced in many applications, and the key to remedying it would be to build an integrator that, for i <= K, j <= L, instead of running the above computation-heavy line K x L times, merges them into one computation.

Another way of looking at this, besides the composite integration I mentioned before, is the more computer-science perspective that we are basically doing a "batch" integration.

More concretely, let's take the trapezoid rule as an example for simplicity:

Implementation-wise it could be, e.g., trapezoid_batch.py or such. What would need to change is that here, where we apply the trapezoid rule as $\mathrm{Area}_{x_0,x_1} = \frac{dx}{2} (f_{x_0} + f_{x_1})$ and cur_dim_areas contains the function values at the points, the code currently is

  for cur_dim in range(dim):
            cur_dim_areas = (
                hs[cur_dim] / 2.0 * (cur_dim_areas[..., 0:-1] + cur_dim_areas[..., 1:])
            )
            cur_dim_areas = anp.sum(cur_dim_areas, axis=dim - cur_dim - 1)

Now, off the top of my head, practically, cur_dim_areas would need another dimension added at the front, along which we never sum:

  for cur_dim in range(dim):
            cur_dim_areas = (
                hs[cur_dim] / 2.0 * (cur_dim_areas[..., 0:-1] + cur_dim_areas[..., 1:])
            )
            # sum axis shifted by one compared to above because axis 0 is now the batch dimension
            cur_dim_areas = anp.sum(cur_dim_areas, axis=dim - cur_dim)

So the first dimension is never collapsed/summed over and instead ends up containing the values for your i, j (flattened, for simplicity).

This should give a considerable speedup because, instead of performing the operation K x L times over small tensors, you do it once over a big one. Memory consumption would of course increase.
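To make this more concrete, here is a self-contained toy version for dim=1 (plain torch, not a torchquad patch; the names are made up):

import torch

def trapezoid_batch(fx, hs, dim):
    # fx: (B, n, ..., n) function values -- one n^dim grid per batch entry;
    # hs: (dim,) grid spacing per integration dimension.
    areas = fx
    for cur_dim in range(dim):
        areas = hs[cur_dim] / 2.0 * (areas[..., 0:-1] + areas[..., 1:])
        areas = areas.sum(dim=-1)        # always collapse the last axis, never the batch axis 0
    return areas                         # (B,): one integral per batch entry

# 1D example: integrate exp(-t) * p over [0, 10] for many p at once
p = torch.arange(1.0, 151.0)                       # B = 150 parameters
t = torch.linspace(0, 10, 37)
fx = torch.exp(-t)[None, :] * p[:, None]           # (150, 37) evaluations, computed in one go
print(trapezoid_batch(fx, torch.tensor([t[1] - t[0]]), dim=1))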

Makes sense?

Unfortunately this is for my thesis haha. But I am happy to help.

Haha, okay :) . Which chair are you doing it at if you don't mind me asking? Just curious.

I am curious if you know why the scipy implementation is so much faster. It seems like torchquad should be faster per integral, but the sum total of the iterations ends up slower. The reason I ask is that there seems to be some uncertainty about whether this will even be faster should we go down the route you suggest.

I don't know what the scipy implementation does internally. But, in practice, in your example you are using a fairly low N=37 and dim=1, so there is not much to vectorize over. This would change, however, if we implement this, because then you vectorize over a K x L x 37 x dim tensor.

See also here: https://github.com/esa/torchquad#performance The continuous lines are GPU. So for ~3e5 points evaluated the GPU starts outscaling the CPU.

gomezzz avatar Dec 21 '22 10:12 gomezzz

Ok yes, this makes a lot of sense - thanks for the concrete code you point to. I will have a closer look. I also probably need a primer on numerical integration so I will probably read the wikipedia pages more closely. If you have any decent sources, I'd welcome them as well (something concise, ideally, but enough to give me the nuts and bolts needed as well).

Is what you're getting at here that the solution might be simpler than the one you proposed previously? Just curious; I'm going to need to spend some time understanding things first anyway, since I really don't know much about numerical integration (but do have some experience with GPU programming, fortunately).

Which chair are you doing it at if you don't mind me asking? Just curious.

I think mathematical biology is the answer to your question.

Uh, fun :D What are the dimensionalities of your matrices?

It depends. It could be small, like 10x10 or something, but could also be larger, like 20x2000 or something. I think an upper bound is probably 100 in one dimension and 2500 in the other, but I would need to check to be sure.

ilan-gold avatar Dec 21 '22 11:12 ilan-gold

It also sounds like you're fairly optimistic this will yield a performance gain so I think I will probably devote time to this then in the near-ish future.

ilan-gold avatar Dec 21 '22 11:12 ilan-gold

I also agree with your assessment given what little I know that the speedup would probably be large.

ilan-gold avatar Dec 21 '22 11:12 ilan-gold

*Hopefully large relative to both scipy and the current state, not just the current state...

ilan-gold avatar Dec 21 '22 11:12 ilan-gold

I also probably need a primer on numerical integration so I will probably read the wikipedia pages more closely. If you have any decent sources, I'd welcome them as well (something concise, ideally, but enough to give me the nuts and bolts needed as well).

There are some nice YouTube tutorials: https://www.youtube.com/watch?v=ZSbho8v0mGU and, for higher dimensionality, https://www.youtube.com/watch?v=1Q7TU1uWNbE

We can also have a quick call and I can walk you through the code if that would help.

Is what you're getting at here that the solution might be simpler than the one you proposed previously (https://github.com/esa/torchquad/issues/155#issuecomment-1165315234)? Just curious; I'm going to need to spend some time understanding things first anyway, since I really don't know much about numerical integration (but do have some experience with GPU programming, fortunately).

I think it's quite similar, but I formulated it a bit more concretely now (it only covers the bullet-point cases, which yours falls under). In general, if you think about it, torchquad only covers a subset of the possible integration problems one could solve, given that we only support n-cube integration domains (see https://github.com/sigma-py/quadpy for an impression of what one could implement), functions mapping to single real (or maybe complex, I forgot :thinking: ) values, etc. So there are overall many things one could implement. But, I think, the one described above is eminently useful as it would solve your problem as well as a related one on another project I am working on (https://github.com/darioizzo/geodesyNets/ , computing the gravitational potential many times for different points), and others have also asked me via email for this feature, so it's a good thing to go for.

It depends. It could be small, like 10x10 or something, but could also be larger, like 20x2000 or something. I think an upper bound is probably 100 in one dimension and 2500 in the other, but I would need to check to be sure.

Okay, yes, for these you should definitely see a huge speedup, given that instead of performing a small matrix addition 1e3 - 1e6 times, you will be performing one big one.
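As a rough illustration of why (a toy CPU comparison; the exact numbers will vary):

import time
import torch

x = torch.rand(10_000, 37)   # pretend: 10,000 integrands with 37 sample points each

t0 = time.perf_counter()
looped = torch.stack([(row[:-1] + row[1:]).sum() for row in x])   # many tiny ops in a Python loop
t1 = time.perf_counter()
batched = (x[:, :-1] + x[:, 1:]).sum(dim=-1)                      # one big vectorized op
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.4f}s  batched: {t2 - t1:.4f}s  equal: {torch.allclose(looped, batched)}")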

gomezzz avatar Dec 21 '22 12:12 gomezzz

Great, I will get to work soon then. I think a call could be helpful once I give this an initial shot and do a bit more reading. Definitely in the new year! Thanks so much!

ilan-gold avatar Dec 21 '22 13:12 ilan-gold

Perfect! Just ping me when you wanna have a call. Happy holidays until then!

gomezzz avatar Dec 21 '22 14:12 gomezzz

@gomezzz Sorry, a related question, and a bit more open-ended. One thing I want to do is take repeated versions of this grid-integral over increasing time horizons, i.e. nested domains with the same starting point but different endpoints (but still uniformly over the matrix, so again different from, yet simpler than, the OP of this issue). Do you have any suggestions? Is https://torchquad.readthedocs.io/en/main/tutorial.html#speedups-for-repeated-quadrature the best way, do you think (but once this feature is done)?

ilan-gold avatar Dec 22 '22 07:12 ilan-gold

i.e. nested domains with the same starting point but different endpoints

But the evaluated function doesn't change, right? (if it does, then you need to reevaluate most of it, I think)

but still uniformly over the matrix

So, the matrix is 10x10 corresponding to a [0,1] function interval, and then 15x10 (or 15x15 or sth) for [0,1.5] over the same function?

gomezzz avatar Dec 22 '22 08:12 gomezzz

@gomezzz No, not exactly, sorry I wasn't clear - more like a 10x10 matrix on the [0,1] interval and then a 10x10 on [0, 1.5].

ilan-gold avatar Dec 22 '22 08:12 ilan-gold

Ok, so no hope of keeping any matrix values, I guess. I think your best bet would then be something like:

integrator = Boole()

# Precompute a grid on what you expect to be the maximum domain
integration_domain = [[0, 10]]
grid_points, hs, n_per_dim = integrator.calculate_grid(N, integration_domain)

# Sample the function one time on the large domain
function_values, _ = integrator.evaluate_integrand(integrand1, grid_points)

# Now subsample the function values for the subdomain and apply any other terms you are currently interested in
subdomain_function_values = (... you need to code this :P)
subdomain_n_per_dim = (get from the subsampling above)

# Compute the integral for that subdomain
integral1 = integrator.calculate_result(subdomain_function_values, dim, subdomain_n_per_dim, hs)
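For the subsampling step, one possible sketch (my assumptions: dim = 1, the equidistant Boole grid from calculate_grid above, and a subdomain [0, t] that shares the left endpoint; t gets snapped to a grid point, and Boole needs the subsampled point count n to keep (n - 1) % 4 == 0):

t = 1.5                                                    # hypothetical subdomain endpoint inside [0, 10]
subdomain_n_per_dim = int(t * (n_per_dim - 1) / 10.0) + 1  # full-grid points that fall inside [0, t]
subdomain_n_per_dim -= (subdomain_n_per_dim - 1) % 4       # respect Boole's point-count constraint
subdomain_function_values = function_values[:subdomain_n_per_dim]
# hs stays the same because the grid spacing is unchanged
integral1 = integrator.calculate_result(subdomain_function_values, 1, subdomain_n_per_dim, hs)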

You could get more speedup if you store more parts of the computations. It also depends on where your computational bottleneck is: whether the function evaluations are expensive, or the calculate_result at the end, etc.

Helps?

gomezzz avatar Dec 22 '22 09:12 gomezzz

Right, the integrand should be fairly inexpensive (just some exponential sum/quotients) so this looks like the move. Thanks!

ilan-gold avatar Dec 22 '22 09:12 ilan-gold

I got the Newton-Cotes version to work, it seems. I will need to add some tests to be sure, but it looked good. It seems like there's a floor on how long things take - it depends on the size of the grid and the size of N, but for N=37 the break-even point with scipy (at least for my integral) was around 40x40.

ilan-gold avatar Dec 22 '22 17:12 ilan-gold

This lines up right with what you said previously as well, @gomezzz.

ilan-gold avatar Dec 22 '22 17:12 ilan-gold

You could get more speedup if you store more parts of the computations. It also depends on where your computational bottleneck is: whether the function evaluations are expensive, or the calculate_result at the end, etc.

Is there any rule of thumb here, @gomezzz? Like, when would evaluate_integrand dominate calculate_result? If the integrand is really complicated and compute-intensive? It seems like calculate_result's complexity is integrand-independent. Sorry for all the questions!

ilan-gold avatar Dec 22 '22 17:12 ilan-gold

I got the Newton-Cotes version to work, it seems. I will need to add some tests to be sure, but it looked good. It seems like there's a floor on how long things take - it depends on the size of the grid and the size of N, but for N=37 the break-even point with scipy (at least for my integral) was around 40x40.

Cool! Very nice! Feel free to open a (draft) PR as soon as you want any feedback. Maybe there are further optimization opportunities. :v:

In general, N=37 is also quite small; I'm not sure how much accuracy you need? You could also try whether Trapezoid, with a larger N chosen so that you get errors similar to Simpson's, gives you better scaling.
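As a quick way to check that trade-off, a small sketch comparing the two rules on a simple exponential integrand (analytic value 1 - e^-10; the integrand shape conventions here are my assumptions):

import torch
from torchquad import Trapezoid, Simpson, set_up_backend

set_up_backend("torch", data_type="float64")
exact = 1 - torch.exp(torch.tensor(-10.0, dtype=torch.float64))
f = lambda t: torch.exp(-t[:, 0])          # torchquad passes points of shape (N, dim)

for N in (37, 101, 401):                   # odd N keeps Simpson happy in 1D
    tp = Trapezoid().integrate(f, dim=1, N=N, integration_domain=[[0, 10]])
    sp = Simpson().integrate(f, dim=1, N=N, integration_domain=[[0, 10]])
    print(N, float(abs(tp - exact)), float(abs(sp - exact)))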

gomezzz avatar Dec 23 '22 09:12 gomezzz

Is there any rule of thumb here, @gomezzz? Like, when would evaluate_integrand dominate calculate_result? If the integrand is really complicated and compute-intensive? It seems like calculate_result's complexity is integrand-independent. Sorry for all the questions!

So, it all depends on devices and problem complexity.

calculate_grid(N, integration_domain) is mostly about allocating the memory and is cheap in general.

integrator.evaluate_integrand(integrand1, grid_points) costs N * C_f for N points and function cost C_f. It is parallelized perfectly over the function calls (they all happen at once).

calculate_result(subdomain_function_values, dim, subdomain_n_per_dim, hs) computes a bunch of sums and multiplications over the tensors holding the results for each dimension. A simple upper bound on the cost is N_dim * M^(N_dim) for N_dim dimensions and an n-dimensional tensor of size M^(N_dim). (It is less because we collapse the dims, but I am too lazy to compute something more accurate.)

And now the devices come in: on CPU the vectorization doesn't matter as much here, I think. On GPU, if your device is quick, it may be memory-bound for small problems. If your integrand is something expensive containing sins, sqrts, etc., the evaluations may dominate.

gomezzz avatar Dec 23 '22 10:12 gomezzz

For a deeper dive into the topic, also see the mentioned master thesis: https://mediatum.ub.tum.de/604993?query=Hofmeier&show_id=1694577 :)

gomezzz avatar Dec 23 '22 10:12 gomezzz

Cool! Very nice! Feel free to open a (draft) PR as soon as you want any feedback. Maybe there are further optimization opportunities. :v:

Yes, I will definitely need your opinion on some API stuff. I think I want to do the non-deterministic ones as well and then open one PR for both.

In general, N=37 is also quite small; I'm not sure how much accuracy you need? You could also try whether Trapezoid, with a larger N chosen so that you get errors similar to Simpson's, gives you better scaling.

Hmmm, yes, I think I will have to look closer. I don't think accuracy is crazy important, certainly not to the point of needing it to be extreme, especially since the function is basically constant over most of the integration bounds. Thanks for the feedback!

ilan-gold avatar Dec 23 '22 10:12 ilan-gold

Also, yes, it's possible my integrand is expensive since it involves a sum/quotient of complex exponentials (which maybe are evaluated as sin/cos? I really don't know). So this sort of caching could be a big win, in theory.

ilan-gold avatar Dec 23 '22 10:12 ilan-gold