botorch
[Feature Request] Hypervolume computation with BoTorch is thousands of times slower than with moocore
🚀 Feature Request
Motivation
https://botorch.org/api/utils.html#botorch.utils.multi_objective.hypervolume.Hypervolume
says: "TODO: write this in C++ for faster looping."
There is a C (C++ compatible) implementation available here: https://github.com/MLopez-Ibanez/eaf/blob/master/src/mo-tools/hv.c
Yeah, that would be nice. Using box decompositions (`DominatedPartitioning`) to compute hypervolumes seems to work more efficiently than the dimension-sweep algorithm, empirically. Both would likely benefit from C++.
We'd also need this to be auto-differentiable in PyTorch, though. So the idea was to (maybe) write this using the PyTorch C++ frontend as an extension.
The current implementation is not differentiable and is simply used for evaluation purposes.
Oh, I see, it's a different implementation. My bad.
Yeah, that would be nice. Using box decompositions (`DominatedPartitioning`) to compute hypervolumes seems to work more efficiently than the dimension-sweep algorithm, empirically. Both would likely benefit from C++.
Maybe. I believe the state of the art is pretty well summarised here: https://dl.acm.org/doi/fullHtml/10.1145/3453474#tab1 It recommends the dedicated dimension-sweep algorithms for 2D and 3D; box decomposition is only recommended for 5D and 6D.
In any case, I would be surprised if the Python implementation of either algorithm can get anywhere close to the C version of dimension-sweep for 2D and 3D.
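To give a sense of why the 2D case is so cheap, here is a rough Python sketch of the 2D dimension-sweep (minimization convention, points assumed mutually nondominated and componentwise better than the reference point). This is just an illustration of the idea, not the moocore implementation:

```python
# Illustrative sketch only: 2D hypervolume by dimension-sweep (minimization).
# Assumes `points` are mutually nondominated and componentwise < `ref`.
def hv2d_min(points, ref):
    pts = sorted(points, key=lambda p: p[0])  # f1 ascending => f2 descending
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # horizontal slab between f2 levels
        prev_f2 = f2
    return hv

# hv2d_min([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], ref=(4.0, 4.0)) == 6.0
```

After sorting, the whole computation is a single linear pass, which is why a compiled dimension-sweep is so hard to beat in 2D and 3D.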
It would be nice to have a highly-optimized C/C++ library of fundamental algorithms (not only hypervolume, but dominance filtering, nondominated sorting, other metrics like IGD+ and epsilon, etc) for multi-objective optimization that can be used in R and Python.
There is a C (C++ compatible) implementation available here: https://github.com/MLopez-Ibanez/eaf/blob/master/src/mo-tools/hv.c
By the way, I'm happy to modify the C code above to make it easier to integrate into BoTorch.
It would be nice to have a highly-optimized C/C++ library of fundamental algorithms (not only hypervolume, but dominance filtering, nondominated sorting, other metrics like IGD+ and epsilon, etc) for multi-objective optimization that can be used in R and Python.
Indeed, that would be nice.
By the way, I'm happy to modify the C code above to make it easier to integrate into BoTorch.
BoTorch currently doesn't contain any compiled code, which simplifies building and packaging a lot, so I'd only want to add this to BoTorch if it brings substantial benefits. That said, if this lived in a separate library that is well implemented and maintained, I'd be happy to add it as a (potentially optional) dependency. This might make more sense especially if the scope of this work is broader and potentially includes more general Python and R bindings.
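Just to sketch what an optional dependency could look like (purely illustrative: the `hv` helper below is hypothetical, and I'm assuming the `moocore.hypervolume(data, ref=..., maximise=...)` signature from its docs):

```python
# Hypothetical sketch of an optional C-backed hypervolume path (not BoTorch code).
import torch
from botorch.utils.multi_objective.hypervolume import Hypervolume

try:
    import moocore  # optional compiled backend
except ImportError:
    moocore = None


def hv(pareto_Y: torch.Tensor, ref_point: torch.Tensor) -> float:
    """Non-differentiable hypervolume for evaluation (maximization convention)."""
    if moocore is not None:
        return float(
            moocore.hypervolume(
                pareto_Y.cpu().numpy(), ref=ref_point.cpu().numpy(), maximise=True
            )
        )
    return Hypervolume(ref_point=ref_point).compute(pareto_Y)
```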
I did some quick experiments comparing the BoTorch implementation (botorch-0.12.0) of the hypervolume to the one provided by moocore (https://multi-objective.github.io/moocore/python/), which is implemented in C.
Figure_1.png (BoTorch/moocore time ratio): https://github.com/user-attachments/assets/be4d224e-025c-4382-85a0-c97928615b94
It seems BoTorch is several thousand times slower. Maybe I am doing something wrong? I used this code:
```python
import numpy as np
import moocore
from botorch.utils.multi_objective.hypervolume import Hypervolume as botorch_HV
import torch
import timeit

def read_data(name):
    files = {
        "DTLZLinearShape.3d": "~/work/perfassess/hyperv/uc/test/inp/DTLZLinearShape.3d.front.1000pts.10",
        "DTLZLinearShape.4d": "~/work/perfassess/hyperv/uc/test/inp/DTLZLinearShape.4d.front.1000pts.10",
    }
    x = moocore.read_datasets(files[name])[:, :-1]
    x = moocore.filter_dominated(x, maximise=True)
    return x

name = "DTLZLinearShape.3d"
x = read_data(name)
ref = np.zeros(x.shape[1])
botorch_hv = botorch_HV(ref_point=torch.from_numpy(ref))
moocore_hv = moocore.Hypervolume(ref=ref, maximise=True)

n = np.arange(500, 3000 + 1, 500)
botorch_values = []
moocore_values = []
botorch_times = []
moocore_times = []
for maxrow in n:
    z = x[:maxrow, :]
    z_torch = torch.from_numpy(z)
    botorch_values += [botorch_hv.compute(z_torch)]
    moocore_values += [moocore_hv(z)]
    duration = timeit.timeit('botorch_hv.compute(z_torch)', globals=globals(), number=2)
    botorch_times += [duration]
    duration = timeit.timeit('moocore_hv(z)', globals=globals(), number=2)
    moocore_times += [duration]

botorch_values = np.array(botorch_values)
moocore_values = np.array(moocore_values)
assert np.allclose(botorch_values, moocore_values)
botorch_times = np.array(botorch_times)
moocore_times = np.array(moocore_times)

import pandas as pd
import matplotlib.pyplot as plt

cpu = "Intel i5-6200U 2.30GHz"
df = pd.DataFrame(dict(n=n, Ratio=botorch_times / moocore_times)).set_index('n')
df.plot(grid=True, ylabel='BoTorch/moocore time ratio', title=f'HV computation for {name} ({cpu})')
plt.show()
```
I'll let somebody else weigh in on whether this is the right way to benchmark, but for reference, which algorithm are you using for computing HV in moocore?
Documented here: https://multi-objective.github.io/moocore/python/reference/generated/moocore.hypervolume.html#moocore.hypervolume
I haven't looked into the details, but I wouldn't be surprised if the implementation of this particular algorithm were indeed many orders of magnitude slower. The algorithm is very loopy, so it is going to be super slow in Python to begin with. On top of that, it is done in PyTorch, so there is probably a lot of overhead from building the (very complex) compute graph in the background. This algorithm in BoTorch really shouldn't be used (and it isn't used in our multi-objective acquisition functions). We should probably make this clearer in the docs and potentially just remove the algorithm.
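For reference, the box-decomposition route mentioned earlier in the thread looks roughly like this (a minimal sketch with random data, maximization convention):

```python
# Minimal sketch: hypervolume via box decomposition in BoTorch (maximization).
import torch
from botorch.utils.multi_objective.box_decompositions.dominated import (
    DominatedPartitioning,
)

Y = torch.rand(256, 3, dtype=torch.double)      # (n x m) objective values
ref_point = torch.zeros(3, dtype=torch.double)  # reference point

bd = DominatedPartitioning(ref_point=ref_point, Y=Y)
hv = bd.compute_hypervolume()  # hypervolume dominated by Y w.r.t. ref_point
```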
It would be nice to have a highly-optimized C/C++ library of fundamental algorithms (not only hypervolume, but dominance filtering, nondominated sorting, other metrics like IGD+ and epsilon, etc) for multi-objective optimization that can be used in R and Python.
Indeed, that would be nice.
Well, that is what moocore is meant to be.
https://botorch.org/api/utils.html#botorch.utils.multi_objective.pareto.is_non_dominated
versus
https://multi-objective.github.io/moocore/python/reference/generated/moocore.is_nondominated.html
```python
import numpy as np
import moocore
import torch
import timeit
from botorch.utils.multi_objective.pareto import is_non_dominated as botorch_is_non_dominated

name = "CPFs.txt"
x = moocore.get_dataset(name)[:, :-1]

n = np.arange(100, len(x) + 1, 250)
botorch_times = []
moocore_times = []
for maxrow in n:
    z = x[:maxrow, :]
    z_torch = torch.from_numpy(z)
    botorch_values = botorch_is_non_dominated(z_torch)
    duration = timeit.timeit('botorch_is_non_dominated(z_torch)', globals=globals(), number=5)
    botorch_times += [duration]
    moocore_values = moocore.is_nondominated(z, maximise=True)
    duration = timeit.timeit('moocore.is_nondominated(z, maximise=True)', globals=globals(), number=5)
    moocore_times += [duration]
    assert np.allclose(botorch_values, moocore_values)

botorch_times = np.array(botorch_times)
moocore_times = np.array(moocore_times)

import pandas as pd
import matplotlib.pyplot as plt

cpu = "Intel i5-6200U 2.30GHz"
df = pd.DataFrame(dict(n=n, BoTorch=botorch_times, moocore=moocore_times)).set_index('n')
df.plot(grid=True, ylabel='CPU time (seconds)', title=f'is_non_dominated() for {name} ({cpu})')
plt.yscale("log")
plt.show()
```
I'm curious about what difference in timing it makes if you set `deduplicate=False` in `is_non_dominated`.
I'm curious about what difference in timing it makes if you set `deduplicate=False` in `is_non_dominated`.
Both seem to be slightly slower.
```python
import numpy as np
import moocore
import torch
import timeit
from botorch.utils.multi_objective.pareto import is_non_dominated as botorch_is_non_dominated

name = "CPFs.txt"
x = moocore.get_dataset(name)[:, :-1]

n = np.arange(100, len(x) + 1, 250)
botorch_times = []
moocore_times = []
for maxrow in n:
    z = x[:maxrow, :]
    z_torch = torch.from_numpy(z)
    botorch_values = botorch_is_non_dominated(z_torch, deduplicate=False)
    duration = timeit.timeit('botorch_is_non_dominated(z_torch, deduplicate=False)', globals=globals(), number=10)
    botorch_times += [duration]
    moocore_values = moocore.is_nondominated(z, maximise=True, keep_weakly=True)
    duration = timeit.timeit('moocore.is_nondominated(z, maximise=True, keep_weakly=True)', globals=globals(), number=10)
    moocore_times += [duration]
    assert np.allclose(botorch_values, moocore_values)

botorch_times = np.array(botorch_times)
moocore_times = np.array(moocore_times)

import pandas as pd
import matplotlib.pyplot as plt

cpu = "Intel i5-6200U 2.30GHz"
df = pd.DataFrame(dict(n=n, BoTorch=botorch_times, moocore=moocore_times)).set_index('n')
df.plot(grid=True, ylabel='CPU time (seconds)', title=f'is_non_dominated(deduplicate=False) for {name} ({cpu})')
plt.yscale("log")
plt.show()
```
I added this rule (https://github.com/pytorch/botorch/blob/5012fe8a39b434e1b0f3d3a968eb17b3dd0c9e27/botorch/utils/multi_objective/pareto.py#L47-L48) to `is_non_dominated` to avoid memory issues and very long run times on CPU. It looks like that slows things down quite a bit on a GPU. I ran your benchmark on an A100, and the BoTorch `is_non_dominated` is quite fast on a GPU once we remove that rule (orange line).
```python
import timeit
import torch.utils.benchmark as benchmark
import numpy as np
import torch
from botorch.utils.multi_objective.pareto import _is_non_dominated_loop, MAX_BYTES
from torch import Tensor


def is_non_dominated(
    Y: Tensor,
    maximize: bool = True,
    deduplicate: bool = True,
    force_no_loop: bool = True,
) -> Tensor:
    n = Y.shape[-2]
    if n == 0:
        return torch.zeros(Y.shape[:-1], dtype=torch.bool, device=Y.device)
    el_size = 64 if Y.dtype == torch.double else 32
    if (not force_no_loop) and (
        n > 1000 or n**2 * Y.shape[:-2].numel() * el_size / 8 > MAX_BYTES
    ):
        return _is_non_dominated_loop(Y, maximize=maximize, deduplicate=deduplicate)
    Y1 = Y.unsqueeze(-3)
    Y2 = Y.unsqueeze(-2)
    if maximize:
        dominates = (Y1 >= Y2).all(dim=-1) & (Y1 > Y2).any(dim=-1)
    else:
        dominates = (Y1 <= Y2).all(dim=-1) & (Y1 < Y2).any(dim=-1)
    nd_mask = ~(dominates.any(dim=-1))
    if deduplicate:
        # remove duplicates
        # find index of first occurrence of each unique element
        indices = (Y1 == Y2).all(dim=-1).long().argmax(dim=-1)
        keep = torch.zeros_like(nd_mask)
        keep.scatter_(dim=-1, index=indices, value=1.0)
        return nd_mask & keep
    return nd_mask


def get_duration(z, cuda, force_no_loop):
    z_torch = torch.tensor(z, device=torch.device("cuda" if cuda else "cpu"))
    is_non_dominated(z_torch, deduplicate=False, force_no_loop=force_no_loop)
    t0 = benchmark.Timer(
        stmt='is_non_dominated(z_torch, deduplicate=False, force_no_loop=force_no_loop)',
        globals={'z_torch': z_torch, 'force_no_loop': force_no_loop, "is_non_dominated": is_non_dominated},
    )
    return t0.timeit(10).mean


name = "|"
n = np.arange(100, len(x) + 1, 250)
botorch_cpu_no_loop_times = []
botorch_cpu_times = []
botorch_gpu_no_loop_times = []
botorch_gpu_times = []
for maxrow in n:
    z = x[:maxrow, :]
    botorch_cpu_times.append(get_duration(z, cuda=False, force_no_loop=False))
    botorch_cpu_no_loop_times.append(get_duration(z, cuda=False, force_no_loop=True))
    botorch_gpu_times.append(get_duration(z, cuda=True, force_no_loop=False))
    botorch_gpu_no_loop_times.append(get_duration(z, cuda=True, force_no_loop=True))

botorch_cpu_times = np.array(botorch_cpu_times)
botorch_cpu_no_loop_times = np.array(botorch_cpu_no_loop_times)
botorch_gpu_times = np.array(botorch_gpu_times)
botorch_gpu_no_loop_times = np.array(botorch_gpu_no_loop_times)

import matplotlib.pyplot as plt
import pandas as pd

cpu = "GPU"
df = pd.DataFrame(
    dict(
        n=n,
        botorch_cpu=botorch_cpu_times,
        botorch_cpu_no_loop=botorch_cpu_no_loop_times,
        botorch_gpu=botorch_gpu_times,
        botorch_gpu_no_loop=botorch_gpu_no_loop_times,
    )
).set_index("n")
df.plot(
    grid=True,
    ylabel="Time (seconds)",
    title="is_non_dominated(deduplicate=False)",
)
plt.yscale("log")
plt.show()
```
I added this rule to `is_non_dominated` to avoid memory issues and very long run times on CPU. It looks like that slows things down quite a bit on a GPU. I ran your benchmark on an A100, and the BoTorch `is_non_dominated` is quite fast on a GPU once we remove that rule (orange line).
How does it compare with `moocore.is_nondominated`? Note that the dataset that I used comes with the moocore package as well (it is documented here: https://multi-objective.github.io/moocore/r/reference/CPFs.html; I am not sure how to document datasets in Python), so you can use that one if you wish (I don't see how x is defined in your code).
Hi Manuel,
Thanks for raising this performance issue, but it would be helpful if you could clarify your intent here. This method is used internally for our (differentiable) HV computations. It’s written in pure PyTorch because we do not wish to have any C/C++ code or external dependencies since that limits our user base and is frankly quite cumbersome to deal with internally.
If you have an algorithm or way of implementing things that you believe is more performant and would like to contribute to our project, we’d really appreciate any PRs. The performance of moocore is quite impressive and we’d surely benefit from your expertise here.
Kind regards, Eytan
Hi Eytan,
For now my intent is to be sure that what I am observing is a real difference in performance and not an error on my side. I was not expecting such a huge difference!
Ultimately, my intent is to provide a C/C++ library, wrapped by high-quality Python and R (and possibly other languages) packages, of high-performing fundamental algorithms (not only hypervolume computation, but dominance filtering, nondominated sorting, other metrics like IGD+ and epsilon, etc.). I am frustrated by the lack of good implementations of such fundamental algorithms in Python and R, which leads to wasted time and CPU cycles and to myths like "hypervolume can only be used with very few points".
It is not my intent to convince you to use moocore as it is clearly not designed with GPUs in mind (at least not yet, maybe in the future, if I get some help or funding...). Nevertheless, knowing when functions in botorch are faster/slower than the equivalent ones in moocore is useful, if I have to advise colleagues or students when to use what option. I may create an example for the documentation of moocore illustrating the differences.
The hypervolume function in BoTorch is especially bad, since it is clearly too slow (and you have a TODO note there suggesting to use C++). You may want to remove the function and/or point to better alternatives like moocore or pymoo (although pymoo is still hundreds of times slower than moocore, and it is my intention to convince them to switch to moocore).
The hypervolume algorithms for 2D and 3D in moocore are optimal, and I don't think the box-decomposition is worth it with fewer than 5 objectives (happy to be proven otherwise if someone wants to try), so one can only tweak the implementation to save microseconds here and there (or use vectorization/GPUs, but that is currently not in scope for moocore as mentioned above).
I hope the above clarifies my intent!
Kind regards,
Manuel.
Note that the dataset that I used comes with the moocore package as well (it is documented here, I am not sure how to document datasets in Python), so you can use that one if you wish to (I don't see how x is defined in your code).
I used the same dataset. I just copied it, since I didn't have moocore installed as a dependency.
How does it compare with `moocore.is_nondominated`?
I didn't explicitly test moocore, since I didn't have it installed, but I would imagine they are at least competitive on a GPU. It looks like there are also some differences in the methods (e.g. it appears moocore is only looking for weakly pareto optimal points, but I would have to take a closer look). It could be that we should be using a faster algorithm on CPU as well.
How does it compare with `moocore.is_nondominated`?
I didn't explicitly test moocore, since I didn't have it installed, but I would imagine they are at least competitive on a GPU. It looks like there are also some differences in the methods (e.g. it appears moocore is only looking for weakly pareto optimal points, but I would have to take a closer look). It could be that we should be using a faster algorithm on CPU as well.
The default is to minimise all objectives and to remove weakly dominated points, but you can control that with the arguments `maximise=True` and `keep_weakly=True`. In my tests above, I made sure that the output of the two functions is identical using `assert np.allclose(botorch_values, moocore_values)`.
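To make the correspondence concrete, here is a tiny example with a duplicated point; the expected output follows from the assertions in the benchmarks above:

```python
# Small illustration of how the flags line up (maximization convention).
import numpy as np
import torch
import moocore
from botorch.utils.multi_objective.pareto import is_non_dominated

# The second point duplicates the first; the last point is dominated.
Y = np.array([[2.0, 1.0], [2.0, 1.0], [1.0, 2.0], [0.5, 0.5]])

# BoTorch deduplicate=False and moocore keep_weakly=True both keep the duplicate:
bt = is_non_dominated(torch.from_numpy(Y), deduplicate=False).numpy()
mc = moocore.is_nondominated(Y, maximise=True, keep_weakly=True)
assert np.array_equal(bt, mc)  # expected: [ True  True  True False]
```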