
Rely on deques or similar

Open Ironholds opened this issue 8 years ago • 10 comments

Because vectors are stored in contiguous memory, push_back()-style operations on them are vulnerable to slowdowns from periodic reallocations. I'd strongly suggest using (say) deques instead, unless memory is a more important concern than speed. Would you like me to take a stab at implementing it?

Ironholds avatar May 28 '17 04:05 Ironholds

@Ironholds: Thanks Oliver! Before you take a stab at it, I think @bblonder and I should figure something out first (below).

@bblonder: I just compared my implementation (#1) versus the one implemented in 2.0.5. Mine spends almost zero time in the kd-tree code, versus 80% for 2.0.5. Any idea what's behind the discrepancy? If we could figure that out, then optimizing the kd-tree code wouldn't be necessary.

davharris avatar May 29 '17 19:05 davharris

@bblonder There are at least two differences between your code and mine.

  1. About 99% of the samples that are fed to the kd-ball-checking algorithm are outside of the SVM's hypervolume. That increases the number of samples by ~100x.

  2. You're sampling around every data point instead of just around the support vectors. Assuming my radius is reasonable, this is not necessary to encapsulate the hypervolume; it just makes the tree many times bigger. It also increases the rejection rate, which leads to more runs of the KD tree code than would otherwise be needed.

davharris avatar May 29 '17 20:05 davharris

@davharris,

There is now a 2.0.6 version that goes back to building the elliptical grid using only the SVM points. This dramatically increases speed in the microbenchmarks I tried.

I also changed some point-density arguments to improve runtimes in high dimensions, addressing point #2. However, I do not see how point #1 works out: the same radius is used in your version and mine. If you have an alternate proposal, please suggest it and we can add it.

@Ironholds,

I am open to using different data structures here. Please feel free to propose a re-implementation using deques. All the relevant code is probably in convert.cpp. If you can demonstrate your new solution gives the same outputs and is faster, then I will add it to the package gratefully and acknowledge your efforts.

Ben


--
Benjamin Blonder
Environmental Change Institute
School of Geography and the Environment
University of Oxford

Website + photoblog: http://www.benjaminblonder.org
Google Scholar: http://scholar.google.com/citations?user=l3FNoE8AAAAJ
University of Arizona Sky School: https://skyschool.arizona.edu/

bblonder avatar May 30 '17 10:05 bblonder

@bblonder In my code, we:

  1. sample from the hypersphere
  2. throw out 99% of the points because they're outside the SVM boundary
  3. build the KD tree

If I understand correctly, you're switching steps 2 and 3 relative to my code, which makes your KD tree need to deal with 100x more data points than mine.

davharris avatar Jun 04 '17 20:06 davharris

@davharris,

Yes, that is correct. I made that choice on the principle that evaluating the kdtree more times would be faster than evaluating the model (SVM or KDE) more times. However, open to flipping it around if actually faster. I tried to do this quickly today but can't seem to get the volume estimates to come out right (see sample_model_ellipsoid_dave_broken()). If you have time and want to have a look, please feel free to edit this function.

Ben


bblonder avatar Jun 05 '17 17:06 bblonder

Cool. Don't have time at the moment, but the first thing I'd check is the denominator. You'll want it to equal the number of full_samples, I think. You can double-check my code in case I'm misremembering.

Dave


davharris avatar Jun 05 '17 18:06 davharris

Should I wait for this discussion to resolve before messing with implementations?

Ironholds avatar Jun 06 '17 02:06 Ironholds

Hi Oliver,

No, feel free to try a deque implementation now. My discussion with Dave is at a more abstract level than the problem you are interested in, so the two can probably be solved independently.

Ben


bblonder avatar Jun 06 '17 12:06 bblonder

Alright, well my machine is a Mac, which can no longer locally install hypervolume* without some pain, so this may have to take an L for a while. But I'd thoroughly recommend literally just swapping out std::vector for std::deque and seeing how it plays.

  • hypervolume requires hitandrun which requires rcdd which requires GNU MP (fine) and GCC, even on a mac (what?!)

Ironholds avatar Jun 08 '17 00:06 Ironholds

Installing the Developer Tools should include a copy of gcc and let you get started. You might also be able to install a binary version of hitandrun to avoid this issue.

Unfortunately it is not so simple to do this swap-out - the KDTree.h file uses a few calls that are implemented by std::vector and not by std::deque. I don't have time to fix this up myself right now, but here at least is a benchmarking script you can use to see if doing this will help:

```r
library(microbenchmark)
library(hypervolume)

dohv <- function(np, d) {
  data <- rnorm(np * d)
  data <- matrix(data, ncol = d)
  mb <- microbenchmark(hypervolume_svm(data), hypervolume_gaussian(data), times = 2)
  return(mb)
}

params <- expand.grid(np = c(100, 500, 1000), d = c(2, 3, 4, 5))

results <- lapply(1:nrow(params), function(i) {
  dohv(params$np[i], params$d[i])
})

results_bound <- cbind(params, t(sapply(results, function(x) { tapply(x$time, x$expr, mean) })))
names(results_bound) <- c("np", "d", "svm", "gaussian")
write.csv(results_bound, row.names = FALSE, file = '~/Desktop/hv_vector.csv')
```

Still want to try looking into this?

Ben


bblonder avatar Jun 08 '17 12:06 bblonder