Running the experimental pipeline?

Open maumueller opened this issue 3 years ago • 4 comments

Hi all!

Let me first thank you for making your source code available!

I was interested in reproducing some of your runs (Table 1 and Figure 3) and found it a bit difficult to do. I found some code that works on the covertype dataset, but nothing else. Is there documentation, or are there shell scripts, that would allow me to run more experiments? :-) If not, would it be possible to get a short overview of how the code is supposed to be run?

Thank you!

Martin

maumueller, Oct 06 '20 09:10

Hi Martin,

thank you for your interest in our work!

Although @kexinrong has written all of the C++ code, I will try, to the best of my knowledge, to give you a high-level idea of how to run the experiments until Kexin can chime in. There are roughly two separate folders to look into:

  • Benchmark: includes instructions for ASKIT and FIGTree.
  • HBE: includes high-level instructions for HBE. The main programs read their inputs through a config file. This assumes that the data have been normalized so that each dimension has variance 1.

In terms of the experiments, for each dataset in question there are a few things that need to be done:

  1. Preprocess the dataset so that each column has variance 1.
  2. Given a kernel (e.g. gaussian) and a bandwidth (typically set to sqrt(2) * n**(-1./(d+4)) for gaussian), compute the ground truth kernel density for the dataset using ComputeExact.cpp, which reads a config file that specifies the number of random queries M.
  3. At this point you can run the experiment using ASKIT or FIGTree.
  4. For HBE and RS:
    • Run FindAdaptiveEps to find the epsilon parameter that RS and HBE should use in the experiments for comparison.
    • Run BatchBenchmark to get results on RS and the variants of HBE.
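
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration, not the repository's code: the function names are made up, the data are synthetic, and the kernel form exp(-||x - q||^2 / h^2) is an assumption, so ComputeExact.cpp may use a different convention or scaling.

```python
import numpy as np


def normalize_columns(data):
    """Step 1: rescale every column to unit variance."""
    std = data.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged
    return data / std


def gaussian_bandwidth(n, d):
    """Bandwidth rule quoted above: sqrt(2) * n**(-1/(d+4))."""
    return np.sqrt(2) * n ** (-1.0 / (d + 4))


def exact_kde(data, queries, h):
    """Step 2 (brute force): average kernel value at each query.

    ASSUMPTION: the kernel is exp(-||x - q||^2 / h^2); the actual
    convention in ComputeExact.cpp may differ.
    """
    diffs = queries[:, None, :] - data[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return np.exp(-sq_dists / h ** 2).mean(axis=1)


# Synthetic stand-in for a real dataset: 1000 points, 5 columns
# with deliberately different scales.
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 5)) * np.array([1.0, 10.0, 0.1, 3.0, 50.0])

data = normalize_columns(raw)
n, d = data.shape
h = gaussian_bandwidth(n, d)

M = 10  # number of random queries, as in the config file
queries = data[rng.choice(n, size=M, replace=False)]
truth = exact_kde(data, queries, h)

print("bandwidth:", h)
print("exact densities:", truth)
```

The brute-force pass costs O(n * M * d) time and memory, which is fine for a quick sanity check on a small dataset but not a substitute for the C++ program on the larger ones.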

psiminelakis, Oct 06 '20 15:10

Paris - Thanks for the information!

Hi Martin:

Thanks for reaching out. Please refer to Paris's comments on the overall project structure and preprocessing. A few additions/clarifications:

  • For Table 1: RunAdaptive is the main program for timing/accuracy measurements. BatchBenchmark is for results related to sketching.
  • For Figure 3: Diagnosis outputs the estimates of relative variances for RS and HBE for a given dataset.

You can change the main program by modifying the executable in cmake.

Hope it helps!

kexinrong, Oct 07 '20 00:10

Thank you, @kexinrong and @psiminelakis! I'll try to work from what you described. It's a bit unfortunate that there is no bash script that exemplifies how to run everything for at least one of the datasets.

I hope it's ok if I keep this open and get back to you if I have more questions.

maumueller, Oct 08 '20 20:10

One additional note: with the newly updated datasets, it should be easier to get an example run. For example, if you compile RunAdaptive as the main executable, you can run adaptive sampling with RS, with eps=0.2, using

    ./hbe conf/shuttle.cfg gaussian 0.2 true

To run adaptive sampling with HBE, with eps=0.9:

    ./hbe conf/shuttle.cfg gaussian 0.9

The top of each main program contains some comments on example usage.

I also suggest starting with the shuttle dataset, since it's smaller and easier to debug.
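
For anyone who wants to script the two shuttle runs, a small Python driver along these lines might help. It is a sketch under assumptions: the helper names are hypothetical, the trailing `true` argument is taken to select RS only because that is how the commands above use it, and the binary is assumed to be `./hbe` built with RunAdaptive as the main program. If the binary is not found, the script only prints what it would have run.

```python
import os
import subprocess


def build_command(config, kernel, eps, use_rs=False, binary="./hbe"):
    """Assemble an argument list in the form shown in the comment above.

    ASSUMPTION: a trailing "true" selects RS and omitting it selects HBE,
    as inferred from the two example commands.
    """
    cmd = [binary, config, kernel, str(eps)]
    if use_rs:
        cmd.append("true")
    return cmd


def run(cmd):
    """Run the command if the binary exists; otherwise just report it."""
    if not os.path.exists(cmd[0]):
        print("skipping (binary not found):", " ".join(cmd))
        return None
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# The two example runs on the shuttle dataset.
rs_cmd = build_command("conf/shuttle.cfg", "gaussian", 0.2, use_rs=True)
hbe_cmd = build_command("conf/shuttle.cfg", "gaussian", 0.9)

for cmd in (rs_cmd, hbe_cmd):
    output = run(cmd)
    if output is not None:
        print(output)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failing run raise immediately instead of silently producing empty output.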

kexinrong, Oct 09 '20 04:10