
related to #347 - Examples And Performance Results

Open keightyfive opened this issue 6 years ago • 3 comments

Hi, so I had to slightly modify the autotuner_parallel.sh script, and I still have a few questions:

  1. Does the script work in srun mode as well?
  2. What exactly does the script do? Does it tune the mapping for all the benchmarks and then run all of them 1000 times, as described in the paper?
  3. Is there a way to run the kernels individually without autotuning?
  4. Do they run only on NVidia GPUs, or can one run them on CPUs as well?
  5. I installed TC using the conda package with pytorch integration (https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/getting_started.html#installation), since building from source doesn't work for me, which I reported in a separate issue #407. Will I need to install the conda package for caffe2 as well to run the benchmarks?

Cheers, Kevin

keightyfive avatar May 11 '18 19:05 keightyfive

Has this site gone dead already??

keightyfive avatar May 18 '18 16:05 keightyfive

Has this site gone dead already??

During conference submission crunch time, yes unfortunately :) but we're now back

Regarding benchmarking, we have been making progress in #423 this week; best is to wait until we land it early next week. I had to resort to some build hacks after the caffe2 source of truth moved to pytorch, which I will need to clean up.

Regarding your previous questions:

Does the script work in srun mode as well?

I have only run it myself in sbatch mode; it should run in srun mode with minor modifications, there is nothing magical about it. The current script uses SLURM_ARRAY_JOB_ID to set a path, but you can easily adapt it and set the path to whatever you like.
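As a rough sketch of the adaptation described above (the fallback variable and directory names here are assumptions, not the actual script), one could derive the output path from SLURM_ARRAY_JOB_ID when it is set and fall back to a manually chosen name under srun:

```shell
#!/bin/bash
# Hypothetical sketch: under sbatch, SLURM sets SLURM_ARRAY_JOB_ID;
# under srun it may be absent, so fall back to a manual directory name.
LOG_DIR="${SLURM_ARRAY_JOB_ID:-manual_run}"
OUTPUT_PATH="./autotuner_logs/${LOG_DIR}"
mkdir -p "${OUTPUT_PATH}"
echo "Writing autotuner output to ${OUTPUT_PATH}"
```

The only sbatch-specific piece is the environment variable, so this is all the "adaptation" amounts to.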

What exactly does the script do? Does it tune the mapping for all the benchmarks and then run all of them 1000 times, as described in the paper?

Essentially, yes. We have reduced the default to 100 iterations; if you want to run 1000 times, pass --benchmark_iterations=1000.

Is there a way to run the kernels individually without autotuning?

Yes. If you check out the branch from #423, you can then just run ./build/tc/benchmarks/benchmark_xxx --gtest_filter="*P100*" for the Pascal benchmarks, and *V100* for the Volta benchmarks. We have saved the best options (see tc/benchmarks/*.h), so the numbers are easily reproducible.
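Putting the two answers above together, a small dry-run sketch of the per-architecture invocations might look like this (benchmark_xxx stands in for an actual benchmark binary built from the #423 branch, so the commands are assembled and printed rather than executed):

```shell
#!/bin/bash
# Hypothetical sketch: assemble the benchmark commands described above,
# one per GPU architecture, including the iteration-count flag.
# benchmark_xxx is a placeholder for a real binary under build/tc/benchmarks/.
CMDS=()
for arch in P100 V100; do
  CMDS+=("./build/tc/benchmarks/benchmark_xxx --gtest_filter=\"*${arch}*\" --benchmark_iterations=1000")
done
printf '%s\n' "${CMDS[@]}"
```

Dropping the echo/printf indirection and running the commands directly reproduces the saved-options numbers, one gtest filter per architecture.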

Do they run only on NVidia GPUs, or can one run them on CPUs as well?

GPU only for now. I'm prioritizing CPU support starting this week; I would say it will take about a month to get things into a decent state.

I installed TC using the conda package with pytorch integration (https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/getting_started.html#installation), since building from source doesn't work for me, which I reported in a separate issue #407. Will I need to install the conda package for caffe2 as well to run the benchmarks?

TC + pytorch is still in an extremely alpha state right now; I haven't had the chance to benchmark it myself yet. The benchmarks we report performance on are C++ only at the moment. Regarding the build system, we're definitely not happy about the user experience there, but given the resources we have, it is what it is (i.e. it works for the core dev team). I'd be happy to help you get set up if you are interested in working at that level. We also accept contributions from the community, since this is still an early research project with extremely scarce resources.

nicolasvasilache avatar May 19 '18 18:05 nicolasvasilache

Thanks very much for your detailed answers. I'll try installing with caffe2 and see if I can run the kernels. I assume the overall performance will be better after building from source, though... If you are willing to help me with that, it would be greatly appreciated. Hopefully I can contribute in some shape or form. Cheers, Kevin

keightyfive avatar Jun 01 '18 15:06 keightyfive