
[Draft] Distributed tuning

Open · brianchooou opened this issue 1 year ago • 1 comment

[Draft] TensileParallel Documentation

Overview

TensileParallel is an enhancement to the original Tensile tuning tool that enables parallel tuning across multiple GPU devices. It optimizes the tuning process by distributing workloads across available GPUs, significantly reducing the total tuning time.

Features

  • Multi-GPU support for parallel tuning
  • Automatic workload distribution and load balancing
  • Fallback mechanism to standard Tensile execution
  • Comprehensive logging and error handling
  • Automatic results merging from multiple devices

Prerequisites

  • ROCm environment with hipBLASLt installed
  • Python 3.x
  • Multiple AMD GPU devices (optional)

Installation

No additional installation required beyond the standard hipBLASLt setup.

Usage

Basic Command

cd /hipBLASLt/tensilelite
./Tensile/bin/TensileParallel <config.yaml> <output_path>

Configuration for Device Selection

Modify your config.yaml to specify GPU devices using the DeviceList parameter under GlobalParameters:

  1. Specific Devices

     GlobalParameters:
       ...
       DeviceList: [1, 2, 3]  # Use GPUs 1, 2, and 3

  2. All Available Devices

     GlobalParameters:
       ...
       DeviceList: [-1]  # Use all available GPUs

  3. Default Behavior
     • If DeviceList is not specified or is empty, TensileParallel falls back to standard Tensile execution
     • If any of the specified devices are unavailable, TensileParallel falls back to standard Tensile execution (a sketch of this fallback check follows below)
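
The fallback decision itself is internal to TensileParallel; the sketch below only illustrates the rules listed above. Here get_available_device_ids(), run_parallel_tuning(), and run_standard_tensile() are hypothetical placeholders, not actual TensileParallel functions.

# Sketch only: illustrates the DeviceList fallback rules described above.
# get_available_device_ids(), run_parallel_tuning(), and run_standard_tensile()
# are hypothetical placeholders, not real TensileParallel APIs.

def select_devices(global_params):
    requested = global_params.get("DeviceList") or []
    available = get_available_device_ids()   # hypothetical: enumerate visible GPUs

    if not requested:                         # DeviceList missing or empty
        return []                             # -> standard Tensile execution
    if requested == [-1]:                     # [-1] means "use all available GPUs"
        return available
    if all(dev in available for dev in requested):
        return requested
    return []                                 # requested devices unavailable -> fall back

def run(config_path, output_path, global_params):
    devices = select_devices(global_params)
    if devices:
        run_parallel_tuning(config_path, output_path, devices)
    else:
        run_standard_tensile(config_path, output_path)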

Execution Flow

  1. Configuration Loading
     • Validates the input configuration
     • Checks device availability
  2. Workload Distribution
     • Analyzes problem sizes
     • Distributes the workload based on complexity
     • Generates device-specific configurations
  3. Parallel Execution
     • Runs tuning processes on the selected devices (see the sketch after this list)
     • Monitors progress and handles errors
  4. Results Processing
     • Merges results from all devices
     • Generates an execution summary
     • Creates consolidated output
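
The scheduling logic above is internal to the tool, but the parallel-execution step (items 2–3) can be sketched as follows. This is only an illustration: it assumes the device-specific configs generated during workload distribution, that the standard Tensile driver accepts the same <config.yaml> <output_path> arguments as in the Basic Command, and that HIP_VISIBLE_DEVICES is used to pin each run to one GPU.

# Sketch only: launches the standard Tensile driver once per selected GPU, with
# HIP_VISIBLE_DEVICES restricting each run to its own device, then waits for all
# runs to finish.  The driver path and per-device config naming are assumptions.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_device(device_id, config_path, output_root):
    env = os.environ.copy()
    env["HIP_VISIBLE_DEVICES"] = str(device_id)          # pin this run to one GPU
    out_dir = os.path.join(output_root, "outputs", f"gpu_{device_id}")
    os.makedirs(out_dir, exist_ok=True)
    return subprocess.run(["./Tensile/bin/Tensile", config_path, out_dir],
                          env=env).returncode

def run_parallel(device_configs, output_root):
    # device_configs maps GPU id -> device-specific config, e.g. {0: "config_gpu_0.yaml"}
    with ThreadPoolExecutor(max_workers=len(device_configs)) as pool:
        futures = {dev: pool.submit(run_on_device, dev, cfg, output_root)
                   for dev, cfg in device_configs.items()}
    for dev, future in futures.items():
        rc = future.result()
        status = "ok" if rc == 0 else f"failed (rc={rc})"
        print(f"gpu_{dev}: {status}")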

Output Structure

output_path/
├── config_gpu_*.yaml        # Device-specific configurations
├── outputs/
│   ├── gpu_0/              # Results from GPU 0
│   ├── gpu_1/              # Results from GPU 1
│   └── ...
└── merged_output/          # Final merged results
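
As a final illustration, collecting the per-device results into merged_output/ could look like the sketch below. The directory names follow the tree above; the real merge step also reconciles overlapping tuning results, which depends on the Tensile output format and is omitted here.

# Sketch only: copies result YAML files from each outputs/gpu_*/ directory into
# merged_output/, prefixing filenames with the GPU directory to avoid collisions.
# The real TensileParallel merge also reconciles overlapping tuning entries.
import glob
import os
import shutil

def collect_results(output_root):
    merged_dir = os.path.join(output_root, "merged_output")
    os.makedirs(merged_dir, exist_ok=True)
    pattern = os.path.join(output_root, "outputs", "gpu_*", "**", "*.yaml")
    for path in glob.glob(pattern, recursive=True):
        rel = os.path.relpath(path, os.path.join(output_root, "outputs"))
        gpu_name = rel.split(os.sep)[0]       # e.g. "gpu_0"
        dest = os.path.join(merged_dir, f"{gpu_name}_{os.path.basename(path)}")
        shutil.copy2(path, dest)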

brianchooou · Dec 10 '24 04:12

Please resolve merge conflicts or close this PR to complete the task of importing PRs from this repo to the monorepo.

jayhawk-commits · Jun 20 '25 18:06