Seeking multi-GPU training

Open SHIMURA0 opened this issue 6 months ago • 6 comments

Hello, I am wondering if there is a way to set up multi-GPU training on a single machine, because my data cannot be loaded on a single GPU (32GB). Thanks in advance!

SHIMURA0 avatar May 16 '25 02:05 SHIMURA0

To set up multi-GPU training on a single machine using PyTorch Lightning with PyTorch Tabular, you can configure the TrainerConfig with the accelerator and devices_list parameters. Here's how you can do it:

from pytorch_tabular import TrainerConfig

trainer_config = TrainerConfig(
    accelerator='gpu',  # Use 'gpu' to enable GPU training
    devices_list=[0, 1]  # Specify the GPUs you want to use, e.g., GPU 0 and 1
)

This configuration will allow you to train your model using multiple GPUs, which can help when your data cannot be loaded on a single GPU with 32GB of memory. Ensure your machine has multiple GPUs available and that they are correctly set up for use with PyTorch Lightning [1].

dosubot[bot] avatar May 16 '25 02:05 dosubot[bot]

This does not work: ValidationError: Value '[0, 1]' of type 'list' could not be converted to Integer

SHIMURA0 avatar May 16 '25 02:05 SHIMURA0

By default, PyTorch Tabular uses all available GPUs. Is that not happening in your case? If you have two GPUs, both should be used.
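As a rough illustration of relying on that default, here is a minimal sketch that builds the GPU-related keyword arguments as a plain dict. It assumes that `devices` is an integer-typed field where -1 means "every visible GPU", which would also explain why a list was rejected with the ValidationError above. The helper name `all_gpu_settings` is hypothetical and not part of PyTorch Tabular.

```python
def all_gpu_settings(num_gpus: int = -1) -> dict:
    """Build GPU keyword arguments for a trainer configuration.

    Assumption: `devices` is an integer field and -1 means "use every visible GPU".
    """
    # Reject lists such as [0, 1] early, since an integer is expected here.
    if isinstance(num_gpus, bool) or not isinstance(num_gpus, int):
        raise TypeError(f"devices must be an int, got {type(num_gpus).__name__}")
    if num_gpus == 0 or num_gpus < -1:
        raise ValueError("devices must be -1 (all GPUs) or a positive count")
    return {"accelerator": "gpu", "devices": num_gpus}


settings = all_gpu_settings()   # every visible GPU
two_gpus = all_gpu_settings(2)  # exactly two GPUs

# Hypothetical usage, not verified against a specific PyTorch Tabular version:
# trainer_config = TrainerConfig(max_epochs=500, batch_size=32, **settings)
```

Passing an integer count rather than a list of device indices keeps the value compatible with an integer-typed field.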

manujosephv avatar May 16 '25 03:05 manujosephv

The following is my training script:

import polars as pl  
import pandas as pd  
import numpy as np  
import torch  
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from typing import List  
from pytorch_tabular.models import TabNetModelConfig  
from pytorch_tabular import TabularModel  
from pytorch_tabular.config import DataConfig, TrainerConfig, OptimizerConfig  

# Add tqdm progress bars
from tqdm import tqdm  

def remove_all_none_columns(df: pl.DataFrame) -> pl.DataFrame:  
    """
    Remove columns whose values are all None, showing a progress bar
    """
    columns = df.columns  

    # Use tqdm to show progress while removing all-None columns
    non_none_columns = []
    for col in tqdm(columns, desc="Removing empty columns", unit="col"):
        if not df[col].is_null().all():  
            non_none_columns.append(col)  

    return df.select(non_none_columns)  

def custom_age_binning(age: float, bins: List[int]) -> int:  
    """
    Assign an age to a category according to the custom age-bin scheme
    """
    if age < bins[0]:  
        return 0  

    if age >= bins[-1]:  
        return len(bins) - 2  

    for i in range(len(bins) - 1):  
        if bins[i] <= age < bins[i+1]:  
            return i  

    return len(bins) - 2  

def prepare_data_for_tabnet(df: pl.DataFrame, bins: List[int]):  

    df = remove_all_none_columns(df)  
    
    print("Converting to Pandas DataFrame...")
    pdf = df.to_pandas()  

    pdf = pdf.drop(columns=['sampleid'] if 'sampleid' in pdf.columns else [])  

    print("Binning ages...")
    pdf['age_category'] = list(tqdm(
        pdf['age'].apply(lambda x: custom_age_binning(x, bins)),
        desc="Age binning",
        unit="sample"
    ))
    
    X = pdf.drop(columns=['age', 'age_category'])  
    y = pdf['age_category']  

    print("Standardizing features...")
    scaler = StandardScaler()  
    X_scaled = scaler.fit_transform(X)  
    X_scaled = pd.DataFrame(X_scaled, columns=X.columns)  

    return X_scaled, y, scaler  

def train_tabnet_model(X_train, y_train, X_test, y_test):  
 
    data_config = DataConfig(  
        target=['age_category'],  
        continuous_cols=list(X_train.columns)  
    )  

    trainer_config = TrainerConfig(  
        max_epochs=500,  
        batch_size=32,  
        auto_lr_find=True,
        accelerator='gpu' if torch.cuda.is_available() else 'cpu', 
        devices=1 if torch.cuda.is_available() else None  
    )  

    num_classes = len(np.unique(y_train)) 
    model_config = TabNetModelConfig(  
        task="classification",  
        n_steps=3,  
        seed=42 
    )  

    optimizer_config = OptimizerConfig()  

    tabular_model = TabularModel(  
        data_config=data_config,  
        model_config=model_config,  
        optimizer_config=optimizer_config,  
        trainer_config=trainer_config,  
        verbose=True  
    )  

    print("Starting model training...")
    tabular_model.fit(  
        train=pd.concat([X_train, y_train], axis=1),  
        validation=pd.concat([X_test, y_test], axis=1)  
    )  

    return tabular_model

def main():   
    with tqdm(total=5, desc="Overall progress", unit="step") as pbar:
        bins = [0, 1, 3, 8, 15, 25, 40, 55, 65, 80, 105]  
        pbar.update(1)  

        print("Reading data...")
        df = pl.read_parquet('/data/westlake/2025-0414/aged_data.parquet')  
        pbar.update(1)  

        print("Preparing data...")
        X, y, scaler = prepare_data_for_tabnet(df, bins)  
        pbar.update(1)  

        print("Splitting train/test sets...")
        X_train, X_test, y_train, y_test = train_test_split(  
            X, y, test_size=0.2, random_state=42, stratify=y  
        )  
        pbar.update(1)  

        print("Training model...")
        model = train_tabnet_model(X_train, y_train, X_test, y_test)  
        pbar.update(1)  

        results = model.evaluate(pd.concat([X_test, y_test], axis=1))  
        print("Model Evaluation Results:", results)  

    return model  

if __name__ == "__main__":  
    main() 

And the following is the error:

[Image: screenshot of the error]

Could you please help me figure out what's wrong?

SHIMURA0 avatar May 29 '25 01:05 SHIMURA0

Is torch.cuda.is_available() true? If not, maybe your torch installation is a CPU-only version?
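A quick way to check this is sketched below. It only uses standard torch calls (`torch.cuda.is_available()`, `torch.cuda.device_count()`, `torch.version.cuda`); the function name `describe_cuda_setup` is made up for illustration, and the import is guarded so the snippet also runs where torch is missing.

```python
def describe_cuda_setup() -> dict:
    """Collect basic facts about the local torch build and the GPUs it can see."""
    try:
        import torch
    except ImportError:
        # torch is not installed at all, so there is nothing to inspect.
        return {
            "torch_installed": False,
            "cuda_available": False,
            "cuda_build": None,
            "gpu_count": 0,
        }
    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "gpu_count": torch.cuda.device_count(),
    }


report = describe_cuda_setup()
print(report)
```

If `cuda_build` is None, the installed torch was most likely built without CUDA support. If `gpu_count` is lower than the number of physical GPUs, an environment variable such as `CUDA_VISIBLE_DEVICES` may be hiding some of them.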

manujosephv avatar May 29 '25 16:05 manujosephv

Yes, torch.cuda.is_available() is True.

SHIMURA0 avatar May 30 '25 00:05 SHIMURA0

I just saw the error screenshot you posted. It looks like PyTorch is recognising multiple GPU devices but is hitting an OOM error. Maybe try reducing the batch size?
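To make the batch-size suggestion concrete, here is a small arithmetic sketch. It assumes data-parallel training where `batch_size` is applied per device, and that gradient accumulation (the `accumulate_grad_batches` setting) is used to keep the effective batch size unchanged while each step holds fewer samples in memory. The helper `plan_batches` is hypothetical.

```python
import math


def plan_batches(target_effective: int, per_device: int, num_devices: int = 1) -> dict:
    """Pick an accumulation count so the effective batch size reaches the target.

    effective batch size = per-device batch size * devices * accumulation steps
    """
    if min(target_effective, per_device, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    samples_per_step = per_device * num_devices
    accumulate = math.ceil(target_effective / samples_per_step)
    return {
        "batch_size": per_device,
        "accumulate_grad_batches": accumulate,
        "effective_batch_size": samples_per_step * accumulate,
    }


# Keep an effective batch of 32 while each of two GPUs only holds 8 samples per step.
plan = plan_batches(target_effective=32, per_device=8, num_devices=2)
print(plan)
```

Whether a smaller per-device batch is enough depends on where the memory actually goes. If the model itself does not fit on one GPU, reducing the batch size alone will not help.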

Multi-GPU training has multiple strategies. Maybe you should head over to the PyTorch Lightning docs, read up on them, and choose the one that makes the most sense for you. They are all supported in PyTorch Tabular.
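As a sketch of what selecting a strategy might look like, the helper below assembles the relevant settings as a plain dict. The strategy names "ddp", "fsdp" and "deepspeed" are PyTorch Lightning strategy aliases. Forwarding `strategy` through a `trainer_kwargs` field is an assumption about how PyTorch Tabular passes extra arguments to the Lightning Trainer, so check it against the installed version. The helper name `multi_gpu_settings` is hypothetical.

```python
# Lightning strategy aliases covered by this sketch (not an exhaustive list).
KNOWN_STRATEGIES = ("ddp", "fsdp", "deepspeed")


def multi_gpu_settings(num_gpus: int, strategy: str = "ddp") -> dict:
    """Assemble multi-GPU settings for a trainer configuration.

    Assumption: extra Lightning Trainer arguments are forwarded via `trainer_kwargs`.
    """
    if num_gpus < 2:
        raise ValueError("multi-GPU training needs at least two devices")
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown strategy {strategy!r}, expected one of {KNOWN_STRATEGIES}")
    return {
        "accelerator": "gpu",
        "devices": num_gpus,
        "trainer_kwargs": {"strategy": strategy},
    }


ddp_settings = multi_gpu_settings(2)            # model replicated on each GPU
fsdp_settings = multi_gpu_settings(2, "fsdp")   # parameters sharded across GPUs

# Hypothetical usage, not verified against a specific PyTorch Tabular version:
# trainer_config = TrainerConfig(batch_size=32, max_epochs=500, **ddp_settings)
```

With "ddp" every GPU holds a full copy of the model and processes a different slice of each batch, so it speeds up training but does not reduce the per-GPU model footprint. Sharded strategies such as "fsdp" are the ones aimed at models that do not fit on one device.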

manujosephv avatar Jun 12 '25 02:06 manujosephv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 11 '25 03:08 stale[bot]