pytorch_tabular
Seeking for multi-GPU training
Hello, I am wondering whether there is a way to set up multi-GPU training on a single machine, because my data cannot be loaded on a single GPU (32 GB). Thanks in advance!
To set up multi-GPU training on a single machine using PyTorch Lightning with PyTorch Tabular, you can configure the TrainerConfig with the accelerator and devices_list parameters. Here's how you can do it:
from pytorch_tabular import TrainerConfig

trainer_config = TrainerConfig(
    accelerator='gpu',  # Use 'gpu' to enable GPU training
    devices_list=[0, 1]  # Specify the GPUs you want to use, e.g., GPU 0 and 1
)
This configuration will allow you to train your model using multiple GPUs, which can help when your data cannot be loaded on a single GPU with 32GB of memory. Ensure your machine has multiple GPUs available and that they are correctly set up for use with PyTorch Lightning [1].
This does not work. It raises: ValidationError: Value '[0, 1]' of type 'list' could not be converted to Integer
By default, PyTorch Tabular uses all available GPUs. Is that not happening in your case? If you have two GPUs, both should be used.
The following is my training script
import polars as pl
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from typing import List
from pytorch_tabular.models import TabNetModelConfig
from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig, TrainerConfig, OptimizerConfig
# Add a tqdm progress bar
from tqdm import tqdm


def remove_all_none_columns(df: pl.DataFrame) -> pl.DataFrame:
    """
    Remove columns whose values are all None, showing a progress bar.
    """
    columns = df.columns
    # Use tqdm to show progress while removing all-None columns
    non_none_columns = []
    for col in tqdm(columns, desc="Removing empty columns", unit="col"):
        if not df[col].is_null().all():
            non_none_columns.append(col)
    return df.select(non_none_columns)


def custom_age_binning(age: float, bins: List[int]) -> int:
    """
    Assign an age to a category according to a custom age-binning scheme.
    """
    if age < bins[0]:
        return 0
    if age >= bins[-1]:
        return len(bins) - 2
    for i in range(len(bins) - 1):
        if bins[i] <= age < bins[i+1]:
            return i
    return len(bins) - 2


def prepare_data_for_tabnet(df: pl.DataFrame, bins: List[int]):
    df = remove_all_none_columns(df)
    print("Converting to pandas DataFrame...")
    pdf = df.to_pandas()
    pdf = pdf.drop(columns=['sampleid'] if 'sampleid' in pdf.columns else [])
    print("Binning ages...")
    pdf['age_category'] = list(tqdm(
        pdf['age'].apply(lambda x: custom_age_binning(x, bins)),
        desc="Age binning",
        unit="sample"
    ))
    X = pdf.drop(columns=['age', 'age_category'])
    y = pdf['age_category']
    print("Standardizing features...")
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
    return X_scaled, y, scaler


def train_tabnet_model(X_train, y_train, X_test, y_test):
    data_config = DataConfig(
        target=['age_category'],
        continuous_cols=list(X_train.columns)
    )
    trainer_config = TrainerConfig(
        max_epochs=500,
        batch_size=32,
        auto_lr_find=True,
        accelerator='gpu' if torch.cuda.is_available() else 'cpu',
        devices=1 if torch.cuda.is_available() else None
    )
    num_classes = len(np.unique(y_train))
    model_config = TabNetModelConfig(
        task="classification",
        n_steps=3,
        seed=42
    )
    optimizer_config = OptimizerConfig()
    tabular_model = TabularModel(
        data_config=data_config,
        model_config=model_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
        verbose=True
    )
    print("Starting model training...")
    tabular_model.fit(
        train=pd.concat([X_train, y_train], axis=1),
        validation=pd.concat([X_test, y_test], axis=1)
    )
    return tabular_model


def main():
    with tqdm(total=5, desc="Overall progress", unit="step") as pbar:
        bins = [0, 1, 3, 8, 15, 25, 40, 55, 65, 80, 105]
        pbar.update(1)
        print("Reading data...")
        df = pl.read_parquet('/data/westlake/2025-0414/aged_data.parquet')
        pbar.update(1)
        print("Preparing data...")
        X, y, scaler = prepare_data_for_tabnet(df, bins)
        pbar.update(1)
        print("Splitting into train/test sets...")
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )
        pbar.update(1)
        print("Training model...")
        model = train_tabnet_model(X_train, y_train, X_test, y_test)
        pbar.update(1)
    results = model.evaluate(pd.concat([X_test, y_test], axis=1))
    print("Model Evaluation Results:", results)
    return model


if __name__ == "__main__":
    main()
and the following is the error
Could you please help me figure out what's wrong?
Is torch.cuda.is_available() true? If not, maybe your torch installation is a CPU-only version?
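A quick way to answer this is to print what PyTorch itself can see. The following is a minimal sketch, not part of the original thread: the helper name describe_visible_gpus is hypothetical, and it only uses torch.cuda.is_available(), torch.cuda.device_count() and torch.cuda.get_device_name(). It falls back to "no GPUs" if torch is not installed.

```python
def describe_visible_gpus():
    """Return (cuda_available, device_names) as seen by PyTorch.

    Hypothetical helper for illustration. Returns (False, []) when torch
    is not installed, so the check itself never crashes.
    """
    try:
        import torch
    except ImportError:
        return False, []

    available = bool(torch.cuda.is_available())
    # Only query devices when CUDA is usable; otherwise report none.
    count = torch.cuda.device_count() if available else 0
    names = [torch.cuda.get_device_name(i) for i in range(count)]
    return available, names


available, names = describe_visible_gpus()
print(f"CUDA available: {available}")
print(f"Visible GPUs ({len(names)}): {names}")
```

If fewer GPUs are listed than the machine has, the CUDA_VISIBLE_DEVICES environment variable may be restricting what the process can see.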
torch.cuda.is_available() is True.
I just saw the error screenshot you posted... It looks like PyTorch is recognising multiple GPU devices, but it's hitting an OOM (out-of-memory) error. Maybe try reducing the batch size?
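One way to try smaller batch sizes systematically is to retry training with the batch size halved after each out-of-memory failure. This is only a sketch under stated assumptions: fit_with_batch_backoff and fake_fit are hypothetical names, fit_fn stands in for whatever function builds the TrainerConfig and calls fit with the given batch size, and the OOM is detected by looking for "out of memory" in the RuntimeError message, which is how PyTorch reports CUDA OOM.

```python
def fit_with_batch_backoff(fit_fn, batch_size, min_batch_size=1):
    """Call fit_fn(batch_size), halving batch_size after each OOM error.

    fit_fn is a hypothetical callable that trains with the given batch
    size. Returns the batch size that finally succeeded.
    """
    while True:
        try:
            fit_fn(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err).lower():
                raise
            # Give up once the smallest allowed batch size has failed too.
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"OOM, retrying with batch_size={batch_size}")


def fake_fit(batch_size):
    # Stand-in for real training: pretend batches above 8 rows overflow.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")


chosen = fit_with_batch_backoff(fake_fit, batch_size=32)
print(f"Training succeeded with batch_size={chosen}")
```

In a real run it may also help to rebuild the model and free cached GPU memory between attempts. If the OOM persists even at very small batch sizes, the batch size is probably not the cause.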
Multi-GPU training has multiple strategies. Maybe you should head over to the PyTorch Lightning documentation, read up on them, and choose the one that makes the most sense for you. They are all supported in PyTorch Tabular.
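As a rough illustration of selecting a strategy, the sketch below builds the keyword arguments for TrainerConfig. It rests on assumptions that should be checked against the installed version: that devices takes an integer (with -1 meaning all visible devices), and that trainer_kwargs is forwarded to the Lightning Trainer, so a Lightning strategy name such as "ddp" can be passed through it. The helper names multi_gpu_trainer_settings and build_trainer_config are hypothetical.

```python
def multi_gpu_trainer_settings(num_devices=-1, strategy="ddp"):
    """Build keyword arguments for pytorch_tabular's TrainerConfig.

    Assumptions (verify against your installed version): `devices` is an
    integer, -1 requests every visible device, and `trainer_kwargs` is
    passed on to the PyTorch Lightning Trainer.
    """
    return {
        "accelerator": "gpu",
        "devices": num_devices,  # an int, not a list such as [0, 1]
        "batch_size": 32,
        "max_epochs": 500,
        "trainer_kwargs": {"strategy": strategy},
    }


def build_trainer_config(settings):
    # Imported lazily so the settings helper works without pytorch_tabular.
    from pytorch_tabular.config import TrainerConfig
    return TrainerConfig(**settings)


settings = multi_gpu_trainer_settings()
print(settings)
# On a machine with pytorch_tabular installed:
# trainer_config = build_trainer_config(settings)
```

Note that with DDP each GPU holds a full copy of the model and processes its own batch of batch_size rows, so the effective batch size is batch_size multiplied by the number of devices. Adding GPUs this way speeds up training but does not reduce the memory needed per GPU.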
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.