Optimize PARSynthesizer's performance
Problem Description
A number of SDV users have run into performance issues when using PARSynthesizer with their data. The issues usually manifest as regular out-of-memory errors or CUDA out-of-memory errors. Other times, it just takes a long time to train the model.
I'm creating this thread to collect all of these examples from the community so the SDV core team has the context they need to understand and improve the performance of PARSynthesizer.
For anyone using SDV PARSynthesizer, please add new examples of performance issues to this thread!
Reported Example 1
Out of regular memory error
https://github.com/sdv-dev/SDV/issues/1952 by @prupireddy
```
RuntimeError: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 683656 bytes.
```
"I find this particularly surprising given that I am running this on a machine with 128 GB RAM and I just restarted it."
Suggested Workaround
My recommendation would be to sample the data to reduce the footprint. You can either use fewer rows per sequence or try fewer sequences overall. Start with a much lower sample than you think you need (maybe a 5% sample of your data) and then increase by 5% each time to improve the data generated by the model.
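As a concrete illustration of that workaround, here is a sketch of sequence-level subsampling. The `subsample_sequences` helper is hypothetical (not part of the SDV API); the point is to sample whole sequences by their sequence key rather than individual rows, so each kept sequence stays intact for PARSynthesizer to learn from:

```python
import pandas as pd

def subsample_sequences(df: pd.DataFrame, key: str, frac: float, seed: int = 0) -> pd.DataFrame:
    """Keep a random fraction of whole sequences, identified by `key`."""
    ids = df[key].drop_duplicates()
    keep = ids.sample(frac=frac, random_state=seed)
    return df[df[key].isin(keep)]

# Toy data: 4 sequences of 2 rows each
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "value": range(8),
})

# Keep 50% of the sequences (2 of 4), each with all of its rows
small = subsample_sequences(df, key="id", frac=0.5)
print(small["id"].nunique())  # -> 2
```

You can then fit the synthesizer on `small`, and re-run with a larger `frac` once training succeeds.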
Reported Example 2
Out of CUDA memory error
https://sdv-space.slack.com/archives/C01GSDFSQ93/p1713451980542979 by Isaac (Slack)
Use Case: PAR for forecasting time series
Scale of data:
- 50k sequences
- 45 rows per sequence
- Total: ~2.2M rows
Attempted Workarounds:
- Setting a lower `segment_size` resulted in a new PyTorch error. If I try a sequence length of 8, I get:

```
r.nvmlDeviceGetNvLinkRemoteDeviceType_ INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1712608853099/work/c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType
```
Example Code (Srini):
```python
import numpy as np
import pandas as pd

# ID column: 50,000 sequences, each with 45 rows
ids = np.arange(50_000)
ids = np.repeat(ids, 45)

# Sequence index column: ticks 0..44 within each sequence
ticks = np.arange(45)
ticks = np.tile(ticks, 50_000)

# Observations column: one draw per row (vectorized rather than a Python loop)
obs = np.random.normal(loc=5, scale=1, size=len(ids))

df = pd.DataFrame({
    "id": ids,
    "ticks": ticks,
    "obs": obs,
})

from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)
metadata.update_column(column_name='id', sdtype='id')
metadata.set_sequence_key(column_name='id')
metadata.set_sequence_index(column_name='ticks')

synthesizer = PARSynthesizer(metadata, verbose=True)
synthesizer.fit(df)
```
I recently ran into the problem described in Example 1. The way I solved it was to change segment_size from the default to 5 or 10 (or larger), which reduces the computation time. I don't know if this will help you, but it works on my machine. My PARSynthesizer definition looks like this:
""" Step1: Create the synthesizer """
synthesizer = PARSynthesizer(
metadata,
cuda = True,
verbose = True,
epochs = 512,
segment_size = 5,
sample_size = 20,
)
The explanation of segment_size is here: https://docs.sdv.dev/sdv/sequential-data/modeling/parsynthesizer#:~:text=segment_size,into%20any%20segments.
Hope this can help you.
Commenting here to support these suggestions. I run out of memory with 1 million rows / 10 features on a 40 GB GPU. Is there a reason there isn't a batch size parameter (or is there one that I missed)? Obviously you can always subsample the data, but this gets more complicated if it has to be done as part of a pipeline, especially if the data is severely imbalanced.
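For the imbalanced-data case mentioned above, one option is to subsample at the sequence level but stratified by class, so rare classes keep at least one sequence. This is only a sketch with a hypothetical helper (nothing built into SDV), and it assumes each sequence carries a single class label:

```python
import pandas as pd

def stratified_sequence_sample(df, key, label, frac, seed=0):
    """Sample a fraction of sequences per class, keeping at least one sequence each."""
    # One row per sequence, with its class label
    seqs = df.drop_duplicates(subset=key)[[key, label]]
    kept_ids = []
    for _, group in seqs.groupby(label):
        n = max(1, int(len(group) * frac))  # never drop a class entirely
        kept_ids.append(group.sample(n=n, random_state=seed)[key])
    kept = pd.concat(kept_ids)
    return df[df[key].isin(kept)]

# Toy data: three "a" sequences and one rare "b" sequence, 2 rows each
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3, 4, 4],
    "label": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "value": range(8),
})

small = stratified_sequence_sample(df, key="id", label="label", frac=0.5)
# The rare class "b" survives even though a plain 50% sample might drop it
```

A plain random subsample would have dropped class "b" with 50% probability here; stratifying by class guarantees it appears in the training data.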