smogn
It's taking more than 20 hours to sample the data
Hi Nick,
I am seeing a huge runtime for my input data, which is 28K * 59. It has been running for more than a day, even though I have standardized the input data. Is there any possible solution?
dist_matrix: 5%|4 | 276/5671 [50:48<16:50:38, 11.24s/it]
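For anyone reading the progress line above, here is a small sketch of the arithmetic behind it. The numbers are copied from that log line; the note that `smoter` may run more than one `dist_matrix` pass per call is an assumption, not something the log shows.

```python
# Numbers taken from the tqdm line: 276/5671 iterations done at 11.24 s/it.
done, total = 276, 5671
sec_per_it = 11.24

# Time still needed for this pass, matching tqdm's "16:50:38" estimate.
remaining_h = (total - done) * sec_per_it / 3600

# Time for the whole pass at the current rate.
total_h = total * sec_per_it / 3600

print(f"remaining: {remaining_h:.1f} h, whole pass: {total_h:.1f} h")
```

So a single distance-matrix pass over 5671 rows already costs most of a day at this rate, and any further pass (assumed, see above) adds to that.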
same issue here
Same issue here. It has been running for more than a day on a dataset of 5 lakh (500,000) * 32.
same issue!
Me too, it's extremely slow on relatively large datasets. A CUDA implementation and/or an n_jobs option would be great.
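To illustrate the kind of speedup being asked for, here is a minimal sketch of a vectorized pairwise Euclidean distance in NumPy next to a plain Python double loop. This is not smogn's code: the function names are hypothetical, it assumes every feature is numeric, and it does not cover whatever handling smogn applies to nominal columns.

```python
import numpy as np

def pairwise_euclidean(X):
    """All-pairs Euclidean distances via |a|^2 + |b|^2 - 2 a.b (vectorized)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)   # clip tiny negatives caused by rounding
    np.fill_diagonal(d2, 0.0)     # distance from a row to itself is exactly 0
    return np.sqrt(d2)

def pairwise_euclidean_loop(X):
    """Same result with a Python double loop, for comparison only."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
    return D
```

The vectorized version builds the full n × n matrix in memory, which for 5671 rows is roughly 257 MB of float64 values, so very large rare subsets would need chunking.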
I think I have a potential solution for this problem, and it MIGHT work for you:
My problem was that I was using the default settings without specifying any parameters.
Here is my previous code, which was extremely slow:
dataframe_oversampled = smogn.smoter(data=dataframe, y='TARGET_VARIABLE')
However, once I started tinkering with the parameters, it became roughly 12 times faster: code that used to take 6 hours took only 30 minutes!
Here is how I changed my code. I hope similar tinkering will help you too.
PS: In my project I wrote a separate function to handle all missing data because I had special cases, so `drop_na_col` and `drop_na_row` in these parameters are just for good measure.

```python
import smogn

# Apply SMOGN to balance the dataset.
# `dataframe` (pandas DataFrame) and `rg_mtrx` (control points) must be defined beforehand.
dataframe_oversampled = smogn.smoter(
    data=dataframe,
    y='TARGET_VARIABLE',
    k=9,                      ## positive integer (k < n)
    pert=0.04,                ## real number (0 < R < 1)
    samp_method='balance',    ## string ('balance' or 'extreme')
    drop_na_col=True,         ## boolean (True or False)
    drop_na_row=True,         ## boolean (True or False)
    replace=False,            ## boolean (True or False)

    ## phi relevance arguments
    rel_thres=0.10,           ## real number (0 < R < 1)
    rel_method='manual',      ## string ('auto' or 'manual')
    # rel_xtrm_type='both',   ## unused (rel_method = 'manual')
    # rel_coef=1.50,          ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg=rg_mtrx,  ## 2d array (format: [x, y])
)
```
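The call above passes `rg_mtrx` without showing how it was built. Below is a hedged sketch of one way to produce such a matrix. The helper `make_ctrl_pts` and its quantile defaults are hypothetical, not part of smogn, and the row layout `[x, relevance, slope]` is an assumption based on the examples shipped with the package, so check it against your installed version.

```python
import numpy as np

def make_ctrl_pts(y, common_q=0.50, rare_q=0.95):
    """Hypothetical helper: build manual relevance control points.

    Values near the `common_q` quantile get relevance 0 (treated as common),
    values at the `rare_q` quantile get relevance 1 (treated as rare).
    Each row is assumed to be [x, relevance, slope].
    """
    y = np.asarray(y, dtype=float)
    common_x = float(np.quantile(y, common_q))
    rare_x = float(np.quantile(y, rare_q))
    return [
        [common_x, 0, 0],  # common region, candidate for under-sampling
        [rare_x, 1, 0],    # rare region, candidate for over-sampling
    ]

# Toy target column with values 0..100; replace with dataframe['TARGET_VARIABLE'].
rg_mtrx = make_ctrl_pts(np.arange(101))
print(rg_mtrx)
```

The idea, assuming the runtime is dominated by the distance matrix over the rows flagged as rare, is that placing the control points so fewer rows count as rare should shrink that matrix. I have not measured this, so treat it as something to try on a subsample first.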