
It's taking more than 20h to sample the data

Open mhnbitece opened this issue 4 years ago • 5 comments

Hi Nick,

I am seeing a huge runtime for my input data, which is 28K × 59. It has been running for more than a day. I have even standardized the input data. Is there any possible solution?

dist_matrix: 5%|4 | 276/5671 [50:48<16:50:38, 11.24s/it]

mhnbitece, Oct 18 '20 09:10
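
For context, the progress line quoted in the report already implies the total: 5671 iterations at 11.24 s each comes to roughly 17.7 hours for the `dist_matrix` step alone. A minimal sketch of that arithmetic, assuming the line is standard tqdm-style output in `s/it` form (the helper name `estimate_total_seconds` is made up for illustration):

```python
import re


def estimate_total_seconds(progress_line):
    """Estimate total runtime as (total iterations) x (seconds per iteration).

    Assumes a tqdm-style line such as '276/5671 [50:48<16:50:38, 11.24s/it]'.
    """
    match = re.search(r"(\d+)/(\d+)\s*\[.*?([\d.]+)s/it\]", progress_line)
    if match is None:
        raise ValueError("not a tqdm-style progress line reporting s/it")
    total_iterations = int(match.group(2))
    seconds_per_iteration = float(match.group(3))
    return total_iterations * seconds_per_iteration


line = "dist_matrix: 5%|4 | 276/5671 [50:48<16:50:38, 11.24s/it]"
total = estimate_total_seconds(line)
print(f"{total / 3600:.1f} hours")  # prints "17.7 hours"
```

This agrees with the bar's own figures (50:48 elapsed plus 16:50:38 remaining), so the estimate is only as good as the assumption that the per-iteration time stays constant.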

same issue here

maxiuw, Nov 25 '20 14:11

Same issue here. It's taking more than a day for a dataset of 5 lakh (500,000) × 32.

snigdhasen, Apr 29 '21 13:04

same issue!

luna57-lr, Nov 16 '21 06:11

Me too; it's extremely slow on relatively large datasets. A CUDA implementation and/or an `n_jobs` option would be great.

naeemmrz, Jan 22 '22 18:01

I think I have a potential solution for this problem, and it MIGHT work for you:

My problem was that I was using the default settings without specifying any parameters.

Here is my previous code, which was extremely slow:

```python
dataframe_oversampled = smogn.smoter(
    data=dataframe,
    y='TARGET_VARIABLE',
)
```

However, the moment I started tinkering with the parameters, it somehow got 15 times faster: code that used to take me 6 hours took only 30 minutes!

Here is how I changed my code; I hope similar tinkering will help you too.

PS: In my project I wrote a special function to handle all missing data because I had special cases, so `drop_na_col` and `drop_na_row` in these parameters are just for good measure.

```python
import smogn

# Apply SMOGN to balance the dataset.
# Note: `dataframe` and `rg_mtrx` (the relevance control points) must already be defined.
dataframe_oversampled = smogn.smoter(
    data=dataframe,
    y='TARGET_VARIABLE',
    k=9,                    ## positive integer (k < n)
    pert=0.04,              ## real number (0 < R < 1)
    samp_method='balance',  ## string ('balance' or 'extreme')
    drop_na_col=True,       ## boolean (True or False)
    drop_na_row=True,       ## boolean (True or False)
    replace=False,          ## boolean (True or False)

    ## phi relevance arguments
    rel_thres=0.10,         ## real number (0 < R < 1)
    rel_method='manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg=rg_mtrx ## 2d array (format: [x, y])
)
```

MouadEt-tali, Oct 14 '23 16:10
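
Since the speed-up reported above came from trial and error, it may be worth timing each configuration on a small subsample before committing to a multi-hour run. The sketch below is one way to do that with the standard library only. The helpers `time_call` and `compare_configs` and the stand-in workload `fake_resample` are hypothetical names introduced here, not part of smogn, and the commented-out lines assume `dataframe` and `rg_mtrx` exist as in the comment above.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def compare_configs(fn, configs):
    """Time fn once per named keyword configuration; return {name: seconds}."""
    timings = {}
    for name, kwargs in configs.items():
        _, timings[name] = time_call(fn, **kwargs)
    return timings


# Stand-in workload so the sketch runs without smogn installed (hypothetical).
def fake_resample(k=5, **_ignored):
    return sum(i * i for i in range(k * 10_000))


timings = compare_configs(fake_resample, {"default": {}, "tuned": {"k": 9}})
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")

# With smogn, the same helper could be pointed at a subsample, for example:
# sample = dataframe.sample(n=2000, random_state=0).reset_index(drop=True)
# timings = compare_configs(smogn.smoter, {
#     "default": dict(data=sample, y='TARGET_VARIABLE'),
#     "tuned": dict(data=sample, y='TARGET_VARIABLE', k=9, rel_method='manual',
#                   rel_thres=0.10, rel_ctrl_pts_rg=rg_mtrx),
# })
```

Timings on a subsample will not scale linearly to the full dataset, so treat them as a relative comparison between configurations rather than a runtime forecast.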