smogn time complexity

Open Ahmedkoptan opened this issue 5 years ago • 7 comments

I am running the smogn.smoter function on a dataset of shape [77955, 4], where the 4 columns include the target variable Y. Y is a continuous random variable with a skewed distribution: it covers the range [0, 1.8], and 55000 of the training instances lie in the range [0, 0.07].

However, the function has been running for a long time without finishing, so I was wondering: what is the time complexity of the algorithm?

Ahmedkoptan avatar Dec 31 '19 17:12 Ahmedkoptan

I was wondering exactly the same thing just now. I am trying to use the algorithm on a dataset of shape (327669, 37), and it seems that the execution will never end. I had a look at the code, and the cause is probably the number of "for" loops it contains. I did not look at the code in depth, but I think something can be done to improve the speed of the algorithm.

informatica92 avatar Feb 26 '20 15:02 informatica92
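To put the dataset shapes reported in this thread in perspective, here is a rough back-of-the-envelope sketch. It assumes the runtime is dominated by a brute-force pairwise distance computation over the rows, which is only a guess based on the nested "for" loops mentioned above and is not confirmed anywhere in this thread. The function names are hypothetical and are not part of smogn.

```python
# Rough cost model: assumes (unconfirmed) that the work is dominated by
# computing a distance between every unordered pair of rows.

def pairwise_distance_count(n_rows: int) -> int:
    """Number of unordered row pairs: n * (n - 1) / 2."""
    return n_rows * (n_rows - 1) // 2


def estimated_scalar_ops(n_rows: int, n_columns: int) -> int:
    """Pairs multiplied by columns, as a crude proxy for total work."""
    return pairwise_distance_count(n_rows) * n_columns


# Shapes taken from the comments in this thread (rows, columns).
reported_shapes = [(77955, 4), (327669, 37)]

for n_rows, n_columns in reported_shapes:
    pairs = pairwise_distance_count(n_rows)
    ops = estimated_scalar_ops(n_rows, n_columns)
    print(f"{n_rows} x {n_columns}: {pairs:.3e} pairs, ~{ops:.3e} scalar ops")
```

Under that assumption the work grows quadratically with the number of rows, so a dataset with four times as many rows would need roughly sixteen times as many distance evaluations. In practice only a subset of rows may be involved, so treat this as an upper-bound style estimate, not a measurement.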

I apologize for the delay. I have not recently been able to provide support for the package. However, I do plan on releasing new improvements, which will increase the algorithm's performance, usability, and flexibility.

I have considered your feedback and would like to thank you for providing it. In the meantime, I have included a progress bar to provide some indication of the time to task completion (version 0.1.1). Please let me know if there is anything else that I could do to improve the package for you.

Thank you again for using this Python implementation of SMOGN. I'm happy to hear that it is useful and is being utilized.

nickkunz avatar Mar 24 '20 03:03 nickkunz

Hi Nick,

I have tried to use your library on a dataset of 30K × 59, but even the basic code takes almost a day. How can I make it faster?

mhnbitece avatar Oct 17 '20 05:10 mhnbitece

I had the same issue. Standardizing the output sped up the process by a factor of ~100.

astrogilda avatar Oct 17 '20 07:10 astrogilda
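The comment above does not say exactly how the standardization was done. As a minimal sketch, assuming "the output" means the target column, one way to z-score the target before resampling and map it back afterwards is shown below. The helper names `standardize` and `unstandardize` are hypothetical, and the ~100x speedup is the commenter's report, not something this snippet demonstrates.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scored values plus the mean and std needed to invert them."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values], mu, sigma


def unstandardize(z_values, mu, sigma):
    """Map z-scored values back to the original scale."""
    return [z * sigma + mu for z in z_values]


# Toy skewed target, loosely shaped like the one described in the issue.
y = [0.01, 0.02, 0.05, 0.07, 1.8]
y_std, mu, sigma = standardize(y)

# Assumption: the standardized target would replace the original column in
# the DataFrame passed to smogn.smoter(data=..., y="Y"), and the resampled
# target would be converted back with unstandardize() afterwards.
y_back = unstandardize(y_std, mu, sigma)
```

Keeping `mu` and `sigma` around matters because any synthetic target values generated on the standardized scale have to be transformed back before they are used for training or evaluation.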

Do we need to standardize the input variables as well as the target variable?

mhnbitece avatar Oct 17 '20 11:10 mhnbitece

I am facing the same issue. Execution seems never-ending for 5 lakh (500,000) records.

snigdhasen avatar Apr 29 '21 13:04 snigdhasen

I'm having the same limitation. This code takes a very long time to execute. I'm trying to use it for some cross-validation work, which means putting it in the pipeline and running it around 10 times. It ends up being unmanageably long.

pavelkomarov avatar Jul 13 '21 20:07 pavelkomarov