smogn icon indicating copy to clipboard operation
smogn copied to clipboard

Kulik d proj patch 1

Open KulikDM opened this issue 2 years ago • 5 comments

Deleted dist_metrics.py as updated metric calculations will be done internally in over_sampling.py. Deleted old versions of smoter.py and over_sampling.py to be replaced.

New smoter.py has added arguments for over_sampling.py ("metric" and "metric_args"). These pertain to the new implementation of native distance calculations now done in over_sampling.py.

New over_sampling.py can now do all distance calculations natively and does them almost in real time compared to the long run times due to the local python for loops used in over_sampling.py and dist_metrics.py. The "metric" argument allows for different metrics to be used with metrics being implemented from scipy.spatial.distance for numerical and categorical data. And two metrics exist for heterogeneous data HEOM.py and HVDM.py that have been taken from the distython repo (https://github.com/KacperKubara/distython). The "metric_args" can be included as a dict for additional scipy.spatial.distance metric functions arguments.

Comparison of previous and updated runs on the included data "housing.csv" as well as private numerical and catagorical data show that all the different metrics are comparable to the previous versions done in smogn within marginal differences (between 1e-12 - 1e-14) caused by different floating point errors in non-native functions.

Note scipy.spatial.distance is now a necessary import in over_sampling.py while HEOM.py and HVDM.py are included in the repository's smogn directory.

KulikDM avatar Jan 26 '22 14:01 KulikDM

Thank you for opening this PR. I will make time to review this more carefully in short time. :)

nickkunz avatar Jan 26 '22 19:01 nickkunz

Thanks, excellent! Your package has been a real help with a current project but I needed it to get results faster. This was the outcome; hope it works out.

KulikDM avatar Jan 26 '22 20:01 KulikDM

@KulikD-proj great work. Thank you for opening this PR. I hoped to not introduce additional dependencies. Is there a convenient way to achieve these performance improvements without new libraries?

nickkunz avatar Apr 12 '22 05:04 nickkunz

@nickkunz, thank you. It can be implemented without additional libraries but it will take some effort (there are 23 distance metrics that will all have to be rewritten) and the code won't be as simple or neat as using Scipy. Scipy is also a pretty fundamental library to have, anyone that is using smogn is bound to have it already installed. But it is up to you...

KulikDM avatar Apr 13 '22 17:04 KulikDM

@KulikDM Rewriting the distance metrics seems inconvenient and a poor use of time. I'm going to review how these are implemented in SciPy.

nickkunz avatar Apr 13 '22 23:04 nickkunz

@nickkunz any new updates on this PR?

KulikDM avatar Aug 17 '22 20:08 KulikDM

@KulikDM Thank you for checking on this. I had previously approved changes to implement the random seed parameter. This PR now has a merge conflict that needs to be resolved.

nickkunz avatar Aug 17 '22 20:08 nickkunz