
Minimum and maximum value arguments (constraints)

Open ThirstyGeo opened this issue 4 years ago • 15 comments

I'm working with Dirichlet distributions and the compositional data simplex, and am really enjoying MIDASpy's flexibility when dealing with this data (related to K-L divergence in the decoder). However, there is a tendency to produce negative values in the numerical feature data I have been using.

In the case of compositional data, there is a constraint of zero as a minimum value. Other imputation approaches allow setting maximum and minimum value arguments (e.g., Scikit-Learn) and importantly these can be set per feature (autoimpute). Is this an argument which could be added to the package? It would be a major help to people working in several disciplines.
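
Until such an argument exists, one stopgap is to apply the bounds after imputation. The following is only a minimal sketch, assuming the imputed datasets are available as pandas DataFrames; `clip_imputations`, the `bounds` dictionary and the column names are hypothetical and not part of MIDASpy.

```python
import pandas as pd

def clip_imputations(df, bounds):
    """Return a copy of df with each listed column clipped to (lower, upper).

    `bounds` maps column name -> (lower, upper); either bound may be None.
    This is a hypothetical helper, not a MIDASpy function.
    """
    out = df.copy()
    for col, (lower, upper) in bounds.items():
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out

# Toy stand-in for one imputed dataset (values are made up for illustration).
imputed = pd.DataFrame({"sio2": [0.61, -0.02, 0.55], "mgo": [0.39, 1.02, 0.45]})
bounded = clip_imputations(imputed, {"sio2": (0.0, 1.0), "mgo": (0.0, 1.0)})
```

Clipping after the fact does not feed the constraint back into training, and for compositional data the clipped rows may no longer sum to one, so this is a workaround rather than a substitute for the requested feature.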

ThirstyGeo avatar Feb 18 '21 04:02 ThirstyGeo

Thanks @ThirstyGeo for raising this issue -- completely agree that it would be a really useful feature. The best way to implement this is probably to allow users to change the activation functions for specific output nodes in the network -- then the model will incorporate this range trimming within training itself.
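
To illustrate the general idea (in plain NumPy, making no claims about MIDASpy's internals or its eventual API), an output activation can map an unbounded pre-activation into the permitted range. The function `bounded_output` below is a hypothetical name used only for this sketch: softplus enforces a one-sided bound, and a scaled sigmoid enforces a two-sided one.

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), computed in a numerically stable way.
    return np.logaddexp(0.0, z)

def sigmoid(z):
    # 1 / (1 + exp(-z)), using the identity sigmoid(z) = exp(-softplus(-z)).
    return np.exp(-softplus(-z))

def bounded_output(z, lower=None, upper=None):
    """Map raw output-node values into [lower, upper] (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    if lower is not None and upper is not None:
        return lower + (upper - lower) * sigmoid(z)
    if lower is not None:
        return lower + softplus(z)
    if upper is not None:
        return upper - softplus(-z)
    return z  # no constraint: identity activation

raw = np.array([-3.0, 0.0, 3.0])
counts = bounded_output(raw, lower=0.0)              # non-negative feature
shares = bounded_output(raw, lower=0.0, upper=1.0)   # feature confined to [0, 1]
```

Because the mapping sits inside the network, gradients flow through it during training, which is the sense in which the range constraint becomes part of the model rather than a post-processing step.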

We will look into this as a priority, and if you had any further suggestions/pull requests they'd be greatly received.

tsrobinson avatar Feb 18 '21 11:02 tsrobinson

That's great @tsrobinson! Much appreciated that you're focusing on this. I'll think a bit more through the typical workflows and see if I can create an example which represents a typical situation. If you like it, it could be something for the package's examples/tutorials.

ThirstyGeo avatar Feb 18 '21 15:02 ThirstyGeo

As a tangent of interest: few research articles address imputation of data in the compositional data simplex. The best one I'm aware of for deep-learning-oriented imputation of compositional data deals with the specific case of 'censored zeroes', i.e., values which are below the analytical detection limit and above zero (the only information usually given is that the values are below a certain threshold). The article focusses on ANNs and on feature pre-processing (using log-ratio transformations on the features to move them out of the simplex and into Euclidean space).
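
For readers unfamiliar with log-ratio transformations, here is a minimal sketch of one common variant, the centred log-ratio (clr), and its inverse. Which transform the article uses is not stated here, so clr is an assumption chosen for illustration; the function names and the toy compositions are also made up.

```python
import numpy as np

def clr(x):
    """Centred log-ratio: log of each part minus the row mean of the logs.

    Requires strictly positive parts, so zeros must be handled beforehand.
    """
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)

def clr_inverse(y):
    """Map clr coordinates back to the simplex (a softmax over each row)."""
    shifted = y - y.max(axis=-1, keepdims=True)  # shift for numerical stability
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)

# Two toy compositions, each summing to one.
composition = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
transformed = clr(composition)        # unconstrained real-valued coordinates
recovered = clr_inverse(transformed)  # back on the simplex
```

The logarithm is undefined at zero, which is why censored zeroes need special treatment before a transform like this can be applied.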

The autoencoder approach of MIDASpy has the significant potential advantages of (1) allowing mixed data types, (2) not requiring a pre-processing step, (3) producing multiple realisations and therefore a measure of confidence for imputed values. Very exciting!

ThirstyGeo avatar Feb 19 '21 19:02 ThirstyGeo

Really interesting - thanks @ThirstyGeo for letting us know about this research.

ranjitlall avatar Feb 19 '21 21:02 ranjitlall

Hello and thank you for this great package. I wanted to inquire whether you have had any progress on this issue? We have a data set with a lot of count data variables, and many of them get imputed with negative values, which isn't ideal. Hence, our interest :)

geraldine28 avatar Jul 23 '21 13:07 geraldine28

Any news on this? Or maybe a small idea on how or where this would fit best in the code if I were to toy around with it myself? :)

kblnig avatar Feb 18 '23 13:02 kblnig

Hi @geraldine28 @kblnig, we are looking into this now and will get back to you shortly. Sorry about the delay!

ranjitlall avatar Feb 23 '23 17:02 ranjitlall

@ranjitlall - really looking forward to this :) !!!!

kblnig avatar Mar 02 '23 21:03 kblnig

Echoing others' enthusiasm, I'm also wondering if there's any news on this feature.

martin18d avatar Apr 23 '24 21:04 martin18d

Looking forward to this feature!

AuSpotter avatar Apr 23 '24 22:04 AuSpotter

Thanks everyone for your interest! I can confirm this is now under development, and will update you asap when this functionality is ready for release.

tsrobinson avatar Apr 27 '24 10:04 tsrobinson