sklearn-pandas

Pandas In, Pandas Out? `.inverse_transform()` method

Open naught101 opened this issue 9 years ago • 31 comments

It would be really nice to have the ability to put pandas dataframes into sklearn pipelines, and to have equivalent pandas dataframes returned afterwards. I think that this module would be the place for that - probably all that would be required is a .inverse_transform method on the DataFrameMapper.

Would something like this be wanted in this module? I can make a pull request, if so.

Before I do, why is all the code in __init__.py? Seems like it'll get hard to maintain after a while...

naught101 avatar Oct 22 '15 05:10 naught101

Hi Naught101!

You can already put pandas dataframes into sklearn pipelines. Just create a pipeline where the first step is the DataFrameMapper.

Regarding the proposal "to have equivalent dataframes returned afterwards", do you mean making the pipeline return a pandas DataFrame? Sklearn pipelines usually return numpy arrays containing either classification probabilities for each class (predict_proba), class predictions, or regression values. How could you inverse transform that with the initial DataFrameMapper? The output and the input have different shapes and call for different transforms.

I believe you can do the indexing you proposed at https://github.com/scikit-learn/scikit-learn/issues/5523#issuecomment-150123228 by wrapping the numpy array output in a DataFrame, passing as index the one from the original DataFrame that went into the pipeline. Am I wrong?

Regarding the reason why all the code is in __init__.py, I guess it is because it was a very small module at first and it didn't make a lot of sense to scatter the code across multiple files, although clearly we would need to go that way for clarity if the codebase grows.

One issue we have is that the original maintainer of the package (paulgb) is no longer working on it at all, and the second maintainer (Cal Paterson) has been quite unresponsive in the last few months as well. So it's becoming hard to get new code into this repo, and harder to get it into a release. :(

dukebody avatar Oct 22 '15 07:10 dukebody
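
A minimal sketch of the wrapping suggested in the comment above, assuming a plain scikit-learn regressor stands in for the pipeline. The column names, index labels, and data are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with a non-default index (illustrative values only).
df = pd.DataFrame(
    {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.0, 1.0, 0.0, 1.0], "y": [2.0, 5.0, 6.0, 9.0]},
    index=["a", "b", "c", "d"],
)

model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
pred = model.predict(df[["x1", "x2"]])  # plain numpy array, the index is lost

# Re-attach the index of the DataFrame that went into the model.
pred_df = pd.DataFrame({"y_pred": pred}, index=df.index)
```

The predictions line up with the input rows because scikit-learn estimators preserve row order.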

Aha.. I wasn't thinking clearly, but now I can: DataFrameMappers can also be useful for generating the y value passed to a fit method. The inverse_transform would then be useful to get back a suitable dataframe. But yes, this would be a different DataFrameMapper to the one used for X.

I guess that would all happen outside the pipeline though..

Has anyone working on the code asked @paulgb for push access?

naught101 avatar Oct 22 '15 08:10 naught101
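
One way to picture the y-side use case from the comment above: encode the target before fitting, then use inverse_transform to turn predictions back into labels. This is only a sketch done by hand outside any pipeline, with made-up data; many scikit-learn classifiers also accept string labels directly, so the explicit encoder is here purely to show the round trip.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; "size" plays the role of the y column.
df = pd.DataFrame(
    {"height": [150, 160, 180, 190], "size": ["small", "small", "large", "large"]},
    index=[10, 11, 12, 13],
)

y_encoder = LabelEncoder()
y = y_encoder.fit_transform(df["size"])  # strings -> integer codes

clf = DecisionTreeClassifier(random_state=0).fit(df[["height"]], y)
codes = clf.predict(df[["height"]])  # numpy array of integer codes

# Map the codes back to the original labels and restore the index.
labels = pd.Series(y_encoder.inverse_transform(codes), index=df.index, name="size")
```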

@calpaterson got write access to this repo, but he's not answering my emails. :S

dukebody avatar Oct 22 '15 08:10 dukebody

Hrm. Is there any reason you couldn't expand the current behaviour to also map the y dataframe? e.g. the call would be mapper = DataFrameMapper(X_features = [(blah...)], y_features = [(blergh)]), and then .fit(), .transform() and .predict() all call whichever transforms are relevant on X and/or y.

naught101 avatar Oct 22 '15 12:10 naught101

Sounds reasonable. Could you come up with some examples where this y transformation would be useful?

dukebody avatar Oct 23 '15 09:10 dukebody

@naught101 I have write access to this repo now, so we can work this out if you come up with useful use cases. :)

dukebody avatar Nov 02 '15 08:11 dukebody

@naught101 you might want something similar to what is discussed in https://github.com/paulgb/sklearn-pandas/issues/13 ?

dukebody avatar Nov 02 '15 09:11 dukebody

Yeah, I suspect that #13 is a prerequisite for this issue..

naught101 avatar Nov 03 '15 00:11 naught101

Say the transformed dataframe has exactly the same shape as the dataframe before the transformation. Can we pass in the columns to regenerate the predicted results in DataFrame format?

ethanluoyc avatar Nov 08 '15 03:11 ethanluoyc
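
When the output has the same shape as the input, rebuilding a DataFrame by hand only needs the original column names and index. A short sketch, assuming StandardScaler as a stand-in for the shape-preserving transformation; the data is invented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {"points": [10.0, 20.0, 30.0], "rebounds": [1.0, 5.0, 9.0]},
    index=["p1", "p2", "p3"],
)

scaled = StandardScaler().fit_transform(df)  # numpy array, same shape as df

# Reuse the original labels, since every output column maps to one input column.
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
```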

@ethanluoyc Could you provide a code example of how that feature would work? Not the implementation, but how one would use it.

dukebody avatar Nov 08 '15 13:11 dukebody

I am working on something about basketball, so I will give an example from that. Say I have this dataframe:

[screenshot 2015-11-08 22 02 01]

After the conversion I get something like this:

[screenshot 2015-11-08 22 03 52]

The conversion basically substitutes the keywords (the names) based on their position in a text string, for example:

"Jumpball: (Zydrunas Ilgauskas)\PN vs. (Kendrick Perkins)\PN ((Mo Williams)\PN gains possession)"

So the two dataframes actually have the same shape. I don't know whether I can do such an inverse transformation.

I checked out #13 and I think the approach can work. However, while going through the sklearn documentation I stumbled over the active_features_ attribute, so I decided to look into this in more detail once I figure out what the active_features_ attribute does.

ethanluoyc avatar Nov 08 '15 14:11 ethanluoyc

I believe we can do the inverse transformation if:

* We track which array columns correspond to each dataframe column.
* Every transformer used has an inverse_transform(X) method.

It shouldn't be too hard to do. Any takers? :)

dukebody avatar Nov 08 '15 16:11 dukebody
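
The second condition above can be checked up front. A small sketch, assuming a helper named is_invertible (hypothetical, not part of sklearn-pandas) and an arbitrary selection of scikit-learn preprocessors:

```python
from sklearn.preprocessing import Binarizer, LabelBinarizer, LabelEncoder, StandardScaler


def is_invertible(transformer):
    """Return True if the transformer exposes a callable inverse_transform."""
    return callable(getattr(transformer, "inverse_transform", None))


candidates = [LabelEncoder(), LabelBinarizer(), StandardScaler(), Binarizer()]

# Map each transformer's class name to whether it could be rolled back.
support = {type(t).__name__: is_invertible(t) for t in candidates}
```

A mapper could run such a check at construction time and refuse inverse_transform early if any configured transformer cannot be undone.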

Can sklearn-pandas inverse_transform the transformed data right now?

Yevgnen avatar Mar 03 '17 03:03 Yevgnen

No, it can't right now.

dukebody avatar Mar 03 '17 10:03 dukebody

The last attempt to do this was https://github.com/pandas-dev/sklearn-pandas/pull/56 but it stalled waiting for input from another dev. Perhaps we can pick it up again?

dukebody avatar Aug 20 '17 11:08 dukebody

Am I right that this feature should be something like:

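import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'colA': list('ynyyn'), 'colB': list('abcab')})
mapper = DataFrameMapper([
    ('colA', [LabelEncoder()]),
    ('colB', [LabelEncoder()]),
])
transformed = mapper.fit_transform(df)
# Proposed API: inverse_transform does not exist on DataFrameMapper yet
restored = mapper.inverse_transform(transformed)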

Where transformed will be something like:

np.array([[1, 0],
          [0, 1],
          [1, 2],
          [1, 0],
          [0, 1]])

And, restored is the original dataframe:

colA colB
   y    a
   n    b
   y    c
   y    a
   n    b

So, basically, the DataFrameMapper will be able to "rollback" the result into original dataframe like sklearn transformers do?

devforfu avatar Oct 21 '17 06:10 devforfu

@devforfu yes, this is what I understand.

To do so we need to keep track of which columns correspond to which features in the transformed output, and then run the transformer inverse on each block.

dukebody avatar Oct 22 '17 18:10 dukebody
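
The bookkeeping described above can be sketched as follows. InvertibleColumnMapper is a hypothetical stand-in written for this illustration, not the sklearn-pandas implementation; it assumes exactly one transformer per column and does not restore the original index.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder


class InvertibleColumnMapper:
    """Hypothetical mapper: one transformer per column, with rollback."""

    def __init__(self, features):
        self.features = features  # list of (column name, transformer)

    def fit_transform(self, df):
        blocks, self.slices_, start = [], [], 0
        for column, transformer in self.features:
            block = np.asarray(transformer.fit_transform(df[column]))
            was_1d = block.ndim == 1  # e.g. LabelEncoder returns 1-d output
            if was_1d:
                block = block.reshape(-1, 1)
            # Remember which output columns this input column produced.
            self.slices_.append((slice(start, start + block.shape[1]), was_1d))
            start += block.shape[1]
            blocks.append(block)
        return np.hstack(blocks)

    def inverse_transform(self, array):
        restored = {}
        for (column, transformer), (cols, was_1d) in zip(self.features, self.slices_):
            block = array[:, cols]
            if was_1d:
                block = block.ravel()  # hand back the shape the transformer produced
            restored[column] = transformer.inverse_transform(block)
        return pd.DataFrame(restored)


df = pd.DataFrame({"colA": list("ynyyn"), "colB": list("abcab")})
mapper = InvertibleColumnMapper([
    ("colA", LabelEncoder()),
    ("colB", LabelBinarizer()),
])
transformed = mapper.fit_transform(df)
restored = mapper.inverse_transform(transformed)
```

Storing a slice per input column is what lets a transformer that emits several output columns be inverted as one block.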

Hi all, I've worked on a fork to create a solution for this problem. It passes the test

def test_inverse_transform_multicolumn():
    df = pd.DataFrame({'colA': list('ynyyn'), 'colB': list('abcab'), 'colC': list('sttts')})
    mapper = DataFrameMapper([
        ('colA', LabelEncoder()),
        ('colB', LabelBinarizer()),
        ('colC', LabelEncoder()),
    ])

    transformed = mapper.fit_transform(df)
    restored = mapper.inverse_transform(transformed)

    assert isinstance(restored, pd.DataFrame)    
    assert restored.equals(df)

which includes a LabelBinarizer that generates multiple columns. So far I'm assuming the mapper takes a pandas data frame and outputs a numpy array; I'm not yet dealing with self.input_df.

I'd like to improve this solution. I've now included an extra self.transformed_cols_ to keep track of mapped columns, but that should ideally be integrated with self.transformed_names_. However, I haven't yet checked the implications of modifying the latter, so that's why I've simply added the new attribute for now.

What would be the next steps? I've no idea if somebody else is already working on this, but I'm assuming I'll update my solution, commit it to my fork and then click on 'pull request' in my forked repository on GitHub? Do I need to keep anything else in mind?

erikjandevries avatar Nov 07 '17 16:11 erikjandevries
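
The reason column tracking matters for the test above is that the output width depends on the transformer and on the data. A quick illustration using plain scikit-learn, with the same letter sequences as in the test:

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

values = list("abcab")

encoded = LabelEncoder().fit_transform(values)  # 1-d array of integer codes
binarized = LabelBinarizer().fit_transform(values)  # one indicator column per class

# With only two classes, LabelBinarizer emits a single column instead of two.
two_class = LabelBinarizer().fit_transform(list("ynyyn"))
```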

@erikjandevries I guess you only need to run tox to see if all tests pass. Probably, add a couple more tests to see if your implementation correctly handles other cases, e.g. several transformers, like:

mapper = DataFrameMapper([
    ('colA', [CategoricalImputer(), LabelEncoder()]),
    ('colB', [Imputer(), StandardScaler()]),
    # other transformers
])

Or maybe any other edge cases.

Then, if everything is fine, you could make a pull request and wait for a review from the repo owners, as well as a response from Circle CI, which would show whether your implementation has any issues.

devforfu avatar Nov 12 '17 06:11 devforfu
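
For a column with several chained transformers, the rollback has to undo the steps in reverse order. A minimal sketch with plain scikit-learn; StandardScaler and MinMaxScaler are swapped in here for the imputers from the comment above, because imputation generally cannot be undone.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [4.0], [8.0]])
steps = [StandardScaler(), MinMaxScaler()]

# Forward pass: apply each step to the output of the previous one.
transformed = X
for step in steps:
    transformed = step.fit_transform(transformed)

# Rollback: undo the last step first.
restored = transformed
for step in reversed(steps):
    restored = step.inverse_transform(restored)
```

A test for this case would compare restored against X with a tolerance, since the scalers work in floating point.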

Interested to see if there's been any progress on this issue. It seems like a pretty major limitation not to be able to recover the original data after transformation.

Whamp avatar Jul 06 '18 14:07 Whamp

Is there any issue with @erikjandevries's code here? It looks fine to me but hasn't been accepted.

https://github.com/scikit-learn-contrib/sklearn-pandas/pull/133/commits/1b4edd9e9a7de56a25259b288150d06ece9701fd

Whamp avatar Jul 10 '18 22:07 Whamp

I'm very sorry, I'm busy lately with other stuff in my life and haven't managed to review this... Would any of you be interested in becoming a project admin with merge rights?

dukebody avatar Jul 11 '18 05:07 dukebody

I'm sorry to say I've also been very busy. If I'm not mistaken, the problem with my code was that I created a new variable self.transformed_cols_ where I should have used the existing self.transformed_names_. I did this because I wasn't sure what I might break otherwise, or how to use the transformed names variable. It's been a long time, and I think I found another way around the problem I was dealing with at the time, but perhaps the update could still be useful.

https://github.com/scikit-learn-contrib/sklearn-pandas/pull/133

erikjandevries avatar Jul 11 '18 05:07 erikjandevries

@dukebody I usually track the sklearn_pandas repository changes and pull-requests and use it in my daily tasks so I could work on this if nobody else decides to take this responsibility.

devforfu avatar Jul 11 '18 06:07 devforfu

@devforfu Thanks! I've sent you an invite to become collaborator with write access to this repo, so you can merge stuff. Do you have an account in Pypi so I can give you access to publish new releases there?

dukebody avatar Aug 05 '18 16:08 dukebody

@dukebody Sure, not a problem! Yes, I've created one, the username is devforfu.

devforfu avatar Aug 06 '18 14:08 devforfu

@devforfu Added you to pypi. I guess you should have received some kind of notification about it.

Can you take care of managing the next release after working through the existing PRs?

dukebody avatar Aug 15 '18 12:08 dukebody

@dukebody Yes, the notification was received.

OK, sure, will do as soon as I finalize the pending changes.

devforfu avatar Aug 15 '18 14:08 devforfu

Hello guys. Any update on this issue?

AlanGanem avatar May 01 '20 19:05 AlanGanem

I am joining @AlanGanem: is there any update? I can see some updates in #133 and #182, but it's already been more than 1 year and nothing has been approved and merged.

sxooler avatar May 13 '20 08:05 sxooler