feature_engine icon indicating copy to clipboard operation
feature_engine copied to clipboard

(In progress) Backoff binning rare labels in CountFrequency encoder

Open SangamSwadiK opened this issue 2 years ago • 6 comments

Reference Issues/PRs: Issue : #429 (Add functionality for threshold)

What does this implement/fix? Adds functionality to group categories with few observations.

Hi, @solegalli I've added functionality for Backoff binning, threshold is a list that takes values(threshold) for each feature mentioned and bins them accordingly. I haven't added the tests yet but will add them.

Please let me know any improvements that you think could be done here. I'll implement them.

SangamSwadiK avatar Jun 15 '22 15:06 SangamSwadiK

Isn't this what RareLabelEncoder is for? If I understand it correctly, this would have the same effect as putting both of them in scikit-learn pipeline - or am I missing some detail?

david-cortes avatar Jun 22 '22 17:06 david-cortes

Isn't this what RareLabelEncoder is for? If I understand it correctly, this would have the same effect as putting both of them in scikit-learn pipeline - or am I missing some detail?

yep, it is.

solegalli avatar Jun 22 '22 17:06 solegalli

@solegalli I've implemented the changes, I've added transformer and inverse transformer(no functionality) and tests. Please let me know if any changes are to be made.

Thanks.

SangamSwadiK avatar Jun 23 '22 16:06 SangamSwadiK

Hi @solegalli , please find the checklist of suggestions implemented.(made a checklist since the number of changes were more)

  • [x] Reverted back transform docstring to base_encoder's transform docstring.
  • [x] Remove threshold doc from fit().
  • [x] Redo fit() with original logic as per suggestion, reduced duplication of code.
  • [x] Redo transform().
  • [x] Preserve original layout for fit(),
  • [ ] Could not preserve original layout for raising value errors because of flake8 lint errors
  • [x] Add alternative explanation for backoff binning to make it easily understandable.

SangamSwadiK avatar Jun 25 '22 15:06 SangamSwadiK

Hi @SangamSwadiK

@david-cortes is right that the logic we are trying to implement here is already done by another transformer, the RareLabelEncoder.

So I am having second thoughts as to whether it is a good idea to make the class more complicated by adding additional logic that is already taken care of elsewhere.

Would you mind if we put this PR on hold and see what the community thinks about this?

Sorry, I know it is frustrating to work on something that does not go through eventually :_(

solegalli avatar Jul 05 '22 13:07 solegalli

Hi @SangamSwadiK

@david-cortes is right that the logic we are trying to implement here is already done by another transformer, the RareLabelEncoder.

So I am having second thoughts as to whether it is a good idea to make the class more complicated by adding additional logic that is already taken care of elsewhere.

Would you mind if we put this PR on hold and see what the community thinks about this?

Sorry, I know it is frustrating to work on something that does not go through eventually :_(

yes, its similar to Rarelabel encoder, meanwhile ill take a look at other issues. Thanks !

SangamSwadiK avatar Jul 05 '22 14:07 SangamSwadiK