Standardize how we access unique target values for classification problems

Open angela97lin opened this issue 3 years ago • 2 comments

Right now, we access unique target values for classification problems in several ways:

  1. list(ww.init_series(np.unique(y))) (classification_pipeline.py)
  2. unique_labels (confusion_matrix)
  3. LabelBinarizer / np.unique in roc_curve (slightly different than label encoding)

It could be helpful to standardize how we encode and decode targets before and after fit time. This issue tracks finding the places where we encode/decode targets and working out how to standardize that process.

Note that in some cases, we might encode/decode outside of the context of a pipeline (such as confusion_matrix), but it could still be helpful to consolidate our implementation to fewer methods if possible!
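As a rough illustration of what a consolidated implementation might look like, here is a minimal sketch of one shared helper for unique target values plus a matching encode/decode pair. The function names (`unique_target_values`, `encode_targets`, `decode_targets`) are hypothetical and are not part of evalml; the sketch assumes only NumPy and relies on `np.unique` returning sorted unique values.

```python
import numpy as np


def unique_target_values(y):
    """Hypothetical helper: sorted unique target values as a plain list."""
    # np.unique returns the sorted unique elements of the input.
    return np.unique(np.asarray(y)).tolist()


def encode_targets(y, classes):
    """Hypothetical helper: map each label to its index in `classes`."""
    mapping = {label: index for index, label in enumerate(classes)}
    return np.array([mapping[label] for label in y])


def decode_targets(encoded, classes):
    """Hypothetical helper: map integer codes back to the original labels."""
    return [classes[index] for index in encoded]


y = ["cat", "dog", "cat", "bird"]
classes = unique_target_values(y)           # single source of truth for label order
encoded = encode_targets(y, classes)        # integer codes for use at fit time
decoded = decode_targets(encoded, classes)  # original labels recovered after predict
```

Because the helpers do not depend on a pipeline object, the same functions could in principle be called from pipeline code and from standalone utilities such as confusion_matrix, so both would agree on label order.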

angela97lin avatar Dec 02 '21 20:12 angela97lin

@angela97lin this is an excellent story. Whenever you see things like this, feel free to keep submitting them. Is it at all worth getting people together to figure out where other occurrences happen? Or do you feel like you've captured most of them here?

chukarsten avatar Dec 08 '21 20:12 chukarsten

@chukarsten

It could be worth a quick discussion in case I've missed any, but otherwise I think this covers a fair bit of ground to tackle. I think it's also interesting to think about how we could prevent these implementations from diverging again; the different code snippets were probably added by different people at different times. Having a discussion could be helpful not only to see if we've missed any here, but also to bring more awareness to the team, so that anyone who needs this capability in the future knows what to do!

angela97lin avatar Dec 11 '21 01:12 angela97lin