
Create new features from all categorical feature combinations

Open Sandy4321 opened this issue 5 years ago • 18 comments

Is your feature request related to a problem? Please describe. If we have categorical features, how can we create new features from all combinatoric combinations of those features? In real life, categorical features are NOT independent; many of them depend on each other.

Even scikit-learn cannot do this; could you?

Related to https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook/issues/1. Describe the solution you'd like: for example, a maximum number of combined features is given, say 2, 4, or 5.

For a pandas DataFrame you can use concatenation: https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

So for three-feature combinations from 11 features, the combinatoric enumeration with 3 nested loops is not a good approach:

for i in range(11):
    for j in range(i + 1, 11):
        for k in range(j + 1, 11):
            ...

You would get 165 new features from all combinations (not permutations), since C(11, 3) = 165; that is already a lot of new features.
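For illustration, a sketch of the same enumeration without nested loops, using itertools (the 11 column names below are made up):

from itertools import combinations

# Hypothetical list of 11 categorical column names.
cat_cols = [f"cat_{i}" for i in range(11)]

# All 3-column subsets: combinations (order-independent), not permutations.
triples = list(combinations(cat_cols, 3))
print(len(triples))  # C(11, 3) = 165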

" Another alternative that I've seen from some Kaggle masters is to join the categories in 2 different variables, into a new categorical variable, so for example, if you have the variable gender, with the values female and male, for observations 1 and 2, and the variable colour with the value blue and green for observations 1 and 2 respectively, you could create a 3rd categorical variable called gender-colour, with the values female-blue for observation 1 and male-green for observation 2. Then you would have to apply the encoding methods from section 3 to this new variable ."


Yes, do this, but it should not be necessary to do it by hand in pandas. You also need to think about RAM use, since there will be a lot of new features. Before creating the new features, think about converting the categorical features to small numpy "int" types with a small number of digits (e.g. int8).

Sandy4321 avatar Jul 29 '20 15:07 Sandy4321

You can do it like this:

df['comb_cat_feature'] = df['cat_feature_1'].astype(str).str.cat(df['cat_feature_2'].astype(str), sep='_')

If you want all possible combos, then it will be something like this:

list(chain.from_iterable(combinations(s, r) for r in range(2, len(s) + 1)))  # where s is the list of cat columns

and then iterate over this list and use the str.cat method from pandas.

But honestly, there is no need for this. CatBoost, for example, can do this on the fly (combine and encode categorical variables), but it has a lot of parameters to control the depth of the features and the memory usage, and by default it does not go deep. If you increase max_ctr_complexity you will almost certainly get a combinatorial explosion and, as a result, OOM. Also, if a categorical variable has high cardinality, then you won't be able to save RAM at all. For example, if you have more than 256 unique values, then you won't be able to label encode them and downcast to int8 to save memory without losing information.
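For completeness, a runnable version of these snippets (the column names are hypothetical):

from itertools import chain, combinations

import pandas as pd

df = pd.DataFrame({
    "cat_feature_1": ["a", "b"],
    "cat_feature_2": ["x", "y"],
    "cat_feature_3": ["1", "2"],
})

s = ["cat_feature_1", "cat_feature_2", "cat_feature_3"]  # list of cat columns
combos = list(chain.from_iterable(combinations(s, r) for r in range(2, len(s) + 1)))

for cols in combos:
    # join each row's values; columns are converted to str first
    df["_".join(cols)] = df[list(cols)].astype(str).agg("_".join, axis=1)

print(df.columns.tolist())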

glevv avatar Jul 30 '20 13:07 glevv

@Sandy4321 did you try catboost?

solegalli avatar Jul 31 '20 12:07 solegalli

@solegalli It is better to do this explicitly, so you know exactly what is happening. CatBoost has an option for combining, but then it becomes slow, and we also do not know what happens inside CatBoost.

Or do you mean that CatBoost can do it and return the transformed data? As far as I know, they do not return/share the transformed data.
@GLevV Good attempt, but can you share the full code with the data used?

Sandy4321 avatar Aug 02 '20 18:08 Sandy4321

@Sandy4321 A quick question on this discussion. In my experience, I have not seen categories combined the way you mention here when these models are going to be used in an organisation to score real people. This is mostly because these combinations make the new variables a bit difficult to understand.

Why would you be interested in doing so? Could you mention a few examples of its applicability to real life situations?

solegalli avatar Aug 13 '20 17:08 solegalli

For example, scikit-learn creates polynomial features for continuous variables; we need the same for categorical variables.
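For context, a minimal sketch of what scikit-learn already offers for continuous variables; the request is essentially an analogous transformer for categoricals:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# interaction_only=True keeps only the cross terms, analogous to combining categories
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly.fit_transform(X))  # columns: x1, x2, x1*x2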

Sandy4321 avatar Aug 14 '20 12:08 Sandy4321

https://github.com/pierrepita/categorical-data-generator

Sandy4321 avatar Aug 14 '20 12:08 Sandy4321

These combinations of categorical features can be interpreted. For example, you have two categorical variables: gender and job name. Their combination is easily interpretable: female_mle, male_devops, female_ceo and so on. It's like grouping the dataframe by category and then flattening it.

But yes, in practice there is no need for these shenanigans. The only use case is Kaggle, and as I said above, in a competition it is done manually: inspecting every categorical feature, its cardinality and the usefulness of its combos, and only after that writing the code that will transform your dataset.

@Sandy4321 I gave you snippets of the code. You just import the mentioned functions from Python's itertools library and run them on the chosen categorical features. Good luck.

glevv avatar Aug 15 '20 16:08 glevv

@GLevV

Full code is needed.

For example, you have a data frame mydf, and you need to create a new data frame newdf with all possible combinations of the categorical features, as sketched below.
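A minimal, self-contained sketch of the request (mydf and newdf are the hypothetical names from this comment; pairwise combinations only):

from itertools import combinations

import pandas as pd

mydf = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "colour": ["blue", "green", "green"],
    "city": ["paris", "london", "paris"],
})

# Start from the original frame and append one column per pairwise combination.
newdf = mydf.copy()
for a, b in combinations(mydf.columns, 2):
    newdf[f"{a}_{b}"] = newdf[a].astype(str).str.cat(newdf[b].astype(str), sep="_")

print(newdf)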

Sandy4321 avatar Aug 17 '20 13:08 Sandy4321

any news?

Sandy4321 avatar Oct 01 '20 18:10 Sandy4321

Hi @Sandy4321

I agree with @GLevV that the use case for this type of variable combination is mostly data science competitions. And personally, I am not aware of its use in organisations.

So, to prioritise this issue, we would need clear examples of situations, other than data competitions, where this type of variable combination would be used: for example, a finance use case, an insurance use case, or any other use case you are working on with a view to deploying the model. Could you expand on that?

solegalli avatar Oct 02 '20 07:10 solegalli

Hi @thibaultbl, what are your views on this issue?

solegalli avatar Oct 02 '20 11:10 solegalli

I agree with @GLevV; most machine learning algorithms (or at least tree-based ones) do that as part of their inner algorithm.

But it is also empirically true that you can improve your metrics by using this kind of cross features. Like you said, it is mostly useful in data science competitions. Nevertheless, I think it can be useful in an organisation: if you use this kind of automated feature generation together with a good feature selection, you can discover some hidden relations that you didn't think about.

My suggestion would be to use something with more human thinking, to avoid computing every one-to-one cross feature.

columns = [("col1", "col2"), ("col3", "col4")]  # 1 - tuples of columns to combine
for a, b in columns:
    df.loc[:, f"{a}_{b}"] = df.loc[:, a].astype(str) + "_" + df.loc[:, b].astype(str)  # bad example, just to get the idea

thibaultbl avatar Oct 02 '20 19:10 thibaultbl

Thank you! @thibaultbl

solegalli avatar Oct 03 '20 08:10 solegalli

It will look something like this:

from itertools import combinations

class FeatureCombiner:

    def __init__(self, cols, level=2):
        self.cols = cols
        self.level = level  # there should be a limit (like 3)

    def fit(self, X, y=None):  # need X only to check cols availability
        assert all(col in X.columns for col in self.cols)
        # or we can take a list of levels as an argument, maybe it will be better
        self.feature_comb_list = []
        for comb in range(self.level, 1, -1):
            self.feature_comb_list += list(combinations(self.cols, comb))
        return self

    def transform(self, X):
        df = X.copy()
        for combo in self.feature_comb_list:
            # if cols are strings; otherwise convert to str first
            df["_".join(combo)] = df[list(combo)].astype(str).agg("_".join, axis=1)
        return df
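For illustration, a hypothetical usage of the sketch above (toy data):

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"], "c": ["p", "q"]})

fc = FeatureCombiner(cols=["a", "b", "c"], level=2)
new_df = fc.fit(df).transform(df)
print(new_df.columns.tolist())  # ['a', 'b', 'c', 'a_b', 'a_c', 'b_c']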

But there would also need to be a lot of memory/dtype/availability checks. That's why it's easier to do it manually or use something like CatBoost. The number of features grows combinatorially (that's why we need a limit and memory checks), and many of them will be highly correlated.

glevv avatar Dec 31 '20 10:12 glevv

This transformer from category encoders may do something of the sort: https://contrib.scikit-learn.org/category_encoders/catboost.html

solegalli avatar Nov 08 '21 20:11 solegalli

YOU are mistaken; there is nothing even close at this link: https://contrib.scikit-learn.org/category_encoders/catboost.html

Did you mean PolynomialWrapper, since the page says "For polynomial target support, see PolynomialWrapper"?

Sandy4321 avatar Nov 08 '21 22:11 Sandy4321

Is this the transformer you are referring to: https://contrib.scikit-learn.org/category_encoders/polynomial.html?

Otherwise, could you please paste a link with a reference?

Also, please mind our code of conduct for communications through this channel: https://feature-engine.readthedocs.io/en/latest/code_of_conduct.html

solegalli avatar Nov 09 '21 06:11 solegalli

It seems it is not for interactions, as stated in 1.2.3 Contrast Coding: "Contrast coding creates a new variable by assigning numeric weights (denoted here as w) to the levels of an ANOVA factor under the constraint that the sum of the weights equals 0",

in http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

or: "Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable", in https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/#ORTHOGONAL

You are welcome to find a true feature-interactions library ...

Sandy4321 avatar Nov 09 '21 15:11 Sandy4321