category_encoders [Question; need help; support request] Possible to join multiple CountEncoders after parallel (multiprocessing) fitting?

Open HWiese1980 opened this issue 1 year ago • 1 comments

We have a big data frame that we want to fit a CountEncoder on. We would like to make use of the multiple cores of our machine by splitting the DF into multiple chunks and fitting (among other things) a CountEncoder on each chunk.

After that, the individual CountEncoder objects have to be joined into one big CountEncoder, as if it had been fitted on the whole data frame.

Can this be done? If yes, how can we do that?

Jun 04 '24 06:06 HWiese1980

This is not supported out of the box. Are you planning to use the CountEncoder with normalize=True? Would it be possible to fit on a random subset only? I'd expect the results to be similar to those for the whole dataset. If you want to go for the full dataset, you need to implement something yourself. If you fit multiple CountEncoders, make sure they all use the same OrdinalEncoder (the count encoder first fits an OrdinalEncoder to encode e.g. "foo", "bar" to 1, 2 and hence standardize the input). You'd want to pass that fitted OrdinalEncoder in the init rather than fit it in the fit function. Writing a combine function that adds up the counts should be rather straightforward then.
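A minimal sketch of what such a combine step might look like, using only the standard library rather than category_encoders itself. The names `count_chunk`, `combine_counts`, and `parallel_counts` are hypothetical, and each chunk is assumed to be a plain dict of column name to list of values, standing in for one slice of the data frame. Counting on the raw category values sidesteps the shared-OrdinalEncoder question, since every chunk uses the same keys.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_chunk(chunk):
    """Count how often each category occurs per column in one chunk.

    `chunk` is assumed to be a dict mapping column name -> list of values.
    """
    return {col: Counter(values) for col, values in chunk.items()}


def combine_counts(partials):
    """Add up per-chunk counts, as if everything had been counted at once."""
    total = {}
    for partial in partials:
        for col, counts in partial.items():
            # Counter.update adds counts instead of overwriting them.
            total.setdefault(col, Counter()).update(counts)
    return total


def parallel_counts(chunks, max_workers=None):
    """Count each chunk in its own worker process, then merge the results."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return combine_counts(pool.map(count_chunk, chunks))
```

Call `parallel_counts(chunks)` from under an `if __name__ == "__main__":` guard so that worker processes can be started safely. The merged counts would still have to be turned into whatever mapping the encoder expects, which this sketch does not attempt. For the normalize=True case, dividing each merged count by the column's total count should give the frequencies.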

Jun 04 '24 20:06 PaulWestenthanner

I am sorry I haven't gotten back to you here. I'm currently not on the job for health reasons. Also for the same reason I'm not quite sure if this question is still relevant for us. Nevertheless, thanks for your response.

The problem with a random subset is that you can never be absolutely certain that it captures every category in a given column. Still, some preprocessing may help: we could drop duplicate categories before training. That way we'd end up with a massively reduced subset while still carrying over every category and leaving none behind.
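A rough sketch of that de-duplication idea, again with hypothetical names (`unique_categories`, `build_mapping`) and chunks assumed to be dicts of column name to list of values. It collects every distinct category per column and assigns each one an integer code, similar in spirit to the "foo", "bar" to 1, 2 mapping mentioned above.

```python
def unique_categories(chunks):
    """Collect every distinct value per column across all chunks."""
    seen = {}
    for chunk in chunks:
        for col, values in chunk.items():
            seen.setdefault(col, set()).update(values)
    return seen


def build_mapping(categories):
    """Assign each category an integer code starting at 1, per column.

    Assumes the values within a column can be sorted against each other.
    """
    return {
        col: {cat: code for code, cat in enumerate(sorted(cats), start=1)}
        for col, cats in categories.items()
    }
```

Note that a de-duplicated subset keeps the set of categories but not their frequencies, so it could serve to build a shared category mapping while the counts themselves would still have to come from the full data.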

Oct 02 '24 11:10 HWiese1980
