category_encoders
[Question; need help; support request] Possible to join multiple CountEncoders after parallel (multiprocessing) fitting?
We have a large DataFrame on which we want to fit a CountEncoder, and we would like to make use of the multiple cores of our machine. The idea is to split the DataFrame into multiple chunks and fit (among other things) a CountEncoder on each chunk.
After that, the individual CountEncoder objects have to be joined into one big CountEncoder that behaves as if it had been fitted on the whole DataFrame.
Can this be done? If yes, how can we do that?
This is not supported out of the box.
Are you planning to use the CountEncoder with normalize=True? Would it be possible to fit on a random subset only? I'd expect the results to be similar to those from the whole dataset.
If you want to go for the full dataset, you need to implement something yourself. If you fit multiple CountEncoders, make sure they all use the same OrdinalEncoder (the CountEncoder first fits an OrdinalEncoder to encode e.g. "foo", "bar" to 1, 2 and hence standardize the input). You'd want to pass that fitted OrdinalEncoder in the init rather than fit it in the fit function. Writing a combine function that adds up the counts should be rather straightforward then.
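To illustrate the "add up the counts" idea, here is a minimal sketch in plain Python. It does not use the category_encoders API at all: `count_chunk` and `combine_counts` are hypothetical helper names, and rows are assumed to be simple dicts rather than DataFrame rows. It only shows that per-chunk counts sum to the same result as counting over the full data.

```python
from collections import Counter
from functools import reduce


def count_chunk(rows, columns):
    """Count how often each category occurs per column in one chunk.

    `rows` is assumed to be a list of dicts (hypothetical stand-in for a
    DataFrame chunk).
    """
    return {col: Counter(row[col] for row in rows) for col in columns}


def combine_counts(left, right):
    """Merge two per-column count dicts by adding the counts."""
    return {col: left[col] + right[col] for col in left}


rows = [
    {"color": "red"}, {"color": "blue"}, {"color": "red"},
    {"color": "green"}, {"color": "red"}, {"color": "blue"},
]
chunks = [rows[:3], rows[3:]]

# Each chunk could be counted in a separate worker process; a plain
# list comprehension is used here to keep the sketch simple.
partial_counts = [count_chunk(chunk, ["color"]) for chunk in chunks]
combined = reduce(combine_counts, partial_counts)

# Counting over the full data gives the same result.
full = count_chunk(rows, ["color"])
print(combined == full)
```

Because the counts here are keyed by the raw category values, no shared ordinal mapping is needed. If the counts are keyed by ordinal codes instead, every chunk has to use the same mapping, as described above.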
I am sorry I haven't gotten back to you here. I'm currently not on the job for health reasons. Also for the same reason I'm not quite sure if this question is still relevant for us. Nevertheless, thanks for your response.
The problem with a random subset is that you can never be absolutely certain that it captures every category in a given column. Still, some preprocessing may be helpful, where we drop duplicate categories before training. That way we'd end up with a massively reduced subset while being sure to carry over every category and leave none behind.
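A rough sketch of that deduplication step, again in plain Python with hypothetical helper names (`unique_categories`, `build_ordinal_mapping`) and rows assumed to be dicts. Note that the deduplicated subset keeps every category but not how often each one occurs, so it would be suited to building a shared category-to-integer mapping rather than the counts themselves.

```python
def unique_categories(rows, columns):
    """Collect the distinct values of each column in first-seen order."""
    seen = {col: {} for col in columns}
    for row in rows:
        for col in columns:
            # dicts preserve insertion order, so this keeps first-seen order
            seen[col].setdefault(row[col], None)
    return {col: list(values) for col, values in seen.items()}


def build_ordinal_mapping(categories):
    """Map each category to an integer code starting at 1, per column."""
    return {
        col: {value: code for code, value in enumerate(values, start=1)}
        for col, values in categories.items()
    }


rows = [
    {"word": "foo"}, {"word": "bar"}, {"word": "foo"},
    {"word": "baz"}, {"word": "bar"},
]

categories = unique_categories(rows, ["word"])
mapping = build_ordinal_mapping(categories)
print(mapping)  # {'word': {'foo': 1, 'bar': 2, 'baz': 3}}
```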