
Min-hash at the category level

Open • jeromedockes opened this issue 1 year ago • 1 comment

Problem Description

I have a column where each entry is a variable-length list or set. For example, I have a table like:

user id, purchased product id

and I group by user_id, so for each user I get a list of all the products that user purchased. I want to extract features that describe this information.

One simple and potentially effective option is to use min-hash at the level of the entries in the list rather than character n-grams: in this case, each product id would be hashed, and for each hashing function we would take the min over all the product ids purchased by a user.
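To make the idea concrete, here is a minimal sketch in plain Python. It is not skrub code: `hash_entry`, `set_minhash`, the use of `hashlib.blake2b`, and the seed-prefix scheme are all assumptions chosen for illustration.

```python
import hashlib


def hash_entry(entry, seed):
    """Deterministically hash one whole entry (e.g. a product id) to a 64-bit int.

    Prefixing the seed is just one simple (assumed) way to get several
    different hash functions out of a single hash primitive.
    """
    data = f"{seed}:{entry}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def set_minhash(entries, n_components=4):
    """Min-hash signature of a non-empty collection: one minimum per seed."""
    return [
        min(hash_entry(entry, seed) for entry in entries)
        for seed in range(n_components)
    ]


basket_a = {"p1", "p2", "p3"}
basket_b = {"p2", "p3", "p4"}
sig_a = set_minhash(basket_a, n_components=64)
sig_b = set_minhash(basket_b, n_components=64)

# The fraction of matching components is a noisy estimate of the Jaccard
# similarity of the two baskets (2 shared items out of 4 distinct ones here).
estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The signature depends only on the set of entries, not on their order or multiplicity, which is what makes it usable for variable-length lists. An empty list has no minimum, so a real implementation would need to decide how to encode it.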

Feature Description

It could be built into the MinHashEncoder: if the column values are Arrow or Python lists or arrays, it would switch to that behavior. It could also be built into the AggJoiner somehow.

Alternative Solutions

No response

Additional Context

Came up during EuroSciPy 2024 discussions.

jeromedockes, Aug 30 '24 08:08

Probably the easiest is to have a Hash (no min) transformer that hashes the full entry with several seeds; the result can then be aggregated with the 'min' operation in the AggJoiner.
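A rough sketch of that two-step decomposition on the long-format table (one row per purchase), in plain Python. Nothing here is an existing skrub API: `hash_row`, `min_by_key`, and the seed count are hypothetical names and values, and in practice the second step would be a group-by 'min' in a dataframe library or the aggregation done by the AggJoiner.

```python
import hashlib

N_SEEDS = 3  # hypothetical number of hash functions


def hash_row(value, n_seeds=N_SEEDS):
    """Step 1: hash one full entry with several seeds (no min taken yet)."""
    return tuple(
        int.from_bytes(
            hashlib.blake2b(f"{seed}:{value}".encode("utf-8"), digest_size=8).digest(),
            "big",
        )
        for seed in range(n_seeds)
    )


def min_by_key(rows):
    """Step 2: aggregate the hashed rows per key, taking the min of each component."""
    features = {}
    for key, value in rows:
        hashed = hash_row(value)
        if key in features:
            features[key] = tuple(map(min, features[key], hashed))
        else:
            features[key] = hashed
    return features


# (user_id, product_id) rows, before any grouping
purchases = [(1, "p1"), (1, "p2"), (2, "p2"), (2, "p3"), (2, "p1")]
features = min_by_key(purchases)  # one tuple of N_SEEDS values per user
```

Splitting the work this way means the hashing step never has to see list-valued cells: it works row by row on the ungrouped table, and the variable-length part is handled entirely by the aggregation.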

jeromedockes, Oct 23 '24 12:10