Min-hash at the category level
Problem Description
I have a column where each entry is a variable-length list or set. For example, I have a table like:
user id, purchased product id
and I group by user id. This gives me, for each user, a list of all the products that user purchased. I want to extract features that describe that information.
One simple and potentially effective option is to use min-hash at the level of the entries in the list rather than character n-grams: each product id would be hashed as a whole, and for each hashing function we would compute the min over all the product ids purchased by a user.
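To make the idea concrete, here is a minimal sketch in plain Python. It is not skrub code: `hash_entry` and `minhash_entries` are hypothetical names, and salting a `blake2b` digest with the seed is just one assumed way to get several hash functions.

```python
import hashlib


def hash_entry(entry, seed):
    """Hash one whole entry (e.g. a product id) to a 64-bit integer, salted by `seed`."""
    data = f"{seed}:{entry!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_entries(entries, n_components=4):
    """Return one value per seed: the min of the hashes of all entries."""
    entries = set(entries)  # order and duplicates do not affect the min
    if not entries:
        # Assumption: an empty list gets missing values; other conventions are possible.
        return [None] * n_components
    return [
        min(hash_entry(entry, seed) for entry in entries)
        for seed in range(n_components)
    ]


alice = minhash_entries(["p1", "p2", "p3"])
bob = minhash_entries(["p3", "p2", "p1", "p1"])  # same set of products as alice
carol = minhash_entries(["p7", "p8"])  # no product in common with alice
```

Each user ends up with a fixed-length vector whatever the length of their list, and users with heavily overlapping purchases are more likely to share components.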
Feature Description
It could be built into the MinHashEncoder: if the column values are Arrow or Python lists or arrays, it would switch to that behavior. It could also be built into the AggJoiner somehow.
Alternative Solutions
No response
Additional Context
Came up during EuroSciPy 2024 discussions.
Probably the easiest approach is to have a Hash (no min) transformer that hashes the full entry with several seeds; the result can then be aggregated with the 'min' operation in the AggJoiner.
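A rough sketch of that two-step version, using pandas only. The `hash_entry` helper and the `hash_<seed>` column names are made up for illustration, and a plain `groupby(...).min()` stands in for the aggregation the AggJoiner would perform; none of this is existing skrub API.

```python
import hashlib

import pandas as pd


def hash_entry(entry, seed):
    """Hash one whole entry to a 32-bit integer, salted by `seed` (hypothetical helper)."""
    data = f"{seed}:{entry!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")


purchases = pd.DataFrame(
    {
        "user_id": ["alice", "alice", "alice", "bob", "bob"],
        "product_id": ["p1", "p2", "p3", "p2", "p9"],
    }
)
n_components = 3

# Step 1, "Hash (no min)": one hashed column per seed, still one row per purchase.
hashed = pd.DataFrame(
    {
        f"hash_{seed}": purchases["product_id"].map(lambda e, s=seed: hash_entry(e, s))
        for seed in range(n_components)
    }
)
hashed["user_id"] = purchases["user_id"]

# Step 2: aggregate with 'min' per user, giving one row of features per user.
features = hashed.groupby("user_id").min()
```

Splitting the work this way keeps the hashing step stateless and row-wise, so the min could be swapped for another aggregation without touching the transformer.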