redflag Check for duplicate records

Duplicate records can be a problem. Could we check for them? E.g. check for unique rows, or make a set? That could be slow for a large dataset, though. It's easy enough to experiment with some random data (presumably the worst-case scenario).

function
sklearn checker
pandas df accessor
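As a rough sketch of the random-data experiment mentioned above, the snippet below times the two ideas (unique rows vs. a set) on 100k random rows with 100 known duplicates appended. The function names are hypothetical, not part of redflag, and the set-based version compares raw bytes, so it assumes a single numeric dtype and treats 0.0 and -0.0 as different values.

```python
import time

import numpy as np


def count_duplicates_unique(X):
    """Count rows that repeat another row, via np.unique over rows."""
    X = np.asarray(X)
    return X.shape[0] - np.unique(X, axis=0).shape[0]


def count_duplicates_set(X):
    """Count rows that repeat another row, via a set of raw row bytes."""
    X = np.ascontiguousarray(X)
    return X.shape[0] - len({row.tobytes() for row in X})


# Random floats: effectively every row is unique, so nothing collapses early.
rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 5))
X_dup = np.vstack([X, X[:100]])  # Append 100 known duplicate rows.

for func in (count_duplicates_unique, count_duplicates_set):
    start = time.perf_counter()
    n = func(X_dup)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {n} duplicates in {elapsed:.3f} s")
```

Timings will vary by machine and by number of columns, so this is only a starting point for comparing approaches.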

Sep 23 '23 10:09 kwinkunks

Hello, I would like to work on this. Can you elaborate more on what is expected?

Oct 26 '23 10:10 bhoomikaagrawal16

@bhoomikaagrawal16 hello, and thanks for thinking of contributing!

I guess there are at least a couple of scenarios:

Duplicate rows in any dataset -- not a good thing.
Rows in data that appeared in training -- really not good.
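To make the two scenarios concrete, here is a minimal sketch using NumPy only. Both function names are made up for illustration and are not existing redflag functions. It assumes 2D numeric arrays with the same number of columns, and it matches rows exactly (byte for byte), so near-duplicates are not caught.

```python
import numpy as np


def has_duplicate_rows(X):
    """Scenario 1: True if any row of X occurs more than once."""
    X = np.asarray(X)
    return np.unique(X, axis=0).shape[0] < X.shape[0]


def seen_in_training(X_train, X_test):
    """Scenario 2: boolean mask over X_test rows that also occur in X_train."""
    X_train = np.ascontiguousarray(X_train)
    # Cast to the training dtype so identical values have identical bytes.
    X_test = np.ascontiguousarray(X_test, dtype=X_train.dtype)
    train_rows = {row.tobytes() for row in X_train}
    return np.array([row.tobytes() in train_rows for row in X_test], dtype=bool)


X_train = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]])
X_test = np.array([[5.0, 6.0], [3.0, 4.0]])

print(has_duplicate_rows(X_train))
print(seen_in_training(X_train, X_test))
```

Returning a mask for the second case, rather than a single flag, would let a caller report which rows leaked from training.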

There are 3 places I put things:

Functions in a module like duplicates.py -- I would start here
sklearn transformers, both supervised and unsupervised, in sklearn.py (usually trying to use functions from whatever modules)
pandas accessors, in pandas.py (usually trying to use functions from whatever modules)

So a good place to start might be to create a module with an experimental 'duplicate detecting' function. It needs to run reasonably fast on at least 100k records, as a rule of thumb.

Write simple docstrings and doctests please (see the other modules).
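For illustration, a first function in something like duplicates.py might look like the sketch below, with a simple docstring and doctest. The name `duplicated_rows` and the docstring layout are assumptions, so check the other modules for the actual conventions. It relies on `np.unique` with `return_index=True`, which gives the index of the first occurrence of each unique row.

```python
import numpy as np


def duplicated_rows(X):
    """
    Flag rows that repeat an earlier row.

    Args:
        X (array): 2D array of shape (n_samples, n_features).

    Returns:
        array: Boolean array of shape (n_samples,); True where the row
            is identical to a row that occurs earlier in X.

    Example:
        >>> X = [[1, 2], [3, 4], [1, 2], [1, 2]]
        >>> duplicated_rows(X).tolist()
        [False, False, True, True]
    """
    X = np.asarray(X)
    # Indices of the first occurrence of each unique row.
    _, first_idx = np.unique(X, axis=0, return_index=True)
    mask = np.ones(X.shape[0], dtype=bool)
    mask[first_idx] = False  # First occurrences are not duplicates.
    return mask


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

A mask like this mirrors what `pandas.DataFrame.duplicated` returns by default, which might make it easy to reuse from the sklearn checker and the pandas accessor later.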

Does this help? Let me know if you need more.

Oct 27 '23 13:10 kwinkunks