
Check for duplicate records

Open kwinkunks opened this issue 2 years ago • 2 comments

Duplicate records can be a problem. Could we check for them? E.g. check for unique rows, or make a set of the rows. This could be slow for a large dataset, though. It would be easy enough to experiment with some random data (presumably the worst-case scenario).

  • function
  • sklearn checker
  • pandas df accessor
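
As a rough illustration of the two ideas in this post (unique rows versus a set), here is a minimal sketch for experimenting with random data. The function names `n_duplicates_unique`, `n_duplicates_set` and `benchmark` are hypothetical and not part of the library; random floats almost never repeat, so they should approximate the worst case where nothing can be discarded early.

```python
import time

import numpy as np


def n_duplicates_unique(X):
    """Count rows that repeat another row, using np.unique on rows."""
    X = np.asarray(X)
    return X.shape[0] - np.unique(X, axis=0).shape[0]


def n_duplicates_set(X):
    """Count rows that repeat another row, using a set of row tuples."""
    X = np.asarray(X)
    return X.shape[0] - len({tuple(row) for row in X.tolist()})


def benchmark(n_rows=100_000, n_cols=5, seed=0):
    """Time both approaches on random floats; returns seconds per function."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_rows, n_cols))
    timings = {}
    for func in (n_duplicates_unique, n_duplicates_set):
        start = time.perf_counter()
        func(X)
        timings[func.__name__] = time.perf_counter() - start
    return timings
```

Calling `benchmark()` returns a dictionary of wall-clock times, which gives a quick feel for whether either approach copes with 100k rows on a given machine. Neither version treats NaNs specially, so that would need a separate decision.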

kwinkunks avatar Sep 23 '23 10:09 kwinkunks

Hello, I would like to work on this. Can you elaborate more on what is expected?

bhoomikaagrawal16 avatar Oct 26 '23 10:10 bhoomikaagrawal16

@bhoomikaagrawal16 hello, and thanks for thinking of contributing!

I guess there's at least a couple of scenarios:

  • Duplicate rows in any dataset -- not a good thing.
  • Rows in data that appeared in training -- really not good.
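
The second scenario (rows that already appeared in training) could be sketched as follows. This is only an illustration under assumptions: the name `seen_in_training` is hypothetical, rows are compared for exact equality, and both arrays are assumed to have the same columns in the same order.

```python
import numpy as np


def seen_in_training(X_train, X):
    """Return a boolean mask: True where a row of X also occurs in X_train."""
    # Hash every training row once, then look up each new row.
    train_rows = {tuple(row) for row in np.asarray(X_train).tolist()}
    rows = np.asarray(X).tolist()
    return np.array([tuple(row) in train_rows for row in rows], dtype=bool)
```

For example, `seen_in_training([[1, 2], [3, 4]], [[3, 4], [5, 6]])` flags only the first row of the second array.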

There are 3 places I put things:

  • Functions in a module like duplicates.py -- I would start here
  • sklearn transformers, both supervised and unsupervised, in sklearn.py (usually trying to use functions from whatever modules)
  • pandas accessors, in pandas.py (usually trying to use functions from whatever modules)

So a good place to start might be to create a module with an experimental 'duplicate detecting' function. As a rule of thumb, it needs to run reasonably fast on at least 100k records.

Write simple docstrings and doctests please (see the other modules).
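
A possible starting point for such a function, with a docstring and doctests, might look like the sketch below. The name `duplicated_rows` and its signature are assumptions rather than the library's API, and it leans on pandas' `DataFrame.duplicated` instead of a hand-rolled check.

```python
import pandas as pd


def duplicated_rows(X, keep='first'):
    """
    Flag rows of X that duplicate another row.

    Args:
        X (array-like): 2D data, one record per row.
        keep (str or bool): Passed to pandas: 'first' or 'last' leaves that
            occurrence unflagged; False flags every member of a duplicate
            group.

    Returns:
        ndarray: Boolean array with one entry per row.

    Examples:
        >>> duplicated_rows([[1, 2], [3, 4], [1, 2]]).tolist()
        [False, False, True]
        >>> duplicated_rows([[1, 2], [3, 4], [1, 2]], keep=False).tolist()
        [True, False, True]
    """
    return pd.DataFrame(X).duplicated(keep=keep).to_numpy()
```

If this lived in duplicates.py, the examples could be checked with `python -m doctest duplicates.py`.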

Does this help? Let me know if you need more.

kwinkunks avatar Oct 27 '23 13:10 kwinkunks