Check for duplicate records
Duplicate records can be a problem; we could check for this, e.g. by counting unique rows or building a set of row hashes. That could be expensive for a large dataset, though. It's easy enough to experiment with some random data (presumably the worst case, since nearly every row is unique) -- see the sketch after the list below. Possible forms:
- function
- sklearn checker
- pandas df accessor
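As a quick sketch of that experiment (untested; random floats make almost every row unique, which is roughly the worst case for a hash-based check):

```python
import numpy as np
import pandas as pd

# Random floats: essentially all rows are unique, so the duplicate
# check has to track every row -- roughly the worst case.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 10)))

dupes = df.duplicated()  # pandas hashes rows, roughly linear in row count
print(dupes.sum())       # expect 0 for random floats
```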
Hello, I would like to work on this. Can you elaborate more on what is expected?
@bhoomikaagrawal16 hello, and thanks for thinking of contributing!
I guess there are at least a couple of scenarios (both sketched below):
- Duplicate rows in any dataset -- not a good thing.
- Rows in data that appeared in training -- really not good.
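For concreteness, here is a rough sketch of what a check for each scenario might look like in plain pandas (the toy frames are just for illustration):

```python
import pandas as pd

train = pd.DataFrame({"a": [1, 2, 2], "b": [3, 4, 4]})
test = pd.DataFrame({"a": [2, 5], "b": [4, 6]})

# Scenario 1: duplicate rows within a single dataset.
within = train.duplicated()  # boolean mask, True on repeated rows

# Scenario 2: test rows that already appeared in training (leakage).
# An inner merge on all shared columns keeps only the overlapping rows.
leaked = test.merge(train.drop_duplicates(), how="inner")

print(within.sum(), len(leaked))  # 1 duplicate row, 1 leaked row
```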
There are three places I put things:
- Functions in a module like `duplicates.py` -- I would start here.
- `sklearn` transformers, both supervised and unsupervised, in `sklearn.py` (usually reusing functions from the other modules).
- `pandas` accessors, in `pandas.py` (usually reusing functions from the other modules; see the accessor sketch below).
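For illustration, the accessor layer usually just wraps a module-level function. This is only a sketch: `register_dataframe_accessor` is the standard pandas hook, but the `dupes` name and `DupesAccessor` class are made up here:

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("dupes")  # hypothetical name
class DupesAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def mask(self) -> pd.Series:
        # In the real module this would call the function from duplicates.py.
        return self._df.duplicated()
```

After registration, `df.dupes.mask()` works on any DataFrame.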
So a good place to start might be to create a module with an experimental 'duplicate detecting' function, along the lines of the sketch below. As a rule of thumb, it needs to work reasonably fast on at least 100k records.
Please write simple docstrings and doctests (see the other modules).
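Something along these lines could be a first cut (the `duplicated_rows` name and signature are just suggestions, not an existing API):

```python
"""duplicates.py -- experimental duplicate-row detection (sketch)."""
import pandas as pd


def duplicated_rows(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask marking rows that repeat an earlier row.

    >>> df = pd.DataFrame({"a": [1, 2, 1], "b": [3, 4, 3]})
    >>> duplicated_rows(df).tolist()
    [False, False, True]
    """
    # Delegate to pandas' hashed row comparison, which keeps this
    # roughly linear in the number of rows.
    return df.duplicated()
```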
Does this help? Let me know if you need more.