
Add drop_duplicates

Open jbesomi opened this issue 4 years ago • 4 comments


Add hero.drop_duplicates(s, representation, distance_algorithm, threshold).

Where:

  • s is a Pandas Series
  • representation is either a Flair embedding or a hero representation function. We still need to define a default value.
  • distance_algorithm is either a string or a function that takes two vectors as input and computes their distance. An example of such a function is sklearn.metrics.pairwise.euclidean_distances (see the scikit-learn repository).
  • threshold is a numeric value. All vectors whose distance is less than this value will be considered a single document. The first in order of appearance in the Pandas Series will be kept.

Task: Drop all duplicates from the given Pandas Series and return a cleaned version of it.

TODO: It would be interesting to drop_duplicates from a DataFrame, specifying which column(s) to consider (as done in Pandas).
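
A rough sketch of how such a function could look (the greedy keep-first loop, the default distance, and the default threshold are my assumptions, not a settled design):

```python
from sklearn.metrics.pairwise import euclidean_distances


def drop_duplicates(s, representation, distance_algorithm=euclidean_distances, threshold=0.05):
    # `representation` is assumed here to map a Series of texts to an
    # (n_documents, n_features) array of vectors; its default value is still open.
    vectors = representation(s)
    distances = distance_algorithm(vectors)  # pairwise distance matrix

    keep = []  # positional indices of documents kept, in order of appearance
    for i in range(len(s)):
        # drop document i if it is within `threshold` of an already-kept document
        if all(distances[i, j] >= threshold for j in keep):
            keep.append(i)

    return s.iloc[keep]
```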

jbesomi avatar Apr 26 '20 11:04 jbesomi

@jbesomi Should it check line by line and remove a line if it is a duplicate? Or should there be no removal, just a report that there are duplicates?

selimelawwa avatar May 16 '20 15:05 selimelawwa

The idea here is to compare the long text of documents and find out whether some of them are too similar; in that case, it might mean that the documents are indeed duplicates. There are many applications for this, for instance detecting plagiarism in papers.

A naive approach is to apply TF-IDF and look at the distance between vectors.
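
A minimal illustration of that naive approach with plain scikit-learn (toy data and threshold chosen just for the example):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

s = pd.Series([
    "The cat sat on the mat.",
    "The cat sat on the mat!",          # near-duplicate of the first document
    "An entirely different sentence.",
])

distances = cosine_distances(TfidfVectorizer().fit_transform(s))

# pairs of documents closer than the threshold are candidate duplicates
threshold = 0.1
duplicates = [(i, j) for i in range(len(s)) for j in range(i + 1, len(s))
              if distances[i, j] < threshold]
print(duplicates)  # [(0, 1)] with this toy data
```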

jbesomi avatar May 16 '20 15:05 jbesomi

I suggest having several methods for handling duplicated content.

In the very simplest form, you might just need to check against a hash (sha1, for instance) to be sure you don't have exact duplicates (OK, this might be a preprocessing job).

The interface might look like Pandas.Series.unique(), but specifying a method / way to do the deduplication: unique(method='hash | jaccard | etc.', threshold=xx).
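
A hedged sketch of that interface, where 'hash' catches exact duplicates and a token-level Jaccard distance catches near-duplicates (names, defaults, and the whitespace tokenization are all illustrative assumptions):

```python
import hashlib

import pandas as pd


def _jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| over whitespace-separated tokens
    a, b = set(a.split()), set(b.split())
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0


def unique(s, method="hash", threshold=0.2):
    if method == "hash":
        # exact duplicates: keep the first occurrence of each sha1 digest
        digests = s.map(lambda text: hashlib.sha1(text.encode("utf-8")).hexdigest())
        return s[~digests.duplicated()]

    if method == "jaccard":
        keep = []  # positional indices kept; the first occurrence wins
        for i, text in enumerate(s):
            if all(_jaccard_distance(text, s.iloc[j]) >= threshold for j in keep):
                keep.append(i)
        return s.iloc[keep]

    raise ValueError(f"Unknown method: {method}")


# usage: keeps "a b c" and "a b d", drops the exact repeat of "a b c"
unique(pd.Series(["a b c", "a b c", "a b d"]), method="jaccard", threshold=0.5)
```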

igponce avatar Jul 08 '20 10:07 igponce

Hey @igponce,

Exactly, the interface would look like hero.unique(df['text']).

A simple-yet-powerful solution is to simply compute a good representation of each text and remove documents that have very similar vectors.

Right, as you point out, the function will take a threshold argument. We will need to run some tests and pick a good default; this will largely depend on the underlying algorithm.

Would you be interested in implementing this solution? Jaccard might work as well, but it's easy to do better by using word vectors instead of just counting.

Food for thought: what if the input must already be a representation? That would be an even better solution. In this case, the arguments might be the distance function as well as the threshold parameter.
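
If the input is already a representation, a sketch under that assumption (the function name and the vectorized keep-first logic are just one possible reading of the idea above):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances


def drop_duplicate_vectors(vectors, distance_algorithm=euclidean_distances, threshold=0.05):
    # `vectors` is assumed to be an (n_documents, n_features) array-like
    # that is already a representation of the documents.
    distances = distance_algorithm(np.asarray(vectors))
    n = len(distances)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if keep[i]:
            # drop every later document closer than `threshold` to document i
            keep[(distances[i] < threshold) & (np.arange(n) > i)] = False
    return np.flatnonzero(keep)  # positional indices of the documents to keep
```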

jbesomi avatar Jul 08 '20 10:07 jbesomi