Filter double items

Open M1Aston opened this issue 4 years ago • 3 comments

Would it be possible to extend the filter for double items a bit? At the moment it only deals with titles that are exactly the same, but sometimes titles are very similar without being identical. Could you add a filter for these situations as well? Here are two examples: https://ibb.co/tQGkCgt

M1Aston avatar May 11 '20 05:05 M1Aston

I agree it would be really useful. Do you have an idea on how to write this filter? It is dangerous if the filter is not strict enough.

Off the top of my head, I can find two ways to do it:

  • Remove an item whose name is included in the name of another item. This could work for edits, but I think there will be a lot of false positives, especially if someone publishes items with short names.
  • Use the Levenshtein distance (edit distance). For example, remove an item if the Levenshtein distance between its title and another item's title is smaller than 10% of the length of the title.

Or maybe combine both?
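
To make the two ideas concrete, here is a rough sketch in Python (not Flym's actual code) that combines them. Everything named here is an assumption for illustration: the function names, the choice to measure the 10% against the longer title, the lower-casing, and the `min_substring_len` guard, which is one possible way to limit the false positives from short names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(title_a: str, title_b: str,
                      max_ratio: float = 0.10,
                      min_substring_len: int = 20) -> bool:
    """Hypothetical combined check: containment OR small edit distance."""
    a = title_a.strip().lower()
    b = title_b.strip().lower()
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    # Containment rule, only trusted when the contained title is long enough.
    if len(shorter) >= min_substring_len and shorter in longer:
        return True
    # Levenshtein rule: distance below 10% of the longer title (assumption).
    return levenshtein(a, b) < max_ratio * len(longer)
```

With these defaults, "Apple announces new iPhone model today" and "Apple announces new iPhone models today" would be treated as duplicates, while a short title such as "News" would not match "Breaking News from Paris". The thresholds would need tuning on real feeds.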

lavendthomas avatar May 11 '20 07:05 lavendthomas

Well, you're probably asking too much for me to be able to answer. :-) I don't know. Perhaps filtering based on a similar word count?

Two feeds that share at least x identical words (x being set by the user) would be treated as identical. Perhaps combined with an exclusion for (too) short words (a, it, the...).
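
As a rough illustration of that idea, a Python sketch might look like the following. The stop-word list, the `min_word_len` cutoff, and the default of 4 shared words are all assumptions made up for the example; in practice x would come from a user setting.

```python
import re

# Hypothetical list of words that are too common to count as a match.
STOP_WORDS = frozenset({"a", "an", "it", "the", "of", "to", "in", "on", "and", "or"})


def significant_words(title: str, min_word_len: int = 3) -> set:
    """Lower-cased words of a title, without short words and stop words."""
    words = re.findall(r"\w+", title.lower())
    return {w for w in words if len(w) >= min_word_len and w not in STOP_WORDS}


def share_enough_words(title_a: str, title_b: str, min_shared: int = 4) -> bool:
    """True if the two titles have at least `min_shared` significant words in common."""
    shared = significant_words(title_a) & significant_words(title_b)
    return len(shared) >= min_shared
```

One caveat with a fixed x: two long titles on the same topic can share several words without being the same item, so a ratio of shared words to total words might be a safer variant.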

M1Aston avatar May 11 '20 07:05 M1Aston

It appears that even the filter as it is now doesn't work properly. Take a look at these two screenshots: one with a double/identical feed and one with three identical feeds(!). https://ibb.co/RhMZJVD

M1Aston avatar May 15 '20 05:05 M1Aston