Deduplicate Feature

Open dslovin opened this issue 3 years ago • 10 comments

Occasionally, I get duplicate entries of the same article because I read a feed at the source as well as through an aggregator like Hacker News. I would love to be able to dedupe based on the following fields:

  1. Link
  2. Title
  3. (bonus) Similar titles

(edit for spelling)

dslovin avatar Sep 14 '20 14:09 dslovin

Some sites publish both original and copied news. When I subscribe to these feeds, I always see duplicate articles across several feeds. I would like to deduplicate similar entries in multiple or all feeds.

When adding a new entry, Miniflux could check recent entries and calculate their similarity to it; if an existing entry reaches the configured threshold, the new entry is marked as removed or read.

For the similarity calculation, maybe we can first split the text into words and use cosine similarity, or simply test for equality. Users could configure what the similarity is calculated on: title or content.
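
A minimal sketch of the idea above, assuming a bag-of-words cosine similarity over titles. The function names and the 0.8 threshold are hypothetical and are not part of Miniflux:

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into word tokens.
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    # Build word-count vectors and compare them by the cosine of their angle.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(
        sum(c * c for c in va.values()) * sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def is_duplicate(new_title, recent_titles, threshold=0.8):
    # Hypothetical policy: the new entry counts as a duplicate when any
    # recent title reaches the configured threshold.
    return any(
        cosine_similarity(new_title, old) >= threshold for old in recent_titles
    )
```

With `threshold=1.0` this degenerates to (case-insensitive, order-insensitive) equality of the word counts, which roughly covers the "simply test for equality" option. The same functions could be applied to the entry content instead of the title.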

moonheart avatar Feb 11 '22 08:02 moonheart

Hopefully the "Mark as Read" option is available. That's what I manually do anyway.

somini avatar Feb 13 '22 00:02 somini

Since Miniflux relies on PostgreSQL, maybe something like the pg_trgm extension would be useful? https://www.postgresql.org/docs/current/pgtrgm.html
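
To illustrate what trigram matching measures, here is a rough pure-Python approximation of it. This is only a sketch: the real pg_trgm extension exposes this in SQL through `similarity(text, text)` and the `%` operator, and its exact normalization rules may differ from the assumptions made here (words are lowercased and padded with two leading spaces and one trailing space).

```python
import re


def trigrams(text):
    # Collect the set of 3-character substrings of each padded word.
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    # Shared trigrams divided by all distinct trigrams of both strings.
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Unlike word-level comparison, this tolerates small spelling differences within a title, which may help with the "similar titles" case.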

nblock avatar Jun 16 '22 17:06 nblock

This would be an awesome feature. A lot of times a writer writes for their own blog and then reposts somewhere else, so both entries land in the same Miniflux category with the same title. If it were to remove either entry (preferably keeping the first), then that would be awesome.

ajtatum avatar Aug 02 '22 04:08 ajtatum

Came here with a slightly different (but related) problem: some of my feeds, largely big newspapers, re-publish the same articles over time. This particularly applies to essays; I think they want to push them a number of times so their website appears "more active", without adding any new information. But it is frustrating to see the same posts popping up again and again, and it wastes my time.

I was wondering whether a deduplication feature could also have some temporal comparison check such as "The same article heading was published 1 month ago, 2 years ago etc." to then get hidden from standard view.

Functionally, it would be pretty similar: one needs a persistent table in Postgres with headings (and timestamps) to check against.
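
A minimal sketch of such a table, using SQLite from the Python standard library as a stand-in for Postgres. The table name, column names, and helper functions are hypothetical and do not reflect the Miniflux schema:

```python
import sqlite3


def open_store(path=":memory:"):
    # One row per normalized heading, with the time it was first seen.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_titles ("
        "title TEXT PRIMARY KEY, first_seen INTEGER NOT NULL)"
    )
    return conn


def normalize(title):
    # Lowercase and collapse whitespace so trivial variations still match.
    return " ".join(title.lower().split())


def check_and_record(conn, title, now):
    # Return the first-seen timestamp if the heading is already known;
    # otherwise record it with the given timestamp and return None.
    key = normalize(title)
    row = conn.execute(
        "SELECT first_seen FROM seen_titles WHERE title = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    conn.execute(
        "INSERT INTO seen_titles (title, first_seen) VALUES (?, ?)", (key, now)
    )
    conn.commit()
    return None
```

The returned timestamp is what would allow a message such as "the same heading was published 1 month ago", by comparing it with the new entry's publication time.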

Sieboldianus avatar Feb 11 '23 06:02 Sieboldianus

I don't know how it is implemented, but ttrss has such a deduplication feature. Maybe it can help to develop such a feature for Miniflux too!

sonor3000 avatar Apr 22 '23 21:04 sonor3000