selfoss
Deduplicate items across sources
Hi,
selfoss only adds an item from a feed when it is not already present for that source. However, newspapers often have separate feeds for different topics. When you subscribe to multiple feeds, you can end up with the same article from multiple feeds/sources.
So it would be nice if selfoss could check whether the article is present regardless of source. This is usually ok since the ID is the URL to the article, which should be unique across sources.
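To illustrate the difference between the two checks, here is a minimal Python sketch. selfoss itself is written in PHP, and the function names and the `(source_id, uid)` layout below are hypothetical, made up only for illustration:

```python
# Hypothetical sketch, not selfoss code: "existing" is a set of
# (source_id, uid) pairs for items that are already stored.

def is_new_per_source(existing, uid, source_id):
    """Per-source check: only a duplicate if this same source already has the uid."""
    return (source_id, uid) not in existing


def is_new_across_sources(existing, uid):
    """Proposed check: a duplicate if any source already has the uid."""
    return all(seen_uid != uid for _, seen_uid in existing)


article = "https://example.com/article-1"
existing = {(1, article)}  # source 1 already delivered the article

# Source 2 now delivers the same article:
print(is_new_per_source(existing, article, 2))   # True  -> stored a second time
print(is_new_across_sources(existing, article))  # False -> skipped
```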
I have implemented this change in behavior here, controlled by an ini parameter: https://github.com/mrichtarsky/selfoss/commit/f31bf4ff5091e8224c508200d1f42e915c921784
Would this be interesting for others as well?
Thanks and best regards, Martin
Thanks, that is an interesting idea. I wonder if we could make it always enabled and have the item appear in multiple sources.
We would probably need to replace the `source` column in the `items` table with an m:n association table. I will need to check the performance implications.
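A rough sketch of what such an association table might look like, written in Python with SQLite for brevity. The table and column names (in particular `items_sources`) and the `add_item` helper are assumptions for illustration and do not reflect the real selfoss schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Each article is stored once, keyed by its uid.
CREATE TABLE items (id INTEGER PRIMARY KEY, uid TEXT NOT NULL UNIQUE, title TEXT NOT NULL);
-- Hypothetical m:n association table replacing a per-item source column.
CREATE TABLE items_sources (
    item_id   INTEGER NOT NULL REFERENCES items(id),
    source_id INTEGER NOT NULL REFERENCES sources(id),
    PRIMARY KEY (item_id, source_id)
);
""")


def add_item(conn, source_id, uid, title):
    """Store the item once and link it to every source that delivers it."""
    conn.execute("INSERT OR IGNORE INTO items (uid, title) VALUES (?, ?)", (uid, title))
    (item_id,) = conn.execute("SELECT id FROM items WHERE uid = ?", (uid,)).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO items_sources (item_id, source_id) VALUES (?, ?)",
        (item_id, source_id),
    )
    return item_id


conn.executemany(
    "INSERT INTO sources (id, title) VALUES (?, ?)",
    [(1, "Politics"), (2, "Front page")],
)
first = add_item(conn, 1, "https://example.com/a", "Article A")
second = add_item(conn, 2, "https://example.com/a", "Article A")

item_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM items_sources").fetchone()[0]
print(item_count, link_count)  # 1 2
```

With this layout, listing the items of one source needs a join through the association table, which is where the performance question comes in.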
This is a very nice idea. What are you using as the identifier to deduplicate? The URL? What if the two feeds return different content? That should not be an issue if you're using the full text recovery, though.
> What are you using as the identifier to deduplicate? The URL?
The UID. Most commonly, this is the post URL, but that is not required. For example, blogger.com will use something like `tag:blogger.com,1999:blog-6112936277054198647.post-403878284366003238`.
> What if the two feeds return different content? That should not be an issue if you're using the full text recovery, though.
We could have `findAll` return the `source` id in addition to the `item` id, check whether the `content` and `url` match when the source id does not, and only deduplicate the item then.
That would also probably resolve the `uid` collisions.
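As a minimal sketch of that check, in Python rather than PHP for brevity: the `Item` class and `should_skip` function are hypothetical and only illustrate the comparison, not how `findAll` actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    source_id: int
    uid: str
    url: str
    content: str


def should_skip(new, existing_items):
    """Return True if `new` duplicates an item that is already stored."""
    for old in existing_items:
        if old.uid != new.uid:
            continue
        if old.source_id == new.source_id:
            return True  # same source, same uid: already present
        if old.url == new.url and old.content == new.content:
            return True  # other source, but clearly the same article
        # Same uid from another source with a different url or content:
        # treat it as a uid collision and keep looking.
    return False


stored = [Item(1, "uid-1", "https://example.com/a", "Full text")]

same_article = Item(2, "uid-1", "https://example.com/a", "Full text")
collision = Item(2, "uid-1", "https://example.org/b", "Something else")

print(should_skip(same_article, stored))  # True
print(should_skip(collision, stored))     # False
```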
The issue that items will be missing from some of the sources will still remain, though, which is why I would like to test the performance impact of having the `sources` table in an m:n relation to `items`.