
Deduplicate results

Open nhoizey opened this issue 2 years ago • 6 comments

It would be great if we could deduplicate results, for sites where the same content can be present on different pages.

Duplicated content like this already needs a canonical URL for SEO (which Pagefind can pick up with data-pagefind-meta="url[href]"), so maybe a boolean option to use the result URL as a deduplication key would be enough.

nhoizey avatar Mar 07 '24 22:03 nhoizey

You can for example search for “animal” on https://nicolas-hoizey.photo/search/ and see multiple identical results.

For example, the photo “A storm is coming” is available in 3 different galleries:

  • Maasai Mara National Reserve: https://nicolas-hoizey.photo/galleries/travels/africa/kenya/maasai-mara/a-storm-is-coming/
  • Landscapes: https://nicolas-hoizey.photo/galleries/landscapes/a-storm-is-coming/
  • Elephants: https://nicolas-hoizey.photo/galleries/animals/mammals/elephants/a-storm-is-coming/

They all have the same canonical URL: https://nicolas-hoizey.photo/photos/a-storm-is-coming/ (which I configured for Pagefind, but maybe I shouldn't until it's possible to deduplicate results).
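Until something like this exists in Pagefind itself, a client-side workaround could look like the sketch below. It assumes each loaded result exposes a `url` field (which is where the canonical URL ends up when it is set through data-pagefind-meta); the `LoadedResult` type and the `dedupeByUrl` helper are hypothetical names for illustration, not part of Pagefind.

```typescript
// Simplified shape of a loaded search result. Only `url` matters here;
// the other field is an illustrative assumption.
interface LoadedResult {
  url: string;
  excerpt: string;
}

// Hypothetical helper: keep only the first result seen for each URL,
// preserving the original ranking order.
function dedupeByUrl<T extends { url: string }>(results: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const result of results) {
    if (!seen.has(result.url)) {
      seen.add(result.url);
      unique.push(result);
    }
  }
  return unique;
}

// Three gallery pages sharing one canonical URL, plus an unrelated page.
const sample: LoadedResult[] = [
  { url: "/photos/a-storm-is-coming/", excerpt: "Maasai Mara gallery page" },
  { url: "/photos/a-storm-is-coming/", excerpt: "Landscapes gallery page" },
  { url: "/photos/a-storm-is-coming/", excerpt: "Elephants gallery page" },
  { url: "/photos/another-photo/", excerpt: "Unrelated page" },
];

const deduped = dedupeByUrl(sample);
```

The catch is that the URL is only known once a result's data has been loaded, so filtering on the client means loading every result before showing any of them. That is one reason deduplicating at indexing time would be preferable.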

nhoizey avatar Mar 07 '24 23:03 nhoizey

Interesting! I think it would be fine for Pagefind to deduplicate these by default based on their url.

What would you expect regarding the content for these? If you tag three pages with the same url, but they have different content, what should be shown in titles and excerpts? (and what should be indexed for search?) 🤔

bglw avatar Mar 11 '24 20:03 bglw

@bglw in my specific case, title and content are the same anyway, which feels right because they share the canonical URL.

The only differences are:

  • the URL (not the canonical one)
  • the breadcrumb (similar to the folder hierarchy in URLs)

So the first item with the URL can be used.

But there might be other use cases where the choice would be different, so maybe this could be a set of values:

  • config key: deduplicate_contents
  • values
    • false (default): no deduplication
    • keep_first_indexed: the easiest
    • keep_earliest: only possible if contents have dates
    • keep_latest: same
    • concatenate: concatenate all content sharing the same URL

There might be other values in the future, which an enumeration easily allows.
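To make the proposed values concrete, here is a rough sketch of how they might behave. Everything in it is hypothetical: `deduplicate_contents` does not exist in Pagefind, and the `IndexedPage` shape, the ISO date strings, and the rule that a dated page wins over an undated one are assumptions made only for illustration.

```typescript
// Hypothetical values for the proposed `deduplicate_contents` option.
type DeduplicateStrategy =
  | false
  | "keep_first_indexed"
  | "keep_earliest"
  | "keep_latest"
  | "concatenate";

// Assumed page shape; `date` is an ISO 8601 string so that plain string
// comparison orders dates correctly.
interface IndexedPage {
  url: string;
  content: string;
  date?: string;
}

function deduplicate(
  pages: IndexedPage[],
  strategy: DeduplicateStrategy
): IndexedPage[] {
  if (strategy === false) return pages.slice(); // no deduplication

  const byUrl = new Map<string, IndexedPage>();
  for (const page of pages) {
    const kept = byUrl.get(page.url);
    if (kept === undefined) {
      byUrl.set(page.url, { ...page }); // first page seen for this URL
      continue;
    }
    switch (strategy) {
      case "keep_first_indexed":
        break; // the first page stays
      case "keep_earliest":
        if (page.date !== undefined && (kept.date === undefined || page.date < kept.date)) {
          byUrl.set(page.url, { ...page });
        }
        break;
      case "keep_latest":
        if (page.date !== undefined && (kept.date === undefined || page.date > kept.date)) {
          byUrl.set(page.url, { ...page });
        }
        break;
      case "concatenate":
        kept.content = `${kept.content}\n${page.content}`;
        break;
    }
  }
  return Array.from(byUrl.values());
}
```

With an enumeration like this, adding another strategy later only means adding one more case.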

nhoizey avatar Mar 11 '24 21:03 nhoizey

Came here searching for this. Unless I'm missing something, the search doesn't seem usable without deduplication. Even if it's possible for the same article to show up once for each tag, it then also shows up a further 3 times under 'tags'? 🤔

tbh I would just expect a list of de-duped matching articles with the tags added as labels on them.

image

brokenalarms avatar Jul 07 '24 12:07 brokenalarms

@brokenalarms your screenshot is a different case, as those results aren't direct duplicates, it's just finding a match for your search term in the text of the page that lists everything from a given tag — Pagefind doesn't know that the text happens to point to a different indexed page.

The fix there is to use the data-pagefind-body attribute to control which pages get indexed (documentation link). Once that attribute is used anywhere on the site, pages without it are left out of the index, so by placing it only on your articles, the tag listing pages won't be included.

bglw avatar Jul 08 '24 21:07 bglw