pattern-directory icon indicating copy to clipboard operation
pattern-directory copied to clipboard

Remove duplicates from pattern results

Open iandunn opened this issue 4 years ago • 5 comments

When merged, #307 will change the results to include both the site/user locale, and en_US as a fallback. That has a side effect for en_* locales, where e.g., the en_AU and en_US versions of About me cards will both appear.

https://en-au.wordpress.org/patterns/search/about/

We can probably de-dupe via post_parent.

This might intersect w/ #28 because of integrating ElasticSearch.

iandunn avatar Jul 20 '21 20:07 iandunn

For a browsing example, https://api.wordpress.org/patterns/1.0/?pattern-keywords=11&wp-version=5.8&locale=en_GB&per_page=100 shows two Heading patterns: ID 184 for en_US, and 1512 for en_GB.

iandunn avatar Jul 20 '21 20:07 iandunn

It seems like this should also de-dupe en_US patterns if a translated version exists. i.e., we should remove Three columns of text from https://api.wordpress.org/patterns/1.0/?pattern-keywords=11&wp-version=5.8&locale=es_MX&per_page=100, because it has 'Tres columnas de texto'.

I think doing that would match user expectations, remove clutter, and avoid offending folks who are sensitive to English-centricism.

@dd32 said,

This also has the benefit that searching an English phrase would result in matching an english pattern, and not just returning an empty data set if it's not included in the locale translated patterns.

...but I'm guessing he didn't mean that we should keep en_US patterns if there is a translated version.

iandunn avatar Jul 20 '21 22:07 iandunn

@dd32 said,

This also has the benefit that searching an English phrase would result in matching an english pattern, and not just returning an empty data set if it's not included in the locale translated patterns.

...but I'm guessing he didn't mean that we should keep en_US patterns if there is a translated version.

Correct-ish. Part of the limitation of using WordPress search means that we kind of have to keep the en_US variant even if a translated variant exists. Ideally however, search would be search both/all locales and only return the most "relevant" of each pattern. When viewing an archive however, I think it should indeed deduplicate it, although that's not as easy to do with the current schema (as it can't be done at the SQL level I don't think)

As an example of a situation where one might want both returned; A company based in Switzerland is creating a Pricing page the page creator prefers German, they search for Preisgestaltung (Google translate 'pricing' in german) and get a german pricing translation. Next they go to create the French variant of the pricing page, maybe they search for Pricing or maybe they search for Preisgestaltung, but perhaps they search for Prix (Google translate 'pricing' in italian) instead so that they get a french-centric pricing page.

IMHO that suggests that perhaps we should be searching all locales, but archives should definitely be limited to the locale fallback (which presently is just $locale => US English, although maybe Australian English => British English => US English makes more sense; likewise, es_MX => es_* => en_US) and only returning a singular entry.

dd32 avatar Jul 21 '21 01:07 dd32

This modified SQL does result in de-duplicated search:

- SELECT SQL_CALC_FOUND_ROWS wp_posts.ID
+ SELECT SQL_CALC_FOUND_ROWS MAX( wp_posts.ID )
FROM wp_posts
INNER JOIN wp_postmeta ON ( wp_posts.id = wp_postmeta.post_id )
WHERE  1 = 1
       AND (( ( wp_posts.post_title LIKE '%about%' ) OR ( wp_posts.post_excerpt LIKE '%about%' ) OR ( wp_posts.post_content LIKE '%about%' ) ))
       AND (( wp_postmeta.meta_key = 'wpop_locale' AND wp_postmeta.meta_value IN ( 'en_AU', 'en_US' ) ))
       AND wp_posts.post_type = 'wporg-pattern'
       AND wp_posts.post_status = 'publish'

- GROUP  BY wp_posts.id
+ GROUP  BY if ( wp_posts.post_parent > 0, wp_posts.post_parent, wp_posts.ID )

ORDER  BY FIELD(wp_postmeta.meta_value, 'en_US', 'en_AU') DESC,
          wp_posts.post_title LIKE '%about%' DESC,
          wp_posts.post_date DESC
LIMIT  0, 10 

Results before/after:

about-me-cards-en_au
- about-me-cards
button-links-with-image-background
large-header-with-text-and-a-button
three-buttons

Didn't quite work the same for a category index, in this case the Australian english header category, but it's not a bad result

- SELECT SQL_CALC_FOUND_ROWS wp_posts.ID
+ SELECT SQL_CALC_FOUND_ROWS MAX( wp_posts.ID )
FROM wp_posts
LEFT JOIN wp_term_relationships ON ( wp_posts.id = wp_term_relationships.object_id )
INNER JOIN wp_postmeta ON ( wp_posts.id = wp_postmeta.post_id )

WHERE  1 = 1
       AND ( wp_term_relationships.term_taxonomy_id IN ( 5 ) )
       AND (( wp_postmeta.meta_key = 'wpop_locale' AND wp_postmeta.meta_value IN ( 'en_AU', 'en_US' ) ))
       AND wp_posts.post_type = 'wporg-pattern'
       AND (( wp_posts.post_status = 'publish' ))

- GROUP  BY wp_posts.ID
+ GROUP  BY if ( wp_posts.post_parent, wp_posts.post_parent, wp_posts.ID)
ORDER  BY Field(wp_postmeta.meta_value, 'en_US', 'en_AU') DESC,
          wp_posts.post_date DESC
LIMIT  0, 100 

Results:

- contact-and-social-links-en_au
- large-header-with-a-heading-en_au
large-background-image-with-title-and-description
cover-with-title-and-audio
section-with-navigation
text-column-over-fixed-background
image-with-angled-overlay-and-text
poster-layout
triptych-with-two-images-and-a-text-area
event
- contact-and-social-links
+ contact-and-social-links-en_au
title-area-with-image-and-heading
header-with-halftone-pattern
header-with-hatched-pattern
header-area-with-geometric-pattern
full-width-image-of-space-with-big-headline
media-text-in-a-full-height-container
media-text-with-image-on-the-right
media-text-with-image-on-the-left
large-header-with-text-and-a-button
large-header-with-left-aligned-text
- large-header-with-a-heading
+ large-header-with-a-heading-en_au
large-header-with-a-heading-and-a-button

Explanation of the SQL change:

  • It's assumed that a translation will always be created after the original, so the translated post_id should be higher than that of the parent. So using MAX( ID ) works to find that.
  • By grouping conditionally, it can group post id 3 (parent: 1) and post_id 1 together, resulting in only post id 3 showing up in the results.

I'm not sure why it's affecting the archive in a different way than the search results, I'm guessing the query change is causing the ORDER BY FIELD() to be ignored, as it's actually internally returning the en_US posts rather than the $locale ones, which explains the need for the MAX() in the first place.

I think the only way to do it "right" in SQL is a join against itself.. I'm not sure how to achieve it in Elastic Search either.. deduplicating in PHP might be a better option, especially as most queries seem to be unlimited in result size right now

dd32 avatar Jul 21 '21 02:07 dd32

I've switched focus to some higher priority things for a bit, so unassigning.

I may come back to this when I get back to #28, but feel free if anyone wants to move it forward in the meantime.

iandunn avatar Jul 26 '21 16:07 iandunn