YoastSEO.js

Overview Issue: Refactor keyword-based assessments to accommodate morphology

Open nataliashitova opened this issue 7 years ago • 3 comments

Update: This issue description was updated following the feedback of Omar and Joost.

Current versions of keyword-based assessments rely on exact matches between the keyword and the text. However, with morphological support, exact matches make less sense. Below I outline my suggestions for adjusting the existing assessments.

For languages that have morphological support (for now, English), by word I understand the exact word from the keyphrase and all its forms generated internally. For languages without morphological support, word is understood as exact match with the keyphrase.

By function words I understand prepositions (e.g., for), articles (e.g., the), auxiliaries (e.g., were) and words of diminished or absent semantics (e.g., thing), i.e., all words that currently are listed under function words for prominent words analysis. By content words I understand all words which are not function words. E.g., content words in "The boy has eaten an apple" would be "boy", "eaten", "apple".
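The distinction between function words and content words can be illustrated with a minimal sketch. The `FUNCTION_WORDS` list and the helper names below are hypothetical and deliberately tiny; the real lists used for prominent words analysis are language-specific and much longer.

```javascript
// Hypothetical, heavily shortened function-word list (assumption for illustration only).
const FUNCTION_WORDS = new Set(["the", "a", "an", "for", "has", "have", "was", "were", "thing"]);

// Naive tokenizer: lowercase, split on anything that is not a letter, digit or apostrophe.
function getWords(text) {
  return text.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);
}

// Content words are all words that are not function words.
function getContentWords(text) {
  return getWords(text).filter((word) => !FUNCTION_WORDS.has(word));
}

console.log(getContentWords("The boy has eaten an apple"));
// [ 'boy', 'eaten', 'apple' ]
```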

Group 1: one-word matches

TextImages

Alt-attributes of images should have keyword. Current: If < 5 images, GOOD if at least 1 alt-tag with the keyword. If 5 images, GOOD if 2-4 alt-tags with the keyword. If >5 images, GOOD if the number of alt-tags with keyword is within 30-75% range. BAD if there are no images. OK otherwise. Proposal: Consider any content word in the keyphrase, same otherwise.
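The current thresholds can be written down as a small function. This is only a sketch: `scoreTextImages` is a hypothetical name, and treating the range boundaries as inclusive is an assumption. Under the proposal, `altsWithKeyword` would count alt-attributes that contain any content word of the keyphrase.

```javascript
// Sketch of the current TextImages thresholds (function name is hypothetical).
// imageCount: number of images in the text.
// altsWithKeyword: number of alt-attributes that match the keyword.
function scoreTextImages(imageCount, altsWithKeyword) {
  if (imageCount === 0) return "BAD";
  if (imageCount < 5) return altsWithKeyword >= 1 ? "GOOD" : "OK";
  if (imageCount === 5) {
    return altsWithKeyword >= 2 && altsWithKeyword <= 4 ? "GOOD" : "OK";
  }
  // More than 5 images: the share of matching alt-attributes must be 30-75%
  // (inclusive boundaries are an assumption).
  const share = altsWithKeyword / imageCount;
  return share >= 0.3 && share <= 0.75 ? "GOOD" : "OK";
}
```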

Group 2: some- and all-word matches

TitleKeyword

The title should reflect the topic of the copy. Current: GOOD if the keyword is in the beginning of the title, OK if it is in the title, but not in the beginning, BAD otherwise. Proposal: GOOD if an exact match of the keyword is found in the beginning of the title, OK if all content words from the keyphrase are in the title, BAD otherwise.
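A minimal sketch of the proposed title check follows. The function name and the `contentWordForms` structure (one array of word forms per content word of the keyphrase) are assumptions made for illustration, not the actual research output.

```javascript
// Sketch of the proposed TitleKeyword logic (names and data shape are hypothetical).
// contentWordForms: one array of forms per content word, e.g. [["book", "books"], ["boy", "boys"]].
function scoreTitleKeyword(title, keyphrase, contentWordForms) {
  const normalizedTitle = title.toLowerCase().trim();

  // GOOD: exact match of the keyphrase at the beginning of the title.
  // Note: a plain prefix check does not verify word boundaries.
  if (normalizedTitle.startsWith(keyphrase.toLowerCase().trim())) return "GOOD";

  // OK: every content word is present in the title in at least one of its forms.
  const titleWords = new Set(normalizedTitle.split(/[^a-z0-9']+/).filter(Boolean));
  const allFound = contentWordForms.every((forms) => forms.some((form) => titleWords.has(form)));
  return allFound ? "OK" : "BAD";
}
```

For example, with the keyphrase "books for boys", the title "Books for boys: a guide" would be rated GOOD, while "A guide to a book every boy loves" would be rated OK.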

IntroductionHasKeyword

The topic of the copy should be clear immediately. Current: GOOD if the keyword is in the first paragraph of the copy, BAD otherwise. Proposal: GOOD if all content words from the keyphrase are matched within one sentence in the introduction, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.
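The proposed three-way outcome could be sketched as below. The helper names, the naive sentence splitter and the `contentWordForms` structure (one array of forms per content word) are assumptions for illustration only.

```javascript
// Lowercased set of the words in a piece of text (naive tokenizer).
function wordsOf(text) {
  return new Set(text.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean));
}

// True if every content word occurs in the set in at least one of its forms.
function containsAll(wordSet, contentWordForms) {
  return contentWordForms.every((forms) => forms.some((form) => wordSet.has(form)));
}

// Sketch of the proposed IntroductionHasKeyword logic (hypothetical names).
function scoreIntroduction(firstParagraph, contentWordForms) {
  // Naive sentence split on ., ! and ? (assumption; real sentence detection is more involved).
  const sentences = firstParagraph.split(/[.!?]+/).filter((s) => s.trim() !== "");

  if (sentences.some((s) => containsAll(wordsOf(s), contentWordForms))) return "GOOD";
  if (containsAll(wordsOf(firstParagraph), contentWordForms)) return "OK";
  return "BAD";
}
```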

MetaDescriptionKeyword

The keyword should be in the metadescription, but not too much. Current: GOOD if the keyword occurs once or twice, BAD otherwise. Proposal: Same as IntroductionHasKeyword, but count the number of matches to be 1 or 2.

SubheadingsKeyword

The topic should be clear from subheadings, but overuse is penalized. Current: GOOD if the keyword is in 30-75% subheadings, BAD otherwise. Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.
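A sketch of the proposed subheading rule, under assumptions: the function names and the `contentWordForms` structure are hypothetical, the 30-75% boundaries are treated as inclusive, and a text without subheadings (which the proposal does not cover) is rated BAD here purely to keep the function total.

```javascript
// A subheading reflects the topic if more than half of the content words occur in it.
function reflectsTopic(subheading, contentWordForms) {
  const words = new Set(subheading.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean));
  const matched = contentWordForms.filter((forms) => forms.some((form) => words.has(form))).length;
  return matched > contentWordForms.length / 2;
}

// Sketch of the proposed SubheadingsKeyword logic (hypothetical names).
function scoreSubheadings(subheadings, contentWordForms) {
  if (subheadings.length === 0) return "BAD"; // Assumption: not specified in the proposal.
  const reflecting = subheadings.filter((s) => reflectsTopic(s, contentWordForms)).length;
  const share = reflecting / subheadings.length;
  return share >= 0.3 && share <= 0.75 ? "GOOD" : "BAD";
}
```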

UrlKeyword

The URL should reflect the topic of the copy. Current: GOOD if the keyword is in the URL, OK otherwise. Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.
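The two cases of the URL proposal could be sketched like this. The function name, the slug-based input and the `contentWordForms` structure are assumptions; a keyphrase without content words is not covered by the proposal and is rated OK here only as a placeholder.

```javascript
// Sketch of the proposed UrlKeyword logic (hypothetical names).
// slug: the part of the URL to check, e.g. "books-for-boys".
// contentWordForms: one array of forms per content word of the keyphrase.
function scoreUrlKeyword(slug, contentWordForms) {
  const total = contentWordForms.length;
  if (total === 0) return "OK"; // Assumption: case not specified in the proposal.

  const slugWords = new Set(slug.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const matched = contentWordForms.filter((forms) => forms.some((form) => slugWords.has(form))).length;

  // 1 or 2 content words: all must be present. More than 2: more than half must be present.
  const isGood = total <= 2 ? matched === total : matched > total / 2;
  return isGood ? "GOOD" : "OK";
}
```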

PreviouslyUsedKeyword

More than two articles should not try to rank for the same keyword. Current: GOOD if the keyword was never used, OK if it was used once, BAD otherwise. Proposal: Two ways. First, keep the current implementation with exact matches. Second, match based on base forms of every content word in the keyphrase.

Group 3: Density and distribution

KeywordDensity and TopicDensity

The keyphrase should occur in the text often enough, but not too much. Current: GOOD if the keyphrase constitutes 0.5-3% of words (for 1-word keyphrase without synonyms) or 1.5-4% (for 1-word keyphrase with synonyms), BAD otherwise. The percentages are normalized by the length of the keyphrase. Proposal: A match is when all words from the keyphrase are found within one sentence. If all words from the keyphrase are matched multiple times, these are considered as multiple matches and are fed into keywordDensity accordingly. Examples:

  • Text: "A boy was eating an apple and reading a book." Keyphrase: "books for boys". Matches found: 1.
  • Text: "A boy was eating an apple. He was reading a book." Keyphrase: "books for boys". Matches found: 0.
  • Text: "A boy was reading a book, which was a book for boys" Keyphrase: "books for boys". Matches found: 2.
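The matching rule can be sketched as a per-sentence count. The examples suggest that the function word "for" is ignored, so this sketch assumes that only content words are matched. The function name, the naive sentence splitter and the `contentWordForms` structure (one array of forms per content word) are hypothetical.

```javascript
// Counts keyphrase matches: within each sentence, the number of matches is the
// number of times ALL content words can be found, i.e. the minimum occurrence
// count over the content words (hypothetical helper, naive tokenization).
function countKeyphraseMatches(text, contentWordForms) {
  if (contentWordForms.length === 0) return 0;

  const sentences = text.split(/[.!?]+/).filter((s) => s.trim() !== "");
  let total = 0;
  for (const sentence of sentences) {
    const words = sentence.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);
    const counts = contentWordForms.map(
      (forms) => words.filter((word) => forms.includes(word)).length
    );
    total += Math.min(...counts);
  }
  return total;
}

const forms = [["book", "books"], ["boy", "boys"]]; // keyphrase "books for boys"
console.log(countKeyphraseMatches("A boy was eating an apple and reading a book.", forms)); // 1
console.log(countKeyphraseMatches("A boy was eating an apple. He was reading a book.", forms)); // 0
console.log(countKeyphraseMatches("A boy was reading a book, which was a book for boys", forms)); // 2
```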

KeywordDistribution

The keyword should be evenly distributed over the copy. Current: GOOD if the minimal distance between keyword occurrences is <40% of the text length, OK if it is between 40-50%, BAD if >50%. Proposal:

(1) For every sentence: if keywordLength < 4, GOOD if all content words from the keyphrase are in the sentence, OK if some but not all are, BAD if none; if keywordLength >= 4, GOOD if all content words from the keyphrase are in the sentence, or at least 3 content words from the keyphrase are in the sentence and the rest are in the neighbour sentences, OK if only some content words from the keyphrase are found in the sentence but not all, BAD if none.

(2) Step function: start with the first third of the text (based on the total number of sentences) and calculate an average score over all sentences in this set. Move down by one sentence, calculate an average score again. Continue until the end of the text is reached.

(3) For every step calculate an eventual punishment if not all content words from the keyphrase were used in the step. The punishment is either 0 (if all content words were used) or 0.5 (if not all content words were used). The punishments are then averaged over all steps.

(4) Compute a Gini coefficient over all steps. The Gini coefficient shows how uniform the distribution is, ranging from 0 for a perfectly uniform distribution to 1 for a highly non-uniform distribution. Add the averaged punishment (which is a value from 0 to 0.5). Calculate the scores: GOOD if the final score <= 0.4, OK if between 0.4 and 0.6, BAD otherwise.
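Steps (2) to (4) could be sketched as follows. This is an illustration under several assumptions: the per-sentence scores from step (1) are passed in as numbers (the proposal does not say which numeric values GOOD, OK and BAD map to), `sentenceMatches` lists the indices of the content words found in each sentence, the window size is the sentence count divided by three and rounded up, and an all-zero input to the Gini coefficient is treated as uniform. All names are hypothetical.

```javascript
// Gini coefficient of non-negative values: 0 means perfectly uniform.
function giniCoefficient(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  if (n === 0 || sum === 0) return 0; // Assumption: treat empty or all-zero input as uniform.
  const weighted = sorted.reduce((acc, v, i) => acc + (i + 1) * v, 0);
  return (2 * weighted) / (n * sum) - (n + 1) / n;
}

// sentenceScores: numeric score per sentence from step (1).
// sentenceMatches: per sentence, indices of the content words found in it.
// contentWordCount: number of content words in the keyphrase.
function scoreDistribution(sentenceScores, sentenceMatches, contentWordCount) {
  const n = sentenceScores.length;
  const windowSize = Math.max(1, Math.ceil(n / 3));
  const stepAverages = [];
  let punishmentSum = 0;

  // Step (2): slide a window of one third of the text down by one sentence at a time.
  for (let start = 0; start + windowSize <= n; start++) {
    const scores = sentenceScores.slice(start, start + windowSize);
    stepAverages.push(scores.reduce((acc, v) => acc + v, 0) / windowSize);

    // Step (3): punish a step in which not all content words were used.
    const used = new Set(sentenceMatches.slice(start, start + windowSize).flat());
    punishmentSum += used.size === contentWordCount ? 0 : 0.5;
  }

  // Step (4): Gini coefficient over the steps plus the averaged punishment.
  const averagedPunishment = stepAverages.length > 0 ? punishmentSum / stepAverages.length : 0.5;
  const finalScore = giniCoefficient(stepAverages) + averagedPunishment;
  const rating = finalScore <= 0.4 ? "GOOD" : finalScore <= 0.6 ? "OK" : "BAD";
  return { finalScore, rating };
}
```

Passing the step (1) scores in as plain numbers keeps the sketch independent of how GOOD, OK and BAD are eventually weighted.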

Group 4: Other

KeyphraseLength

The keyphrase should be present and should not be too long. Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise. Proposal: Same but count only content words.
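The proposal keeps the current thresholds but applies them to content words only, which could look like the sketch below. The `FUNCTION_WORDS` list is a hypothetical, heavily shortened stand-in for the real language-specific list, and the function names are illustrative.

```javascript
// Hypothetical, heavily shortened function-word list (assumption for illustration only).
const FUNCTION_WORDS = new Set(["the", "a", "an", "for", "of", "to", "in", "and"]);

// Number of words in the keyphrase that are not function words.
function countContentWords(keyphrase) {
  return keyphrase
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter((word) => word !== "" && !FUNCTION_WORDS.has(word)).length;
}

// Sketch of the proposed KeyphraseLength logic: same thresholds, content words only.
function scoreKeyphraseLength(keyphrase) {
  const length = countContentWords(keyphrase);
  if (length >= 1 && length <= 4) return "GOOD";
  if (length >= 5 && length <= 8) return "OK";
  return "BAD";
}
```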

PLAN

Import refactored assessments from feature/recalibration

  • [x] Migrate SEO assessments from feature/recalibration. Part 1. #1591
  • [x] Migrate SEO assessments from feature/recalibration. Part 2. #1592
  • [x] Update SEOAssessorSpec (including cornerstone) to follow the new format #1599
  • [x] Change taxonomy assessor in wordpress-seo repository (calls to classes, new file names) https://github.com/Yoast/wordpress-seo/issues/10400

Implement morphological researchers

  • [x] Implement research that generates Keyword+Synonyms structure including morphology #1587
  • [x] Implement research that checks how many keyphrase words are present within the string. #1634
  • [x] Implement research that calls the previous research only for keyphrase or for keyphrase and synonyms. #1635

Implement Premium morphology interface

  • [x] Implement morphological data imports from the server #1641
  • [x] Memoize the keyphrase/synonymsForms structure to speed up assessments #1750
  • [x] Remove regex and exception files which were moved to a single JSON #1751
  • [x] Transition the regex/exception JSON file to Yoast/YoastSEO.js-premium-configuration #1758
  • [x] Create authenticated downloads in MyYoast. https://github.com/Yoast/my-yoast/issues/1918
  • [x] Pass YoastSEO.js premium config to YoastSEO.js https://github.com/Yoast/YoastSEO.js/issues/1809
  • Make sure that Free doesn't have access to the morphology functionality

Refactor assessments to implement morphological support

  • [x] Refactor IntroductionHasKeyword assessment #1644
  • [x] Refactor TitleKeyword assessment #1638
  • [x] Refactor KeyphraseLength assessment #1753
  • [x] Refactor UrlKeyword assessment (also note #1608) #1650
  • [x] Refactor Keyword density assessment #1785
  • [x] Refactor TextImages assessment #1642
  • [x] Refactor TopicDistribution assessment #1755
  • [x] Refactor MetaDescriptionKeyword assessment #1754
  • [x] Refactor SubheadingsKeyword assessment #1756
  • [x] Refactor TextCompetingLinks assessment to include morphology #1825

Final checklist for 9.0

  • [x] All assessments return new feedback strings
  • [x] The morphology data is removed from YoastSEO.js
  • [x] The license is decided upon https://github.com/Yoast/YoastSEO.js-premium-configuration/issues/3
  • [x] The issues from Morpho-Syno milestone are all merged

Stretch goals

  • [ ] Refactor PreviouslyUsedKeyword assessment #1752
  • [ ] Remove topicCount if it's not needed anymore
  • [ ] Guess base form of every word and list it as the first one in the array of forms. Merge topic words if they have the same base form (only needed if someone is using a content word in the keyphrase twice)

For an overview of how morphology works and how to adjust existing assessments to include morphology, consult this wiki article.

nataliashitova avatar Jun 25 '18 09:06 nataliashitova

@jdevalk @omarreiss Could you confirm that is the way to go?

nataliashitova avatar Jul 04 '18 09:07 nataliashitova

I agree with a lot. I see a couple of things @jdevalk needs to decide about.

1. Title

Proposal: Good if an exact match of the keyword is found, OK if all content words from the keyphrase are in the title, BAD otherwise.

In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?

2. First paragraph

If the keyphrase has 4 or more content words: GOOD if 3 content words from the keyphrase are matched within one sentence in the introduction, while the rest are found in the neighbour sentences, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.

I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.

3. Subheadings

Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.

I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.

4. Urls

Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.

This needs an SEO's perspective.

5. Keyphrase length

Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise. Proposal: Same but count only content words.

This would make it much less strict. From a content words perspective the current calibration is probably already Good: 1-3, OK: 4-6, BAD: otherwise.

omarreiss avatar Jul 06 '18 14:07 omarreiss

  1. Title

> Proposal: Good if an exact match of the keyword is found, OK if all content words from the keyphrase are in the title, BAD otherwise.
>
> In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?

No. Beginning of title matters.

  2. First paragraph

> If the keyphrase has 4 or more content words: GOOD if 3 content words from the keyphrase are matched within one sentence in the introduction, while the rest are found in the neighbour sentences, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.
>
> I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.

Agree with @omarreiss

  3. Subheadings

> Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.
>
> I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.

That's undoable for larger keyphrases. I think I'm fine with @nataliashitova's suggestion but this needs to be tested on real copy.

  4. Urls

> Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.
>
> This needs an SEO's perspective.

I'm fine with this.

  5. Keyphrase length

> Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise. Proposal: Same but count only content words.
>
> This would make it much less strict. From a content words perspective the current calibration is probably already Good: 1-3, OK: 4-6, BAD: otherwise.

I'm fine with less strict for this. So let's go with @nataliashitova's suggestion.

jdevalk avatar Jul 17 '18 08:07 jdevalk